mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

Spatio-temporal clustering code #131

Closed larsoner closed 12 years ago

larsoner commented 12 years ago

Now that I'm getting the hang of mne-python's codebase, I'd like to port the non-parametric spatio-temporal clustering code I've written in MATLAB/C (mex files) over. It's going to be a huge pain, but probably worth it. At least that's what KC tells me...

Anyway, I want to discuss how to implement it here so that I can hopefully construct it correctly from the start.

Let me start with what I have in MATLAB.

The code takes data in an nsrc x ntimes x nsubjects x 2 array and computes a non-parametric permutation test over the 2 conditions using a paired, 2-tailed t-test. In each permutation, a 2-tailed t-test is performed and thresholded at p < 0.05 to obtain a set of putatively significant indices (from the nsrc x ntimes possible locations). The heart of the code, the spatio-temporal clustering, takes these indices and a cell array of "neighbors" and returns the N clusters that result. To get the "neighbors", I read in a source space after running "mne_add_patch_info --dist 7 --src %s --srcp %s" to add vertex distances (usually on fsaverage).
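For concreteness, here is a minimal sketch of that permutation scheme in Python (numpy/scipy); `find_clusters` is a hypothetical stand-in for the spatio-temporal clustering step, and the names and shapes are illustrative rather than the actual MATLAB or mne-python API:

```python
import numpy as np
from scipy import stats

def permutation_cluster_stat(X, neighbors, n_permutations=1000, alpha=0.05, seed=0):
    """X: (n_src * n_times, n_subjects) array of paired condition differences."""
    rng = np.random.RandomState(seed)
    n_obs, n_subjects = X.shape
    t_thresh = stats.t.ppf(1 - alpha / 2.0, n_subjects - 1)  # two-tailed threshold

    def max_cluster_size(signs):
        # paired, 2-tailed t-test, done as a one-sample test on signed differences
        t_vals = stats.ttest_1samp((X * signs).T, 0.0)[0]
        mask = np.abs(t_vals) > t_thresh  # putatively significant indices
        clusters = find_clusters(mask, neighbors)  # hypothetical clustering step
        return max((len(c) for c in clusters), default=0)

    observed = max_cluster_size(np.ones(n_subjects))
    null = [max_cluster_size(rng.choice([-1, 1], n_subjects))
            for _ in range(n_permutations)]
    p_value = np.mean(np.array(null) >= observed)
    return observed, p_value
```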

So the question is, how should I structure the functions in python?

There are (at least) 3 that are essential:

1. A nonparametric test function
2. A spatio-temporal clustering function
3. A source-space neighbor loading/calculation function

What should I call these functions, what arguments should they accept, what .py files should I stick them in?

Incidentally, calculating the geodesic neighbor distances can take some time, so I think we should offer people the ability to save those.
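Something as small as this caching wrapper would cover that; `compute_src_neighbors` is a hypothetical stand-in for the slow geodesic-distance step:

```python
import os
import numpy as np

def get_neighbors(src_fname, dist, cache_fname):
    """Load cached neighbor lists if available, else compute and save them."""
    if os.path.exists(cache_fname):  # cache_fname should end in .npy
        return np.load(cache_fname, allow_pickle=True)
    neighbors = compute_src_neighbors(src_fname, dist)  # hypothetical, slow
    np.save(cache_fname, np.asarray(neighbors, dtype=object), allow_pickle=True)
    return neighbors
```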

larsoner commented 12 years ago

I should also mention that I've done a bunch of work in MATLAB to optimize these functions for speed. I'm going to do the same thing in Python. On my 6-core machine, sometimes it takes an hour to perform the full permutation test, and that's with full parallelization.

larsoner commented 12 years ago

Looks like a lot of the components I'll need are already in the code (hooray!), perhaps just by making use of permutation_cluster_test (I think). I'll keep looking into it to see if it will do what I think it does...

agramfort commented 12 years ago

yes there is already code for this. Ellen has used my script for one of her studies:

https://github.com/KuperbergLab/MEG_scripts/blob/master/source_compute_cluster_stats.py

it's still a bit of alpha code, but it works with fsaverage and maybe any ico surface.

larsoner commented 12 years ago

Excellent. I'll compare it to my code to see if there are any differences in the output.

larsoner commented 12 years ago

So far I noticed one thing was different. Instead of defining neighbors directly from the mesh, I used a parameter to define "neighbors" based on their being within some maximal geodesic distance of one another. Is there code for that written already? I basically pulled the information from the source space computed with "mne_add_patch_info --dist 7", but it would be excellent if we could keep it so you only needed to use Python.

Along those lines, I think it could make sense to extend the spatio_temporal_src_connectivity function to take an optional third parameter giving the maximum distance between vertices for them to be called "neighbors". Leaving the third parameter at the default None would use the mesh (the current behavior), while setting it would allow the user to use geodesic distances (what I had used). What do you think?
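As a sketch of that third parameter, assuming the source space read back after mne_add_patch_info carries a sparse inter-vertex distance matrix per hemisphere; `mesh_connectivity` is a hypothetical stand-in for the current mesh-based behavior:

```python
import numpy as np
from scipy import sparse

def spatial_src_connectivity(src, dist=None):
    """Spatial adjacency from the mesh (dist=None) or from geodesic distances."""
    if dist is None:
        return mesh_connectivity(src)  # hypothetical: the current behavior
    blocks = []
    for hemi in src:
        d = hemi['dist'].tocsr().copy()  # sparse geodesic distances (from --dist)
        d.data = (d.data <= dist).astype(np.int8)  # mark neighbors within `dist`
        d.eliminate_zeros()
        blocks.append(d)
    return sparse.block_diag(blocks).tocoo()  # hemispheres stay unconnected
```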

larsoner commented 12 years ago

I also think it probably makes sense to make a function that just takes the data and various parameters to make spatio-temporal clustering more accessible.

something like:

```python
connectivity = spatio_temporal_src_connectivity(n_times, src=None, dist=None)
# src=None would default to using fsaverage
spatio_temporal_clustering(X, connectivity=connectivity, stat_fun=None, tail=0,
                           threshold=0.05, return_threshold=None, seed=0,
                           n_permutations=None, do_permutations=True, stat_power=0)
```

There are probably other parameters, but let me know what you think of this general idea.

larsoner commented 12 years ago

Examining the permutation test code, it looks like it performs a random / Monte Carlo resampling (exchanging conditions) as opposed to a permutation test, which goes through all 2^n_subjects possible permutations. Is that correct? If so, I'd be up for modifying the code such that it does a permutation test (doing /all/ combinations of exchanges) when n_perms >= 2^(n_subjects). This will achieve an exact test when it's possible, which is nice.
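The branch could look something like this sketch (names hypothetical):

```python
from itertools import product
import numpy as np

def make_sign_flips(n_subjects, n_permutations, seed=0):
    """Enumerate all 2**n_subjects sign flips when affordable (an exact test),
    otherwise fall back to random / Monte Carlo resampling."""
    if n_permutations >= 2 ** n_subjects:
        return np.array(list(product([1, -1], repeat=n_subjects)))
    rng = np.random.RandomState(seed)
    return rng.choice([1, -1], size=(n_permutations, n_subjects))
```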

larsoner commented 12 years ago

One other thing I noticed is that in the _find_clusters code, for the two-tailed tests (which would be the only ones to have this be a concern), it does not consider whether the vertices have the same sign or not when adding them to a cluster, and it seems to me that it should. In other words, if I'm computing condition A-B, vertices for which A>B should not be clustered with those where B>A. Does this make sense? If so, I can try to modify the code in this way. If not, perhaps we can add an option?
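In code, the two-tailed case could cluster each sign separately, reusing the hypothetical `find_clusters` helper from the sketch above:

```python
def find_signed_clusters(t_vals, t_thresh, neighbors):
    """Cluster positive and negative exceedances separately (two-tailed case)."""
    pos = find_clusters(t_vals > t_thresh, neighbors)   # A > B clusters
    neg = find_clusters(t_vals < -t_thresh, neighbors)  # B > A clusters
    return pos + neg  # opposite-sign vertices can never share a cluster
```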

larsoner commented 12 years ago

When I coded my version of clustering, I noticed that since comparisons are against effectively the abs(max()) from each permutation (e.g., lines 127-128 in cluster_level.py), once you've done one permutation---say, 00101001 for flipping conditions for 8 subjects---you've effectively also done 11010110, so you can get away with doing half the tests, since the opposite permutation will yield the same maximal statistic. If you agree, I'm happy to re-code it to take this into account, especially if I'm also re-coding it to do a full permutation test when possible (two comments above).

EDIT: I meant to say that this only applies for two-tailed tests...
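A small illustration of the savings (two-tailed case only):

```python
from itertools import product
import numpy as np

# all sign-flip vectors for 8 subjects, then one representative per +/- pair:
# a flip vector and its negation yield the same max |statistic|, so fixing
# the first subject's sign to +1 halves the work (128 tests instead of 256)
flips = np.array(list(product([1, -1], repeat=8)))
flips = flips[flips[:, 0] == 1]
```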

agramfort commented 12 years ago

> connectivity = spatio_temporal_src_connectivity(n_times, src=None, dist=None)
>
> src=None would default to using fsaverage
>
> spatio_temporal_clustering(X, connectivity=connectivity, stat_fun=None, tail=0, threshold=0.05, return_threshold=None, seed=0, n_permutations=None, do_permutations=True, stat_power=0)
>
> X and connectivity are the [subjects x times x vertices] data (e.g., a difference between conditions) and connectivity matrices. connectivity==None defaults to computing the default connectivity (fsaverage and dist=None). stat_fun==None defaults to a t-test (used by permutation_cluster_1samp_test).

ok for defaults, but it could be explicit in the function signature.

> tail and seed are also basically passed on. threshold is for the first (primary) thresholding. return_threshold==None returns all clusters found; return_threshold==0.05 (or whatever) returns only those above a specific threshold. do_permutations==True actually performs a full permutation test and gives you p-values; ==False would just cluster the non-permuted case and return those (useful for making sure the data are set up correctly, etc.). stat_power is the power to raise the abs(statistical_value) of each vertex by to score the clusters. Note that stat_power=0 yields the # of vertices, while stat_power=1 yields the sum of the absolute values of the t values (or whatever the statistic is).
>
> There are probably other parameters, but let me know what you think of this general idea.

sounds reasonable.

FYI, in fieldtrip the randomization code accepts ids for the samples (subjects, in your case) that say which permutations / resamplings are valid, e.g., samples with the same id must stay in the same group. It makes their code really generic.
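Roughly this idea, as a sketch (not fieldtrip's actual API): in a paired design the two conditions of each subject would share an id, so labels may only be exchanged within that pair:

```python
import numpy as np

def permute_within_ids(labels, ids, rng):
    """Shuffle condition labels only among samples that share an id."""
    labels = np.asarray(labels).copy()
    ids = np.asarray(ids)
    for uid in np.unique(ids):
        idx = np.where(ids == uid)[0]
        labels[idx] = labels[rng.permutation(idx)]
    return labels

rng = np.random.RandomState(0)
print(permute_within_ids(['A', 'B', 'A', 'B'], ids=[1, 1, 2, 2], rng=rng))
```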

agramfort commented 12 years ago

> Examining the permutation test code, it looks like it performs a random / Monte Carlo resampling (exchanging conditions) as opposed to a permutation test, which goes through all 2^n_subjects possible permutations. Is that correct? If so, I'd be up for modifying the code such that it does a permutation test (doing /all/ combinations of exchanges) when n_perms >= 2^(n_subjects). This will achieve an exact test when it's possible, which is nice.

you're right.

agramfort commented 12 years ago

> One other thing I noticed is that in the _find_clusters code, for the two-tailed tests (which would be the only ones to have this be a concern), it does not consider whether the vertices have the same sign or not when adding them to a cluster, and it seems to me that it should. In other words, if I'm computing condition A-B, vertices for which A>B should not be clustered with those where B>A. Does this make sense? If so, I can try to modify the code in this way. If not, perhaps we can add an option?

you're right.

> When I coded my version of clustering, I noticed that since comparisons are against effectively the abs(max()) from each permutation (e.g., lines 127-128 in cluster_level.py), once you've done one permutation---say, 00101001 for flipping conditions for 8 subjects---you've effectively also done 11010110, so you can get away with doing half the tests, since the opposite permutation will yield the same maximal statistic. If you agree, I'm happy to re-code it to take this into account, especially if I'm also re-coding it to do a full permutation test when possible (two comments above).

you're right again.

larsoner commented 12 years ago

When I used a dataset that was 20484 vertices x 2 time points, the clustering took 8 seconds. Going to 4 time points took it to 28 seconds, and going up to 8 time points took 97 seconds. In other words, it looks like the time goes up with the square of the number of time points. This makes sense, since the connectivity matrix is structured as (n_times * n_vertices) x (n_times * n_vertices).

This is problematic for me, however, since I'm used to clustering data that is 20484 vertices by 200 (or more) time points. Even with 24 GB of memory, I ran out just doing the initial clustering, to say nothing of how long it would take :(

I got around this problem in MATLAB by exploiting the fact that only immediately adjacent time points (or time points within some specified number of steps) are considered neighbors. This makes the time and memory requirements go up linearly with the number of time points (although they still go up with the square of the number of vertices).

Would it be alright if I implemented an alternative method for clustering when the data are known to be structured as space x time? Looking at the code, I think it can be done in a sensible way.
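A sketch of that structure: with links only between time points within `max_step` of each other, the connectivity is block tri-diagonal and can be assembled from Kronecker products, so the number of stored entries grows linearly with n_times instead of quadratically:

```python
from scipy import sparse

def spatio_temporal_connectivity(spatial_conn, n_times, max_step=1):
    """Sparse space x time connectivity without forming the dense square matrix."""
    n_v = spatial_conn.shape[0]
    offsets = [o for o in range(-max_step, max_step + 1) if o != 0]
    temporal = sparse.diags([1] * len(offsets), offsets, shape=(n_times, n_times))
    # spatial edges within each time point + links between nearby time points
    return (sparse.kron(sparse.eye(n_times), spatial_conn)
            + sparse.kron(temporal, sparse.eye(n_v))).tocoo()
```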

larsoner commented 12 years ago

Closing to move the discussion to the WIP PR.