mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

ENH: stats for ERF decoding and time generalization #1848

Open dengemann opened 9 years ago

dengemann commented 9 years ago

With @kingjr we're currently thinking about the right way to handle cross-trial significance testing for the analyses demonstrated in our decoding examples. This basically overlaps with one of our GSOC topics, and with the discussion in the comments over here. Now practically, permutation testing or bootstrapping seems pretty heavy if it means that you refit every model for each resample; this would render these kinds of analyses intractable. On the flip side, there seems to be a problem in using any test that makes assumptions about the degrees of freedom, since for single-trial probability outputs, for example, the samples are not independent because of cross-validation. It seems that, instead of bootstrapping the entire decoding or incorporating permutation testing into cross-validation, as suggested by Martin Hebart (first comment here), we could just compute a permutation test post-hoc on the predictions. This would be much more tractable.
I was wondering what you think might be the right way to go. Also for the GSOC project it would be nice to have some discussion about this.
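
A minimal sketch of the post-hoc idea, assuming scikit-learn-style estimators (the data and estimator here are illustrative placeholders): fit once per fold, keep the cross-validated predictions fixed, and only permute the labels when recomputing the score.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(42)
X = rng.randn(100, 20)        # placeholder single-trial features
y = rng.randint(0, 2, 100)    # placeholder binary labels

# fit/predict once per fold; the models are never refit afterwards
y_pred = cross_val_predict(LogisticRegression(), X, y, cv=5)
observed = np.mean(y_pred == y)

# build the null by permuting labels against the fixed predictions
n_permutations = 1000
null = np.array([np.mean(y_pred == rng.permutation(y))
                 for _ in range(n_permutations)])
p_value = (np.sum(null >= observed) + 1.0) / (n_permutations + 1.0)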

cc @agramfort @Eric89GXL @banilo

dengemann commented 9 years ago

cc @deep-introspection

dengemann commented 9 years ago

cc @flrgsr

deep-introspection commented 9 years ago

@dengemann I like Hebart's solution but this may be computationally demanding... By the way, to what extent is MNE handling distributed computing?

dengemann commented 9 years ago

By the way, to what extent is MNE handling distributed computing?

DIY style ...

agramfort commented 9 years ago

with joblib

dengemann commented 9 years ago

with joblib

@agramfort I think @deep-introspection is referring to distributed computing across machines, MPI style

deep-introspection commented 9 years ago

@dengemann Indeed, but thanks @agramfort for the joblib tip, sounds pretty cool!

agramfort commented 9 years ago

see clusterlib; it's WIP though

dengemann commented 9 years ago

some more food for thought:

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=21D16E681FF16B0BA91D741A63806A31?doi=10.1.1.29.5194&rep=rep1&type=pdf

http://epub.ub.uni-muenchen.de/4134/1/tr030.pdf

dengemann commented 9 years ago

Salzberg says:

"(It may be tempting, because it is easy to do with powerful computers, to run many cross validations on the same data set, and report each cross validation as a single trial. However this would not produce valid statistics, because the trials in such a design are highly interdependent.)."

On the other hand, this is essentially what Eugster and colleagues propose and what is implemented in the caret resamples function, if I'm not mistaken. It essentially computes confidence intervals across resamples based on a t-statistic. Here the nuances concern using the bootstrap for model comparison instead of e.g. a repeated 10-fold CV.
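
For concreteness, a minimal sketch of such a t-based confidence interval across resample scores (the score values are placeholders; this is not the caret implementation itself):

import numpy as np
from scipy import stats

resample_scores = np.array([0.71, 0.68, 0.74, 0.70, 0.73])  # placeholder per-resample accuracies
mean_score = resample_scores.mean()
ci_low, ci_high = stats.t.interval(0.95, len(resample_scores) - 1,
                                   loc=mean_score, scale=stats.sem(resample_scores))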

(cc @topepo)

dengemann commented 9 years ago

More materials:

http://stats.stackexchange.com/questions/17408/how-to-assess-statistical-significance-of-the-accuracy-of-a-classifier

dengemann commented 9 years ago

cc @cmoutard

choldgraf commented 8 years ago

Seeing if you guys thought up anything on this issue. I'm trying to gauge whether my decoding scores are "significant" or not; as has been mentioned here, doing permutation testing on something like this would take a long time :)

topepo commented 8 years ago

Without knowing too much about the details of the problem, my thoughts are:

choldgraf commented 8 years ago

That sounds reasonable - though how is resampling possible within the current infrastructure for this class? It seems like right now we fit the model, and then have one call to score that will give the generalization matrix. It's not clear to me how resampling could be incorporated into that work-flow as is (maybe you'd have to run "score" for each iteration of the cross-validation loop or something?)

dengemann commented 8 years ago

@choldgraf let's keep the API issue aside for a moment. @topepo thanks a lot for your highly appreciated feedback. But as far as I understand it, and as it seems with your R code, the resampling you suggest makes the assumption that the bootstraps / folds / resamples are independent, but they are not. The degrees of freedom will hence be erroneous. Am I overlooking something?

topepo commented 8 years ago

You are correct about the independence. For example, two bootstrap samples of the first ten integers might be

1, 1, 1, 1, 2, 2, 4, 6, 7, 9
1, 1, 2, 2, 3, 4, 4, 7, 9, 10

so that the holdouts are:

3, 5, 8, 10
5, 6, 8

Since there are common samples in the holdouts, they are not independent. It is possible to use something like a linear mixed model to account for the within-sample correlations between the resampling statistics generated by these holdouts. However, the resample-to-resample variation is typically huge in comparison and accounting for that will deal with the lion's share of the correlation structure.
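
To reproduce the point programmatically, a toy sketch with numpy (random draws, not the exact samples above):

import numpy as np

rng = np.random.RandomState(0)
ids = np.arange(1, 11)
boot_1 = rng.choice(ids, size=ids.size, replace=True)  # first bootstrap sample
boot_2 = rng.choice(ids, size=ids.size, replace=True)  # second bootstrap sample
holdout_1 = np.setdiff1d(ids, boot_1)                   # out-of-bag samples of resample 1
holdout_2 = np.setdiff1d(ids, boot_2)                   # out-of-bag samples of resample 2
shared = np.intersect1d(holdout_1, holdout_2)           # non-empty -> holdouts are not independent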

The simplest thing to do is to compare two models by testing the differences in their performance at the resample level. So if resample holdout #1 has an accuracy of 90% for model A and 95% for model B, evaluate the difference of 5%. Using all of the resample-specific accuracy differences, we can test that their average is zero. That's basically a paired t-test. Ignoring the sample-to-sample correlation reduces the power of the test, but you can probably compensate for that by doing more resamples (if that is a concern).

A model that accounts for both the resample-to-resample and sample-to-sample correlations would be very hard to parameterize and estimate.
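
A minimal sketch of that resample-level comparison (the accuracy vectors are placeholders, assumed to be paired by resample):

import numpy as np
from scipy import stats

acc_model_a = np.array([0.90, 0.85, 0.88, 0.91, 0.87])  # placeholder per-resample accuracies
acc_model_b = np.array([0.95, 0.89, 0.90, 0.94, 0.90])

# paired t-test, i.e. a one-sample t-test on the paired differences
t_stat, p_value = stats.ttest_1samp(acc_model_b - acc_model_a, 0.0)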

dengemann commented 8 years ago

The simplest thing to do is to compare two models by testing the differences in their performance at the resample level. So if resample holdout #1 has an accuracy of 90% for model A and 95% for model B, evaluate the difference of 5%. Using all of the resample-specific accuracy differences, we can test that their average is zero. That's basically a paired t-test. Ignoring the sample-to-sample correlation reduces the power of the test, but you can probably compensate for that by doing more resamples (if that is a concern).

So in other words, we alleviate this problem by slightly modifying the question. A further trick could then be to test e.g. a "serious" model against a dummy classifier (predict the top category, or some other silly pattern). But then I still don't see the real difference between doing a paired t-test and a one-sample t-test for a given model, e.g. setting the chance level to zero by subtracting 0.5 for a 2-class case. Intuitively, and this is what people have recently argued (see refs above), this does not permit correctly performing NHST, as the statistics will be erroneous (samples are included in multiple folds, which is usually not the case for e.g. experimental statistics where paired tests apply).

topepo commented 8 years ago

I still don't see the real difference between doing a paired t-test and a one sample t-test for a given model

For simple 2-group comparisons, there isn't too much of a difference.

the statistics will be erroneous

Well yes but that is a pretty black and white way of viewing it.

With the t-test analogy, a person could do a regular t-test under an experimental design where there is a correlation structure. The result is that the variance of the difference is artificially inflated (bad). However, suppose that there is missing data and many of the differences cannot be estimated. In that case, it might be better to ignore the correlation since the loss of power due to missing data is worse than overestimating the variance of the difference. As another example, when the correlation is near zero, there is literature that claims that you are better off (statistically) not estimating the covariance parameter at all.

I don't think that a comprehensively "correct" model can be estimated from these data, so we can fit one that has the smallest approximation error. "All models are wrong..."

dengemann commented 8 years ago

With the t-test analogy, a person could do a regular t-test under an experimental design where there is a correlation structure. The result is that the variance of the difference is artificially inflated (bad). However, suppose that there is missing data and many of the differences cannot be estimated. In that case, it might be better to ignore the correlation since the loss of power due to missing data is worse than overestimating the variance of the difference. As another example, when the correlation is near zero, there is literature that claims that you are better off (statistically) not estimating the covariance parameter at all. I don't think that a comprehensively "correct" model can be estimated from these data, so we can fit one that has the smallest approximation error. "All models are wrong..."

I sympathise a lot with this view. I'm just trying to think about what we can recommend to users who have to deal with reviewers from e.g. psychology and biology. I think the catch is that they would argue that this procedure is not conservative enough and generates too many false positives. What is your "real-world" impression from industry use cases regarding resampling-based testing on this one? Thanks for the discussion Max.

topepo commented 8 years ago

What is your "real-world" impression from industry use cases regarding resampling-based testing on this one?

Honestly, nobody has ever questioned it. That may not be related to whatever level of statistical rigor that it has; people don't seem to give a @$%^ about this aspect of modeling =]

Internally, during project discussions I think that the focus is on the applicability of the model. Externally, most of the reviewers who are not in JMLR or similar journals don't know much about the subject at all and focus their questions on bigger picture issues (e.g. "why didn't you just use logistic regression?")

dengemann commented 8 years ago

Honestly, nobody has ever questioned it. That may not be related to whatever level of statistical rigor that it has; people don't seem to give a @$%^ about this aspect of modeling =]

:)

Internally, during project discussions I think that the focus is on the applicability of the model. Externally, most of the reviewers who are not in JMLR or similar journals don't know much about the subject at all and focus their questions on bigger picture issues (e.g. "why didn't you just use logistic regression?")

yeah. good point, seems familiar.

kingjr commented 8 years ago

I'm getting a bit lost in your arguments; where do we go from here?

Also, what would be our typical use case for across-trials statistics? ECoG (because electrode locations are subject-specific)? Anything else?

Seeing if you guys thought up anything on this issue. I'm trying to gauge whether my decoding scores are "significant" or not; as has been mentioned here, doing permutation testing on something like this would take a long time :)

Can't you use second-order stats across subjects? i.e.

scores = list()
for subject in subjects:
    X, y = get_data(subject)   # hypothetical per-subject loader (epochs data + labels)
    clf.fit(X, y)              # in practice, cross-validate within each subject
    scores.append(clf.score(X, y))
p_value = stats(np.array(scores) - chance_level)  # e.g. a one-sample test against chance

banilo commented 8 years ago

My question would be: What is the null hypothesis that you would like to reject to establish statistical significance in a cross-trial setting?

kingjr commented 8 years ago

What is the null hypothesis that you would like to reject to establish statistical significance in a cross-trial setting?

On the assumption that the recorded individual is representative of the population, what is our confidence that the neural marker X under study reliably correlates with the experimental condition y?

choldgraf commented 8 years ago

@kingjr ah I should have been more clear. The use case I was talking about was ECoG - I don't usually do statistics across subjects, it's almost always at the electrode level (or maybe within a subject across electrodes). So in my mind, for a single electrode, I'd like to be able to put confidence bounds on the diagonal scores at least. I know that doing this is kind of murky statistical territory (because it depends on the kinds of cross validation you do etc).

In terms of null hypotheses, I think that's a tricky question as people have already alluded to here. Just saying "is the reported score different from chance level" seems like a too liberal stat. Maybe some kind of "dumb" classifier as @dengemann suggests, training the same classifier on baseline data, or doing permutation testing to get an upper/lower bound on your null scores. But that's starting to get pretty heavy computationally...

banilo commented 8 years ago

Indeed, in a classification setting, your null hypothesis could be the performance of a dummy classifier (always choosing the majority class or always choosing a class at random).

That may be a fairer comparison...
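
For illustration, both variants of "dumb" baseline are available in scikit-learn's DummyClassifier (X and y are placeholders for the trials and labels):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# chance level from always predicting the majority class
chance_majority = cross_val_score(DummyClassifier(strategy='most_frequent'), X, y).mean()
# chance level from predicting a class at random
chance_random = cross_val_score(DummyClassifier(strategy='uniform', random_state=0),
                                X, y).mean()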

choldgraf commented 8 years ago

So always choosing a random class sounds rather nice to me, because that would be super fast to simulate :) But sadly, ease of simulation does not make for statistical rigor, so I'm probably not qualified to determine whether that's legit or not.

kingjr commented 8 years ago

always choosing the majority class or always choosing a class at random

One of the issues is that there is often more than one way of being dumb (this applies outside the present topic too): in an imbalanced scenario, choosing the majority class or a random class gives you two different chance levels.

I think the discussion confirms the idea that label permutations (i.e. [shuffle, fit, score] x n_perms) is the only formally valid approach, but it is computationally extremely expensive. In practice, I suggest we i) indicate the theoretical chance level, ii) do a bunch of permuted fits on a subset of the data (typically, on baseline only), and/or iii) ignore the issue and explicitly indicate in the text that we ignore the non-independence generated by the cross-validation.
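
For reference, the full [shuffle, fit, score] loop is what scikit-learn's permutation_test_score implements; a sketch (clf, X and y are placeholders):

from sklearn.model_selection import permutation_test_score

score, perm_scores, p_value = permutation_test_score(
    clf, X, y, cv=5, n_permutations=1000, n_jobs=1)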

banilo commented 8 years ago

One of the issues is that there is often more than one way of being dumb (this applies outside the present topic too): in an imbalanced scenario, choosing the majority class or a random class gives you two different chance levels.

I agree. One could perhaps compute both chance levels and then pick the more conservative one as null hypothesis (i.e., the one with the higher accuracy score).

kingjr commented 8 years ago

Some more materials: @qnoirhomme has a paper relevant to this issue.

kingjr commented 8 years ago

Kai Goergen and Martin Hebart argue in favor of using prevalence testing instead of second-order stats: https://arxiv.org/abs/1512.00810; they have a MATLAB implementation in their toolbox.

kingjr commented 8 years ago

Kriegeskorte et al. use bootstrapping and add a 'noise ceiling' estimate: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3990488/

It would be great if these stats functions were implemented here too, if anyone is interested.

jona-sassenhagen commented 5 years ago

This still seems like an important topic to me. @kingjr @dengemann etc

dengemann commented 5 years ago

We can do an example of bootstrap / permutations. Not so high priority to me. Doing many splits and showing the intrinsic uncertainty of the cross validation is also often good enough: https://mne-tools.github.io/mne-r/articles/plot_machine_learning_evoked.html
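
A minimal sketch of that "many splits" approach with scikit-learn (clf, X, y are placeholders; the linked example does this in R):

from sklearn.model_selection import ShuffleSplit, cross_val_score

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = cross_val_score(clf, X, y, cv=cv)
# inspect or plot the distribution of scores across splits instead of reporting a single p-value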

MaryZolfaghar commented 4 years ago

I just want to check whether anyone has come up with any ideas on this issue. I am also trying to analyze my decoding scores (using an SVM for two classes) and see whether the scores are significant or not, using permutation testing and cluster correction.

agramfort commented 4 years ago

if you have decoding scores as time series and the chance level is 0.50, you could do a t-test at the group level, and the cluster stat functions we have could take as input the time series with chance (i.e. 0.5) subtracted. The t-test would test that the mean is higher than 0.5.
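
A minimal sketch of that idea with MNE's cluster stats, assuming scores is an array of shape (n_subjects, n_times) of decoding accuracies over time (names are placeholders):

from mne.stats import permutation_cluster_1samp_test

# scores: (n_subjects, n_times) numpy array of decoding accuracies (placeholder)
X = scores - 0.5  # subtract chance level for a balanced 2-class problem
t_obs, clusters, cluster_pv, h0 = permutation_cluster_1samp_test(X, tail=1)
# clusters with cluster_pv < 0.05 mark time windows where decoding exceeds chance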