mne-tools / mne-python

MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python
https://mne.tools
BSD 3-Clause "New" or "Revised" License

Within-subject confidence intervals #5812

Open jona-sassenhagen opened 5 years ago

jona-sassenhagen commented 5 years ago

E.g., for plot_compare_evokeds: we could have inferential CIs, i.e., CIs such that non-overlap indicates significance.

Ideally with correction for multiple contrasts or something ...

See: https://www.mattcraddock.com/blog/2016/11/28/erp-visualization-within-subject-confidence-intervals/

agramfort commented 5 years ago

do we have data to demo this in an example?

jona-sassenhagen commented 5 years ago

I fear not.

larsoner commented 5 years ago

I naively expected you would just need one or two conditions for a single subject, and multiple trials, e.g., the sample dataset with any two conditions would work. No?

larsoner commented 5 years ago

Ahh, no, never mind - I see it has to do with actually having multiple subjects.

jona-sassenhagen commented 5 years ago

I naively expected you would just need one or two conditions for a single subject, and multiple trials, e.g., the sample dataset with any two conditions would work. No?

That is in fact how we currently visualise the CI functionality.

Ahh, no, never mind - I see it has to do with actually having multiple subjects.

Sorry, the proposal was badly underdescribed :)

larsoner commented 5 years ago

I think more PEBKAC of the issue viewer, not the issue opener in this case...

mmagnuski commented 5 years ago

That would be cool. I was recently explaining this to students - that the CIs in MNE are best for across-subject, not within-subject, comparisons.

jona-sassenhagen commented 5 years ago

@JoseAlanis e.g. here.

JoseAlanis commented 5 years ago

Ok, thanks for the ref. It sounds doable. I've seen this approach before, and just for comprehension: essentially, one would force participants to have the same average (across conditions) to remove between-subject variability, correct? The CI function called by plot_compare_evokeds() uses a bootstrap to find the CIs. I guess it would be the same to just run it on the previously "normalised" data, right?

Since there is no sample data for this example, I think I can run some tests on a small toy dataset I usually use for demos. But it would probably be better to provide a dataset that everybody can use in the long run. I would be happy to work on this, if no one else has started.
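For concreteness, a minimal sketch of that normalization step (a standalone helper, not existing MNE API; the array layout is assumed):

```python
import numpy as np

def within_subject_center(data):
    """Force all subjects to share the same across-condition average.

    data : array, shape (n_subjects, n_conditions, n_times)

    Removes between-subject offsets while leaving the condition
    differences within each subject untouched (Cousineau-style).
    """
    subject_means = data.mean(axis=1, keepdims=True)    # per-subject average
    grand_mean = data.mean(axis=(0, 1), keepdims=True)  # overall average
    return data - subject_means + grand_mean
```

The existing bootstrap could then be run per condition on the centered data; Morey (2008) additionally suggests rescaling the resulting variability by sqrt(C / (C - 1)) for C conditions.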

jona-sassenhagen commented 5 years ago

I assume this one would be fine too: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.457.7783&rep=rep1&type=pdf

It would be more directly applicable, because it wouldn't require having single-trial data for each subject, and thus wouldn't require an API change.

dengemann commented 5 years ago

@jona-sassenhagen @larsoner @mmagnuski @JoseAlanis what is your general idea here?

We already have a private percentile bootstrap method that could be used for this; see for example: https://www.martinos.org/mne/stable/auto_examples/time_frequency/plot_time_frequency_global_field_power.html#sphx-glr-auto-examples-time-frequency-plot-time-frequency-global-field-power-py. But we could also use a parametric bootstrap, where we sample from a probability model (e.g. a Gaussian with the estimated mean and variance parameters), or even provide analytic CIs. Perhaps making a dedicated CI function could be interesting?
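For reference, a self-contained sketch of what a percentile bootstrap of the mean does (helper name and array shapes are illustrative, not the private MNE function itself):

```python
import numpy as np

def percentile_bootstrap_ci(data, n_boot=2000, ci=0.95, seed=0):
    """Percentile bootstrap CI of the mean over axis 0 (sketch).

    data : array, shape (n_observations, n_times)
    """
    rng = np.random.default_rng(seed)
    n_obs = len(data)
    # Resample observations with replacement and average each draw.
    idx = rng.integers(0, n_obs, size=(n_boot, n_obs))
    boot_means = data[idx].mean(axis=1)                 # (n_boot, n_times)
    lo, hi = 100 * (1 - ci) / 2, 100 * (1 + ci) / 2
    return np.percentile(boot_means, [lo, hi], axis=0)  # (2, n_times)
```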

jona-sassenhagen commented 5 years ago

plot_compare_evokeds already uses your CI functionality :) The idea is to have inferential CIs. As for what kind - I think anything would be good; I would start with something simple.

I forgot that the functionality in the opener requires hierarchical data. Maybe Tryon's inferential CIs would be a good start (although it's not quite clear how useful they are absent FDR correction in this case) - they would require little API change, maybe even none.

dengemann commented 5 years ago

The idea is to have inferential CIs.

What does that mean? The CIs are inferential.

jona-sassenhagen commented 5 years ago

For the difference from zero - not necessarily for the difference to another condition.

dengemann commented 5 years ago

For the difference from zero - not necessarily for the difference to another condition.

Note that our example from above is for a single condition (baseline comparison).

jona-sassenhagen commented 5 years ago

Yes sure - so that is a case that's covered. What's not covered is, e.g., plotting two conditions and having non-overlapping CIs directly correspond to a significant difference. Right?

dengemann commented 5 years ago

Yes sure - so that is a case that's covered. What's not covered is, e.g., plotting two conditions and having non-overlapping CIs directly correspond to a significant difference. Right?

That's just a matter of doing an example, no? Bootstrap the difference of the 2 conditions over time. Taking the difference ensures that the comparison is paired, and that there are no crossing overlaps hiding the correlated uncertainty.
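A runnable sketch of that, on toy paired data (shapes and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy paired data: rows are the paired units (e.g. subjects), columns time.
cond_a = rng.normal(0.0, 1.0, size=(20, 100))
cond_b = cond_a + rng.normal(0.3, 0.5, size=(20, 100))

# Bootstrap the difference: resampling rows keeps the pairing, so the
# band reflects the correlated uncertainty of the difference itself.
diff = cond_a - cond_b
idx = rng.integers(0, len(diff), size=(2000, len(diff)))
boot_means = diff[idx].mean(axis=1)                    # (n_boot, n_times)
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5], axis=0)
```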

palday commented 5 years ago

For the difference from zero - not necessarily for the difference to another condition.

Actually, both. The CI is on the mean itself, not on the difference (unless you've computed a difference wave); the inferential step (e.g. checking overlap) doesn't need a special CI. For example, a 5% significant difference between two means is given by non-overlap of the 83% CIs for each mean.
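The arithmetic behind that rule of thumb, assuming two independent means with equal standard errors:

```python
from scipy.stats import norm

# Significant difference at the 5% level: |m1 - m2| > 1.96 * sqrt(2) * se.
# Two CIs of half-width z * se fail to overlap when |m1 - m2| > 2 * z * se.
# Equating the two thresholds gives the required per-mean coverage:
z = norm.ppf(0.975) * 2 ** 0.5 / 2  # ~1.386
coverage = 2 * norm.cdf(z) - 1      # ~0.834 -> the "83%" CIs
print(round(coverage, 3))
```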

jona-sassenhagen commented 5 years ago

That's just a matter of doing an example, no? Bootstrap the difference of the 2 conditions over time. Taking the difference ensures that the comparison is paired, and that there are no crossing overlaps hiding the correlated uncertainty.

An example would be ok, I guess - just doing a difference wave and plotting it with, e.g., plot_compare_evokeds. That is similar to, but not the same as, having the two conditions as one line + band each.

Actually, both. The CI is on the mean itself, not on the difference itself; the inferential step (e.g. checking overlap) doesn't need a special CI. For example, a 5% significant difference between two means is given by non-overlap of the 83% CIs for each mean.

Yes, something like that might be enough for a start too.

dengemann commented 5 years ago

@jona-sassenhagen for 2 conditions in one subject we actually don't have a paired statistic as the conditions are exchangeable. So you could just bootstrap each condition like we did last week in FFM.

mmagnuski commented 5 years ago

a 5% significant difference between two means is given by non-overlap of the 83% CIs for each mean.

That's a different issue, actually. The problem with the normal CIs is with within-subject designs (in the simplest case): they take into account the variance across subjects, not the variance of the difference between conditions (so the CIs are too wide and do not show what one is interested in).

palday commented 5 years ago

The bigger issue (CIs on grand means and differences in within vs. between designs) is non-trivial for anything but balanced data ... and for balanced data, it's quite trivial -- it's just the CI computed by treating the means as observations (see e.g. here). And that's ignoring the issue of item variation for fields (such as language) where there are meaningful items, a problem which has been known since the 1970s.

palday commented 5 years ago

@jona-sassenhagen for 2 conditions in one subject we actually don't have a paired statistic as the conditions are exchangeable. So you could just bootstrap each condition like we did last week in FFM.

For the paired case, the CI computed on difference wave should suffice, no? Doesn't paired t-test just reduce to the one-sample t-test computed on the differences?
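That equivalence is easy to verify on toy data:

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, size=20)
b = a + rng.normal(0.3, 0.5, size=20)  # paired, correlated conditions

# The paired t-test and the one-sample t-test on the differences
# return identical t and p values.
print(ttest_rel(b, a))
print(ttest_1samp(b - a, 0.0))
```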

dengemann commented 5 years ago

I think we also have to be clear about what we want here. Having a quick way of summarizing the effects and the uncertainty in the viz is great. Users may want to add permutation tests for exact p-values and/or proper modeling. Why don't we make a CI function that implements a few common options and expose it in a few examples?

jona-sassenhagen commented 5 years ago

For the paired case, the CI computed on difference wave should suffice, no? Doesn't paired t-test just reduce to the one-sample t-test computed on the differences?

Yes, but given that, there is still a difference between plotting a difference wave with a band for its divergence from zero, and plotting two waves with bands relating to the significance of their difference.

Maybe most generally speaking, a common use case is plotting multiple lines and somehow indicating when they are, by some measure or other, different from each other, to a statistically significant degree. That can mean many things, and have many solutions.

a CI function that implements a few common options and expose it in a few examples?

Sure:

https://github.com/mne-tools/mne-python/blob/master/mne/stats/permutations.py#L131

dengemann commented 5 years ago

For the paired case, the CI computed on difference wave should suffice, no? Doesn't paired t-test just reduce to the one-sample t-test computed on the differences?

Yes, that is my point. On the other hand, you do not have a paired difference between conditions in single subjects - only between, say, baseline and post-baseline.

dengemann commented 5 years ago

Yes, but given that, there is still a difference between plotting a difference wave with a band for its divergence from zero, and plotting two waves with bands relating to the significance of their difference.

@jona-sassenhagen 2 conditions would be tested classically with t-tests for independent samples. The 2-sample bootstrap would be appropriate here and no difference wave would be needed.

jona-sassenhagen commented 5 years ago

@jona-sassenhagen 2 conditions would be tested classically with t-tests for independent samples. The 2-sample bootstrap would be appropriate here and no difference wave would be needed.

I'm not proposing anything here for testing differences, only for visualisation.

Yes, we already have good code for visualising difference waves. That's not what I mean.

dengemann commented 5 years ago

I'm not proposing anything here for testing differences, only for visualisation.

If you bootstrap, say, the mean over channels for each condition within a subject, then non-overlap means "significant".

dengemann commented 5 years ago

https://github.com/mne-tools/mne-python/blob/master/mne/stats/permutations.py#L131

@jona-sassenhagen Yes my point was to make it public and add a few more options.

palday commented 5 years ago

For the paired case, the CI computed on difference wave should suffice, no? Doesn't paired t-test just reduce to the one-sample t-test computed on the differences?

Yes, but given that, there is still a difference between plotting a difference wave with a band for its divergence from zero, and plotting two waves with bands relating to the significance of their difference.

Maybe most generally speaking, a common use case is plotting multiple lines and somehow indicating when they are, by some measure or other, different from each other, to a statistically significant degree. That can mean many things, and have many solutions.

Not least of which is the difference between "there is at least one significant pairwise difference" and more general "tests of linear hypotheses that e.g. not all terms are simultaneously zero/some other value".

I don't think there is a good way to plot the latter beyond underlining segments where the (multiple-comparisons corrected) ANOVA/preferred statistical test result was significant.

For within-subject CIs of the form suggested by Craddock -- the easiest way would be to extend the current API to allow a dict (conditions) of lists (subjects) of lists (trials) of Epochs. This would be a non-breaking change. Potentially add an additional flag within=False that would get passed to the CI function and show the difference between within=False and within=True in the example.
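Spelled out as a call signature, the proposal might look like this (none of it is existing MNE API; names and the within flag are illustrative):

```python
# Hypothetical input: conditions -> one Epochs (with trials) per subject.
# epochs_dict = {"standard": [epochs_s01["standard"], epochs_s02["standard"]],
#                "deviant":  [epochs_s01["deviant"],  epochs_s02["deviant"]]}
#
# Proposed, non-breaking extension of the existing plotter:
# mne.viz.plot_compare_evokeds(epochs_dict, within=False)  # today's CIs
# mne.viz.plot_compare_evokeds(epochs_dict, within=True)   # within-subject CIs
```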

But if we're really going to start worrying about such things, then we should probably also start worrying about autoregressive (AR) error as well (because otherwise you have multiple comparisons in time) ... and item variation for fields where that is a thing. In that sense, I think encouraging too much inference based on the CIs alone is a bit naive.

dengemann commented 5 years ago

For within-subject CIs of the form suggested by Craddock -- the easiest way would be to extend the current API to allow a dict (conditions) of lists (subjects) of lists (trials) of Epochs. This would be a non-breaking change. Potentially add an additional flag within=False that would get passed to the CI function and show the difference between within=False and within=True in the example.

@palday I would just call the current CI function twice, once for each condition, if that is the interest. For more complex designs involving models, things will look a bit more complex, but a parametric bootstrap can do the job there.

jona-sassenhagen commented 5 years ago

For within-subject CIs of the form suggested by Craddock -- the easiest way would be to extend the current API to allow a dict (conditions) of lists (subjects) of lists (trials) of Epochs. This would be a non-breaking change. Potentially add an additional flag within=False that would get passed to the CI function and show the difference between within=False and within=True in the example.

Actually, that sounds quite good to me! I think that by itself could be quite interesting. What do you think @mmagnuski ?

Might also be just the kind of thing @JoseAlanis could quickly do, although we might not have the right data for an example.

I don't think there is a good way to plot the latter beyond underlining segments where the (multiple-comparisons corrected) ANOVA/preferred statistical test result was significant.

Not what I was thinking about here, but also something we should be considering.

item variation

That (e.g., better GLM support) is being discussed in other threads - @dengemann and I are hoping to focus GSOC on this.

If you bootstrap, say, the mean over channels for each condition within a subject, then non-overlap means "significant".

Yes, but overlap doesn't mean not significant. Take the (silly) case of (0, 1, 2, 3, 4) vs. (0, 2, 3, 4, 5). The 95% CIs for the two overlap, but the CI for the difference excludes zero, and the dependent-samples t-test is significant (with, as noted, the exact same p value as the 1-sample t-test of a minus b). This was discussed by @palday and @mmagnuski above.
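Checking that toy example with scipy:

```python
import numpy as np
from scipy.stats import sem, t, ttest_rel

a = np.array([0., 1., 2., 3., 4.])
b = np.array([0., 2., 3., 4., 5.])
for x in (a, b):
    half = sem(x) * t.ppf(0.975, len(x) - 1)
    print(f"95% CI: [{x.mean() - half:.2f}, {x.mean() + half:.2f}]")
# -> roughly [0.04, 3.96] and [0.41, 5.19]: substantial overlap ...
print(ttest_rel(b, a))  # ... yet the paired test gives t = 4.0, p ~ 0.016
```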

jona-sassenhagen commented 5 years ago

For within-subject CIs of the form suggested by Craddock -- the easiest way would be to extend the current API to allow a dict (conditions) of lists (subjects) of lists (trials) of Epochs. This would be a non-breaking change. Potentially add an additional flag within=False that would get passed to the CI function and show the difference between within=False and within=True in the example.

Thinking about it more, this might just be something we should have completely regardless of anything else discussed here - just for the convenience of not having to compute, store and handle the evokeds externally. The API is very natural. It would make a common operation much more convenient. @mmagnuski if you agree, I'll open a separate issue and assign @JoseAlanis :D

dengemann commented 5 years ago

Yes, but overlap doesn't mean not significant. Take the (silly) case of (0, 1, 2, 3, 4) vs. (0, 2, 3, 4, 5). The 95% CIs for the two overlap, but the CI for the difference excludes zero, and the dependent-samples t-test is significant (with, as noted, the exact same p value as the 1-sample t-test of a minus b). This was discussed by @palday and @mmagnuski above.

I think we have a misunderstanding here. First of all, this data is too silly to support inference. Second, this would not be a dependent sample: neither 1-sample tests, dependent tests, nor paired differences are applicable here.

dengemann commented 5 years ago

My general position is to not implement non-standard options for visualization, and to go for conservative options instead. If we can do it with the bootstrap or analytical tools, that's good @jona-sassenhagen @palday @mmagnuski.

mmagnuski commented 5 years ago

@jona-sassenhagen Do you mean extending the API of the ci function? That might be useful if it is a public function, but I am not sure about the exact API.

palday commented 5 years ago

@jona-sassenhagen It might make sense to have the plot call on a dict of lists of lists of Epochs optionally return a dict of lists of Evokeds in addition to the Figure (so that there's no need to compute these again later). Not sure if that makes the API too complex though, especially since averaging is very fast anyway.

jona-sassenhagen commented 5 years ago

I think we have a misunderstanding here

It seems so. But just consider: while non-overlap of CIs implies significance, overlap of CIs doesn't imply non-significance. That is all.

@jona-sassenhagen Do you mean extending the API of the ci function? That might be useful if it is a public function, but I am not sure about the exact API.

No - I think what @palday is suggesting, and what I like, is to extend the API of plot_compare_evokeds: feed it a dict of lists of epochs, and it internally constructs evokeds from the epochs based on the dict keys and plots them as before, though perhaps with within-subject CIs.

This means you don't need to manually construct evokeds just to get a few evoked line plots, and is in fact free from an API perspective.
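For reference, a sketch of the manual pattern this would replace, on synthetic stand-in data (variable names and the toy setup are illustrative):

```python
import numpy as np
import mne

# Synthetic stand-in data: 5 "subjects", 2 conditions, 1 EEG channel.
info = mne.create_info(["EEG 001"], sfreq=100.0, ch_types="eeg")
rng = np.random.default_rng(0)
events = np.column_stack([np.arange(20) * 50, np.zeros(20, int),
                          np.tile([1, 2], 10)])
epochs_per_subject = [
    mne.EpochsArray(rng.normal(size=(20, 1, 50)) * 1e-6, info,
                    events=events, event_id={"standard": 1, "deviant": 2})
    for _ in range(5)]

# Today's manual pattern: one Evoked per subject and condition;
# plot_compare_evokeds draws a CI band across each list.
evokeds = {cond: [ep[cond].average() for ep in epochs_per_subject]
           for cond in ("standard", "deviant")}
mne.viz.plot_compare_evokeds(evokeds, picks="eeg")
```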

@jona-sassenhagen It might make sense to have the plot call on a dict of lists of lists of Epochs optionally return a dict of lists of Evokeds in addition to the Figure (so that there's no need to compute these again later). Not sure if that makes the API too complex though, especially since averaging is very fast anyway.

Meh. I'd rather have a helper that's used internally by plot_compare_evokeds, but can also be used to do this manually. That way, a plotter isn't turned into something that returns dicts of MNE objects :)

But if we're really going to start worrying about such things, then we should probably also start worrying about autoregressive (AR) error as well (because otherwise you have multiple comparisons in time) ... and item variation for fields where that is a thing. In that sense, I think encouraging too much inference based on the CIs alone is a bit naive.

So reduce the CIs to account for dependence, and then blow them up again with FDR/Bonferroni. Why not?
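For the Bonferroni flavor at least, the adjustment is just a more stringent per-time-point coverage level (a sketch; FDR-adjusted CIs need more machinery):

```python
from scipy.stats import norm

alpha, n_times = 0.05, 200                            # e.g. 200 samples plotted
z_pointwise = norm.ppf(1 - alpha / 2)                 # ~1.96
z_simultaneous = norm.ppf(1 - alpha / (2 * n_times))  # ~3.66
# For Gaussian CIs, each per-time-point interval is widened by the ratio
# z_simultaneous / z_pointwise, so that all n_times intervals jointly
# keep (at least) 95% coverage.
```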

dengemann commented 5 years ago

It seems so. But just consider: while non-overlap of CIs implies significance, overlap of CIs doesn't imply non-significance. That is all.

P-hacking. It does when we have 2 samples; it does not if there is a dependency structure, like in paired tests, where you need to bootstrap the difference.

@jona-sassenhagen @palday @mmagnuski making a public function that computes CIs would be a first step. Then we could make use of it in several examples and also call it from within plot_compare_evokeds.

So reduce the CIs to account for dependence, and then blow them up again with FDR/Bonferroni. Why not?

I do not like this. FDR/Bonferroni is for p-values, and CIs are not p-values. Let's please not go punk.

mmagnuski commented 5 years ago

P-hacking. It does when we have 2 samples; it does not if there is a dependency structure, like in paired tests, where you need to bootstrap the difference.

That is the point of the issue - to allow for CIs of the within-subject difference in the viz (which is what one is interested in with within-subject designs).

jona-sassenhagen commented 5 years ago

I do not like this. FDR/Bonferroni is for p-values, and CIs are not p-values. Let's please not go punk.

I don't think so ... https://www.jstor.org/stable/27590520

dengemann commented 5 years ago

@palday @mmagnuski it turns out @jona-sassenhagen and I had a fundamental misunderstanding.

So forget the 2-sample problem, which is for single-subject data (within-subject is confusing). I missed the point that here we want to go over subjects.

dengemann commented 5 years ago

I don't think so ... https://www.jstor.org/stable/27590520

Let's not start with that. Bootstrap viz is a start, not the most rigorous way of doing inference.

dengemann commented 5 years ago

That is the point of the issue - to allow for CIs of the within-subject difference in the viz (which is what one is interested in with within-subject designs).

@mmagnuski we're getting there. I think we're on the same page now. Let's just not implement correction factors that are not frequently used, etc. We have some responsibility here ...

jona-sassenhagen commented 5 years ago

@mmagnuski what do you think about the plot_compare_evokeds API extension @palday suggested?

palday commented 5 years ago

single-subject data (within-subject is confusing). I missed the point that here we want to go over subjects.

I also had this problem when I first looked at this thread. All that said, I still don't think we should focus too much on making "inferential CIs" -- given all the subtleties already mentioned on this thread, I think making this stuff too easy encourages a very naive approach. The current CIs are sample-wise conservative, even if there are MC issues for the time-course itself -- and determining the window of your effect by looking at your ERPs has a tendency to introduce subtle circularity if you're not careful anyway. And even cluster tests for these things have a surprising amount of subtlety to them, as @jona-sassenhagen called attention to recently.

That's why I suggested the plot functionality the way I did -- it's convenient and "free" even without expanded CI functionality, and my proposed default for extended CI functionality would make it so that users have to actively seek out the less conservative within-subjects method. In other words, I'm largely with @dengemann on this one.

jona-sassenhagen commented 5 years ago

I'm using this already overly long thread to mention that, in writing that short note, we came across a problem with sign-flip permutation tests that we may have to deal with eventually ...

jona-sassenhagen commented 5 years ago

https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1392359