Data used to compute p-values and plotting cluster analysis

hcuve commented 4 years ago

Hi @jwdink and @brockf

I know you are not actively working on this anymore, but I have a question that is a bit more conceptual and that I was hoping to clarify.

Regarding the cluster analysis, I am slightly confused about whether the final data used to plot the results of the permutation cluster analysis and the histograms are based on test statistics of the individual time bins during the bootstrapping process, or on the cluster-summed statistics obtained from the bootstrapping. If it is the latter, shouldn't we expect not to see any density around zero, since all the cluster-summed statistics are for only significant (or above statistic threshold criteria) timebins?

Also, it is not intuitive to me why we are using the observed cluster-summed statistics to compare with either the shuffled statistics of bins or of clusters, since the observed cluster-level statistics will have high values to begin with (why don't we, for example, compare original timebin-level t-value with the null distribution of random timebin-level t-values). Granted, this is probably my confusion with the method, but it didn't get any clearer from reading the Maris reference, nor was I able to figure this out from the source code. I'd appreciate any feedback.

Thanks, Helio

jwdink commented 4 years ago

As you said, we haven't been able to maintain this codebase lately, but I think I can answer these more conceptual questions from memory. Here's my best attempt:

> I am slightly confused about whether the final data used to plot the results of the permutation cluster analysis and the histograms are based on test statistics of the individual time bins during the bootstrapping process, or on the cluster-summed statistics obtained from the bootstrapping.

The latter: the histogram shows the distribution of the sums.

> If it is the latter, shouldn't we expect not to see any density around zero, since all the cluster-summed statistics are for only significant (or above statistic threshold criteria) timebins?

Great point. I'd have to dive into the code to confirm, but I believe the sum is zero on iterations with no runs that exceed the threshold.

> Also, it is not intuitive to me why we are using the observed cluster-summed statistics to compare with either the shuffled statistics of bins or of clusters, since the observed cluster-level statistics will have high values to begin with

I don't quite follow you here.

> (why don't we, for example, compare original timebin-level t-value with the null distribution of random timebin-level t-values).

I'm not sure, but my guess is because it violates assumptions about independence.

Hope this is helpful!

hcuve commented 4 years ago

Thank you so much @jwdink, this helps. I am trying to write my own version of this and compare the results, to get the full hang of it. I'd just like to clarify one last point.

The histogram is taking all the significant simulated cluster-summed statistics + ONLY the individual time-bin t statistics where the simulation returns a non-significant t statistic, and that's what you use to calculate the p-value?

Thank you once again, and I hope the community takes up maintaining the wonderful work you've done with this package.

jwdink commented 4 years ago

> The histogram is taking all the significant simulated cluster-summed statistics + ONLY the individual time-bin t statistics where the simulation returns a non-significant t statistic, and that's what you use to calculate the p-value?

On each iteration, we get some number >= 0 of "clusters", i.e. runs of adjacent time bins whose statistic was greater than our threshold. On iterations where num-clusters is > 0, we take the sum of the statistics within each cluster and record the max of these sums in the histogram. On iterations where num-clusters is == 0, we record 0 in the histogram.
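
For anyone writing their own version, the per-iteration bookkeeping described above can be sketched roughly as follows. This is an illustrative Python sketch, not eyetrackingR's actual R code. The function names, the Welch t statistic, the one-sided positive threshold, and the p-value rule (the share of shuffled maxima at least as large as the observed one, as is standard for cluster-based permutation tests) are all assumptions made for the example. It also tests only the largest observed cluster, which is a simplification.

```python
import random
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed test statistic)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def bin_stats(group_a, group_b):
    """One t statistic per time bin. Each group is a list of participants,
    and each participant is a list with one value per time bin."""
    n_bins = len(group_a[0])
    return [
        welch_t([p[i] for p in group_a], [p[i] for p in group_b])
        for i in range(n_bins)
    ]


def max_cluster_sum(stats, threshold):
    """Largest summed statistic over runs of adjacent bins above `threshold`.
    Returns 0.0 when no bin exceeds the threshold (assumes threshold >= 0)."""
    best, running = 0.0, 0.0
    for s in stats:
        if s > threshold:
            running += s          # extend the current cluster
            best = max(best, running)
        else:
            running = 0.0         # cluster ended; start over
    return best


def cluster_permutation_test(group_a, group_b, threshold, n_iter=1000, seed=0):
    """Return (observed max cluster sum, null distribution, p-value)."""
    rng = random.Random(seed)
    observed = max_cluster_sum(bin_stats(group_a, group_b), threshold)
    pooled = group_a + group_b
    null = []
    for _ in range(n_iter):
        shuffled = pooled[:]
        rng.shuffle(shuffled)     # reassign participants to groups at random
        perm_a = shuffled[:len(group_a)]
        perm_b = shuffled[len(group_a):]
        # One value per iteration: the max cluster sum, or 0.0 if no cluster.
        null.append(max_cluster_sum(bin_stats(perm_a, perm_b), threshold))
    p_value = sum(v >= observed for v in null) / n_iter
    return observed, null, p_value
```

The `null` list is what the histogram would show: exactly one number per shuffle, so the spike at zero comes from shuffles in which no time bin crossed the threshold. Individual time-bin statistics never enter the histogram on their own.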

hcuve commented 4 years ago

Thank you @jwdink, it's very clear now. Appreciate that!

jwdink / eyetrackingR

Data used to compute p-values and plotting cluster analysis #71