mobeets / nullSpaceControl


hypothesis error by timepoint? #101

Closed mobeets closed 8 years ago

mobeets commented 8 years ago

can you first get a null distribution by shuffling the observed data and using it to predict the observed data? then compare the hypotheses' errors to this.

mobeets commented 8 years ago

can't just shuffle:

say X, Y ~ N(0,1) independently, and you try to predict n draws of X with n draws of Y, sample for sample.

well, this will have some error (expected squared error Var(X) + Var(Y) = 2), but if Y is instead just a constant (the mean of X: 0) then the error will be lower (expected squared error Var(X) = 1).

so this shuffling thing doesn't measure a match in distribution... not even a match in mean.
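a quick simulation of the argument above, as a minimal sketch. the sample size, seed, and helper name `mean_squared_error` are assumptions made up for illustration, not anything from the repo:

```python
import random

def mean_squared_error(targets, predictions):
    """Average squared difference, compared sample for sample."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

rng = random.Random(0)  # fixed seed, arbitrary choice
n = 20000               # number of draws, arbitrary choice

x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # same distribution as x, independent draws

# Predicting x with independent draws from the *same* distribution:
# expected squared error is Var(X) + Var(Y) = 2.
err_same_distribution = mean_squared_error(x, y)

# Predicting x with a constant equal to its mean:
# expected squared error is Var(X) = 1.
err_constant_mean = mean_squared_error(x, [0.0] * n)

print(err_same_distribution, err_constant_mean)
```

the constant predictor wins even though it has nothing like the right distribution, which is why sample-for-sample error can't serve as a distribution-matching score.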

i'd need a nonparametric comparison of distributions...

mobeets commented 8 years ago

this is the two-sample problem

according to larry, the simplest way is basically to do kernel density estimation (aka parzen window estimation) on each of the two samples and then take the L_2 norm of the difference between the two density estimates. the bandwidth can be relatively large (see the peter hall paper linked below), which is good for high-d
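a rough sketch of what that could look like, assuming isotropic gaussian kernels with one shared bandwidth `h` (both are assumptions, as are all the function names). with gaussian kernels the L_2 distance between the two density estimates has a closed form, because the integral of the product of two gaussian kernels of bandwidth h is a gaussian of variance 2h^2 evaluated at the difference of their centres, so no numerical integration is needed:

```python
import math
import random

def gaussian_overlap(a, b, h):
    """Integral over space of the product of two isotropic Gaussian kernels
    (bandwidth h) centred at a and b: a Gaussian of variance 2*h**2 at a - b."""
    d = len(a)
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (4.0 * h * h)) / (4.0 * math.pi * h * h) ** (d / 2.0)

def mean_overlap(xs, ys, h):
    """Average kernel overlap across all pairs (one point from xs, one from ys)."""
    return sum(gaussian_overlap(a, b, h) for a in xs for b in ys) / (len(xs) * len(ys))

def kde_l2_distance(xs, ys, h):
    """L_2 norm of the difference between the Gaussian KDEs of xs and ys."""
    sq = mean_overlap(xs, xs, h) + mean_overlap(ys, ys, h) - 2.0 * mean_overlap(xs, ys, h)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding error

rng = random.Random(1)

def draw(n, mean):
    """n points from a 2-d Gaussian with identity covariance (toy data)."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]

a = draw(150, 0.0)
b = draw(150, 0.0)   # same distribution as a
c = draw(150, 1.5)   # shifted distribution

d_same = kde_l2_distance(a, b, h=1.0)
d_shift = kde_l2_distance(a, c, h=1.0)
print(d_same, d_shift)
```

the distance by itself isn't a test: it still needs a null distribution (e.g. from permuting the pooled samples) to say how big is "big".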

http://www.jmlr.org/papers/volume13/gretton12a/gretton12a.pdf
http://www3.stat.sinica.edu.tw/statistica/oldpdf/a17n412.pdf
https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

the smola MMD paper (the JMLR link above) actually has a good discussion of the parzen window method as well

mobeets commented 8 years ago

though from the smola paper, also: "This suggests it is not necessary to solve the more difficult problem of density estimation in high dimensions to do two-sample testing."
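for reference, a minimal sketch of a kernel two-sample test that skips density estimation: the unbiased MMD^2 estimate with a gaussian (RBF) kernel. the permutation null used here is a simple stand-in chosen for illustration, not necessarily the thresholding method from the paper, and the kernel width `sigma`, the number of permutations, and every function name are assumptions:

```python
import math
import random

def rbf(a, b, sigma):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma * sigma))

def mmd2_unbiased(gram, idx_x, idx_y):
    """Unbiased MMD^2 estimate from a precomputed Gram matrix of the pooled data."""
    m, n = len(idx_x), len(idx_y)
    kxx = sum(gram[i][j] for i in idx_x for j in idx_x if i != j) / (m * (m - 1))
    kyy = sum(gram[i][j] for i in idx_y for j in idx_y if i != j) / (n * (n - 1))
    kxy = sum(gram[i][j] for i in idx_x for j in idx_y) / (m * n)
    return kxx + kyy - 2.0 * kxy

def mmd_permutation_test(xs, ys, sigma=1.0, n_perm=200, seed=0):
    """Return (observed MMD^2, permutation p-value)."""
    rng = random.Random(seed)
    pooled = list(xs) + list(ys)
    gram = [[rbf(a, b, sigma) for b in pooled] for a in pooled]  # computed once
    m = len(xs)
    idx = list(range(len(pooled)))
    observed = mmd2_unbiased(gram, idx[:m], idx[m:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(idx)  # reassign pooled points to the two groups at random
        if mmd2_unbiased(gram, idx[:m], idx[m:]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

data_rng = random.Random(2)

def draw(n, mean):
    """n points from a 2-d Gaussian with identity covariance (toy data)."""
    return [(data_rng.gauss(mean, 1.0), data_rng.gauss(mean, 1.0)) for _ in range(n)]

a, b, c = draw(40, 0.0), draw(40, 0.0), draw(40, 1.5)
mmd_same, p_same = mmd_permutation_test(a, b)
mmd_shift, p_shift = mmd_permutation_test(a, c)
print(mmd_same, p_same)
print(mmd_shift, p_shift)
```

here shuffling is used the right way round: it permutes group labels to build a null for a distribution-level statistic, rather than scoring sample-for-sample prediction error.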

mobeets commented 8 years ago

the Kolmogorov-Smirnov (K-S) test is another option too
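a small sketch of the two-sample K-S statistic (the largest gap between the two empirical CDFs). note the standard K-S test is one-dimensional, so applying it here would mean going one dimension or projection at a time, which is an assumption about how it would be used. the function name is made up for illustration:

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample K-S statistic: max absolute difference between empirical CDFs."""
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    n, m = len(xs_sorted), len(ys_sorted)
    d = 0.0
    # Both empirical CDFs are right-continuous step functions, so the largest
    # gap is attained at one of the observed data points.
    for v in xs_sorted + ys_sorted:
        fx = bisect.bisect_right(xs_sorted, v) / n  # fraction of xs <= v
        fy = bisect.bisect_right(ys_sorted, v) / m  # fraction of ys <= v
        d = max(d, abs(fx - fy))
    return d

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```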