mobeets / nullSpaceControl


feasibility of assessing 8d joint distribution #182

Closed mobeets closed 8 years ago

mobeets commented 8 years ago

read those papers larry suggested look into implementation etc.

mobeets commented 8 years ago

The Kolmogorov-Smirnov two-sample test compares whether two empirical distributions are equal by taking the maximum difference between their empirical cdfs. E.g., below, red and blue are two example empirical distributions:

[screenshot: two example empirical CDFs (red and blue)]

but there's no natural way of extending this to multiple dimensions.
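
(For reference, in 1d the statistic is trivial to compute, e.g. with scipy's ks_2samp; the data below is just made up for illustration:)

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.RandomState(0)
x = rng.normal(0.0, 1.0, size=500)   # sample from distribution 1
y = rng.normal(0.5, 1.2, size=400)   # sample from distribution 2

# KS statistic = max |F_x(t) - F_y(t)| over the two empirical CDFs
stat, pval = ks_2samp(x, y)
print(stat, pval)
```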

This Peter Hall paper (rec'd by Larry) and this StackExchange question both seem to suggest fitting a kernel density estimator (KDE) to each distribution and then comparing them by taking the L2 norm of their differences. (Eqn 2.9 of the Hall paper gives a Monte Carlo way of doing hypothesis testing that the distributions are the same.) Also Peter Hall mentions how you want to use the same smoothing parameter for both KDEs.
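
A rough sketch of how I read that (not Hall's procedure verbatim; the bandwidth, evaluation points, and n_boot below are all placeholder choices): fit a KDE to each sample using a shared bandwidth, take the L2 norm of the difference of the two density estimates on a common set of evaluation points, and get a null distribution for that statistic by re-splitting the pooled samples.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_on_grid(sample, eval_pts, bandwidth):
    """Evaluate a Gaussian KDE of `sample` (an (n, d) array) at `eval_pts`."""
    kde = KernelDensity(bandwidth=bandwidth).fit(sample)
    return np.exp(kde.score_samples(eval_pts))

def l2_stat(x, y, eval_pts, bandwidth):
    """L2-style distance between the two KDEs, using the same bandwidth for both."""
    diff = kde_on_grid(x, eval_pts, bandwidth) - kde_on_grid(y, eval_pts, bandwidth)
    return np.sqrt(np.mean(diff ** 2))

def mc_pvalue(x, y, eval_pts, bandwidth, n_boot=200, seed=0):
    """Monte Carlo test of 'same distribution': re-split the pooled sample."""
    rng = np.random.RandomState(seed)
    observed = l2_stat(x, y, eval_pts, bandwidth)
    pooled = np.vstack([x, y])
    null = np.zeros(n_boot)
    for i in range(n_boot):
        idx = rng.permutation(len(pooled))
        xb, yb = pooled[idx[:len(x)]], pooled[idx[len(x):]]
        null[i] = l2_stat(xb, yb, eval_pts, bandwidth)
    return np.mean(null >= observed)
```

Here x, y, and eval_pts are all (n, d) arrays, so nothing about this is specific to 1d; for 8d, though, a dense grid of evaluation points is hopeless, so eval_pts would have to be sampled points rather than a regular grid.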

mobeets commented 8 years ago

code for (an old version of?) mmd: http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm

MMD (Smola et al) generalizes the L2 metric used in the KDE approach to an RKHS. It is also shown to be a generalization of the Kolmogorov-Smirnov test and "earth-mover distances."

"Our test statistic is the maximum mean discrepancy (MMD), defined as the maximum deviation in the expectation of a function evaluated on each of the random variables, taken over a sufficiently rich function class: in our case, a reproducing kernel Hilbert space (RKHS). Equivalently, the statistic can be written as the norm of the difference between distribution feature means in the RKHS. We do not require density estimates as an intermediate step."

MMD picks a sigma bandwidth value like this: "Pick sigma such that the exponent of exp(- ||x-y|| / (2*sigma2)), in other words ||x-y|| / (2*sigma2), equals 1 for the median distance x and y of all distances between points from both data sets X and Y."
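
A quick sketch of that heuristic as I read it (the function name is mine): set 2*sigma^2 equal to the median pairwise distance over the pooled data.

```python
import numpy as np
from scipy.spatial.distance import cdist

def median_heuristic_sigma(X, Y):
    """Pick sigma so that ||x - y|| / (2 * sigma^2) = 1 at the median
    pairwise distance between points of the pooled samples X and Y."""
    Z = np.vstack([X, Y])
    dists = cdist(Z, Z)   # all pairwise Euclidean distances
    med = np.median(dists[np.triu_indices_from(dists, k=1)])  # skip the zero diagonal
    return np.sqrt(med / 2.0)
```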

mobeets commented 8 years ago

also, with regard to KDEs over histograms: "The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.[6]" (src)

[screenshot: histogram vs. kernel density estimate of the same sample]

There's also an "adaptive" KDE approach (src) that's supposed to be good for high dimensions.

mobeets commented 8 years ago

There don't seem to be any good multivariate KDE implementations for Matlab. Here's a really great post comparing the methods available in Python. Scikit-learn's might be the best in terms of speed for our sample sizes.

One key question, however, is whether or not to use the same bandwidth for every KDE across hypotheses. According to Peter Hall this is ideal, but I'm not sure how to choose that common bandwidth yet. MMD above gave a heuristic.

mobeets commented 8 years ago

in the MMD paper, section 3.3.1 introduces the L2 distance approach from Anderson et al (1994).

The Hall paper explains why you want to use the same smoothing parameter in the KDEs: "Such a choice ensures that, under the null hypothesis that the two samples of curves come from identical populations, the main effects of differing observation-time distributions and differing subsample sizes (for the different curves) cancel. As a result, the effect that smoothing has on bias is an order of magnitude less than it would be if different bandwidths, tuned to the respective curves, were used."

mobeets commented 8 years ago

According to the Python post, it seems that scikit-learn's implementation is best. Here's how to select the bandwidth for some 1d data x:

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# cross-validate the bandwidth over a range of candidate values
grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 30)},
                    cv=20)           # 20-fold cross-validation
grid.fit(x[:, None])                 # x is 1d, so reshape to (n, 1)
print(grid.best_params_)

And here's how to display it (for 1d at least) on the grid x_grid:

import matplotlib.pyplot as plt

# x_grid is the set of points to evaluate on, e.g. np.linspace(-4.5, 3.5, 1000)
kde = grid.best_estimator_
# score_samples returns the log-density, so exponentiate to get the pdf
pdf = np.exp(kde.score_samples(x_grid[:, None]))

fig, ax = plt.subplots()
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5, label='bw=%.2f' % kde.bandwidth)
# density=True on newer matplotlib (this was normed=True originally)
ax.hist(x, 30, fc='gray', histtype='stepfilled', alpha=0.3, density=True)
ax.legend(loc='upper left')
ax.set_xlim(-4.5, 3.5)

[screenshot: histogram with the cross-validated KDE fit overlaid]
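
For what it's worth, the same KernelDensity API handles multidimensional data directly, so for the 8d latents it's just a matter of fitting on an (n, 8) array (the bandwidth and data below are placeholders):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# latents: (n_timepoints, 8) array of 8d latent samples (random placeholder here)
latents = np.random.randn(1000, 8)

kde = KernelDensity(bandwidth=0.5).fit(latents)
log_density = kde.score_samples(latents[:5])   # log p(x) at a few query points
```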

mobeets commented 8 years ago

Common bandwidth choice from Hall:

"One approach to common bandwidth choice is to use a “favourite” method to compute an empirical bandwidth for each curve Xi and Yj , and then take the average value to be the common bandwidth. Another technique, appropriate in the case of plug-in rules, is to use an average value of each of the components of a plug-in bandwidth selector, and assemble the average values, using the plug-in formula, to form the common bandwidth. A third approach, valid in the context of cross-validation, is to use a criterion which is the average of the cross-validatory criteria corresponding to the different curves. For each of these methods, “average” might be defined in a weighted sense, where the weights represent the respective subsample sizes."

mobeets commented 8 years ago

So here's the run-down:

mobeets commented 8 years ago

write in Python, bootstrap random samples from grid to evaluate
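
Roughly the shape of what I mean (kde_error here is a hypothetical stand-in, not the actual kdeError implementation: score each hypothesis against the true KDE at evaluation points bootstrapped from the grid):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_error(true_latents, hyp_latents, eval_pts, bandwidth):
    """Hypothetical stand-in for kdeError: L2-style difference between the KDE
    fit to the true latents and the KDE fit to one hypothesis's latents."""
    p_true = np.exp(KernelDensity(bandwidth=bandwidth).fit(true_latents)
                    .score_samples(eval_pts))
    p_hyp = np.exp(KernelDensity(bandwidth=bandwidth).fit(hyp_latents)
                   .score_samples(eval_pts))
    return np.sqrt(np.mean((p_true - p_hyp) ** 2))

def bootstrap_scores(true_latents, hyp_latents, grid, bandwidth,
                     n_reps=5, n_samples=1000, seed=0):
    """Score a hypothesis n_reps times, each on a random draw of
    n_samples evaluation points from the grid."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_reps):
        eval_pts = grid[rng.choice(len(grid), size=n_samples, replace=True)]
        scores.append(kde_error(true_latents, hyp_latents, eval_pts, bandwidth))
    return np.array(scores)
```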

mobeets commented 8 years ago

Execution time to calculate kdeError for two hypotheses (compared to the true KDE) increases as the number of latent timepoints goes up. For comparison, 20120601 has 18,000 timepoints, which means this will take at least 5 mins, and that's just for two hyps.

[screenshot: kdeError execution time vs. number of latent timepoints]

mobeets commented 8 years ago

two ideas to make this shorter:

mobeets commented 8 years ago

Okay, doing the 2nd option above still takes a while, but it does appear to give significantly different scores:

I split the true data in half and bootstrapped from the testing half five times; that's bar 1. Bar 2 is the pruning hypothesis, fit 5 times.

[screenshot: bootstrapped scores for the true data vs. the pruning hypothesis]

mobeets commented 8 years ago

And actually the same is true if you instead bootstrap the timepoints (used both to fit the KDE and to draw from the hypotheses). This one only needed 5 reps of 1000 samples.

[screenshot: scores when bootstrapping timepoints, 5 reps of 1000 samples]

mobeets commented 8 years ago

I'm going to conclude this is infeasible.

I have it implemented, but it's going to take a lot of work to know how to interpret the various scoring systems I've come up with, and it doesn't seem very straightforward, so I'd suggest we table this option and stick with mean/variance.