Closed mobeets closed 8 years ago
The Kolmogorov-Smirnov two-sample test compares whether two empirical distributions are equal by taking the maximum difference between their empirical CDFs. E.g., below, red and blue are two example empirical distributions:
but there's no natural way of extending this to multiple dimensions.
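In one dimension, at least, the statistic is easy to compute; SciPy ships a two-sample KS test (the data below is just illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 300)   # sample from N(0, 1)
y = rng.normal(0.5, 1.0, 300)   # sample from a shifted N(0.5, 1)

# statistic = maximum gap between the two empirical CDFs
stat, pvalue = ks_2samp(x, y)
print(stat > 0, 0.0 <= pvalue <= 1.0)
```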
This Peter Hall paper (rec'd by Larry) and this StackExchange question both seem to suggest fitting a kernel density estimator (KDE) to each distribution and then comparing them by taking the L2 norm of their differences. (Eqn 2.9 of the Hall paper gives a Monte Carlo way of doing hypothesis testing that the distributions are the same.) Also Peter Hall mentions how you want to use the same smoothing parameter for both KDEs.
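A minimal sketch of the KDE-then-L2 idea (not the Hall paper's exact procedure): fit a Gaussian KDE to each sample with the same smoothing parameter, then integrate the squared difference of the two density estimates on a grid. The bandwidth factor, grid size, and use of scipy's `gaussian_kde` are all my choices here; note scipy's scalar `bw_method` is a factor multiplying the data's standard deviation, not the bandwidth itself.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_l2_distance(x, y, bw=0.3, grid_size=512):
    """Approximate L2 distance between KDEs of two 1-d samples.

    Both KDEs use the same bandwidth factor, as Hall recommends.
    The squared difference is integrated by a simple Riemann sum.
    """
    lo = min(x.min(), y.min()) - 1.0
    hi = max(x.max(), y.max()) + 1.0
    grid = np.linspace(lo, hi, grid_size)
    f = gaussian_kde(x, bw_method=bw)(grid)
    g = gaussian_kde(y, bw_method=bw)(grid)
    return np.sqrt(((f - g) ** 2).sum() * (grid[1] - grid[0]))

rng = np.random.default_rng(0)
same = kde_l2_distance(rng.normal(0, 1, 200), rng.normal(0, 1, 200))
diff = kde_l2_distance(rng.normal(0, 1, 200), rng.normal(2, 1, 200))
print(same < diff)  # distributions that differ should be farther apart
```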
code for (an old version of?) mmd: http://people.kyb.tuebingen.mpg.de/arthur/mmd.htm
MMD (Smola et al.) is a generalization of the L2 metric used in the KDE approach to a reproducing kernel Hilbert space (RKHS). It is also shown to be a generalization of the Kolmogorov-Smirnov test and "earth-mover distances."
"Our test statistic is the maximum mean discrepancy (MMD), defined as the maximum deviation in the expectation of a function evaluated on each of the random variables, taken over a sufficiently rich function class: in our case, a reproducing kernel Hilbert space (RKHS). Equivalently, the statistic can be written as the norm of the difference between distribution feature means in the RKHS. We do not require density estimates as an intermediate step."
MMD picks a sigma bandwidth value like this: "Pick sigma such that the exponent of exp(-||x-y|| / (2*sigma2)), in other words ||x-y|| / (2*sigma2), equals 1 for the median distance x and y of all distances between points from both data sets X and Y."
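Taking the quote literally (and reading ||x-y|| as the squared distance), the heuristic sets sigma² to half the median squared pairwise distance over the pooled samples. A sketch, with that interpretation as an assumption:

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic_sigma(X, Y):
    """Choose sigma so ||x-y||^2 / (2 sigma^2) = 1 at the median
    squared pairwise distance over the pooled samples X and Y."""
    Z = np.vstack([X, Y])
    med = np.median(pdist(Z, 'sqeuclidean'))
    return np.sqrt(med / 2.0)

# tiny worked example: pooled points 0, 1, 2, 3 on the line give
# squared distances [1, 4, 9, 1, 4, 1], median 2.5, sigma = sqrt(1.25)
sigma = median_heuristic_sigma(np.array([[0.0], [1.0]]),
                               np.array([[2.0], [3.0]]))
print(round(sigma, 3))  # 1.118
```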
also, with regard to KDEs over histograms: "The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.[6]" (src)
There's also an "adaptive" KDE approach (src) that's supposed to be good for high dimensions.
No multivariate KDE methods exist for Matlab. Here's a really great post on comparing methods available in Python. Scikit-learn's might be the best in terms of speed given sample size.
One key question, however, is whether to use the same bandwidth for every KDE across hypotheses. According to Peter Hall this is ideal, but I'm not sure how to do it yet. MMD above gave a heuristic.
In the MMD paper, section 3.3.1 introduces the L2 distance approach from Anderson et al. (1994).
The Hall paper explains why you want to use the same smoothing parameter in the KDEs: " Such a choice ensures that, under the null hypothesis that the two samples of curves come from identical populations, the main effects of differing observation-time distributions and differing subsample sizes (for the different curves) cancel. As a result, the effect that smoothing has on bias is an order of magnitude less than it would be if different bandwidths, tuned to the respective curves, were used"
According to the Python post, it seems that scikit-learn's implementation is best. Here's how to select the bandwidth for some 1d data x:
import numpy as np
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in older versions
from sklearn.neighbors import KernelDensity

grid = GridSearchCV(KernelDensity(),
                    {'bandwidth': np.linspace(0.1, 1.0, 30)},
                    cv=20)  # 20-fold cross-validation
grid.fit(x[:, None])
print(grid.best_params_)
And here's how to display it (for 1d at least) on the grid x_grid:
import matplotlib.pyplot as plt

kde = grid.best_estimator_
pdf = np.exp(kde.score_samples(x_grid[:, None]))

fig, ax = plt.subplots()
ax.plot(x_grid, pdf, linewidth=3, alpha=0.5, label='bw=%.2f' % kde.bandwidth)
ax.hist(x, 30, fc='gray', histtype='stepfilled', alpha=0.3, density=True)  # normed= in older matplotlib
ax.legend(loc='upper left')
ax.set_xlim(-4.5, 3.5)
Common bandwidth choice from Hall:
"One approach to common bandwidth choice is to use a “favourite” method to compute an empirical bandwidth for each curve Xi and Yj , and then take the average value to be the common bandwidth. Another technique, appropriate in the case of plug-in rules, is to use an average value of each of the components of a plug-in bandwidth selector, and assemble the average values, using the plug-in formula, to form the common bandwidth. A third approach, valid in the context of cross-validation, is to use a criterion which is the average of the cross-validatory criteria corresponding to the different curves. For each of these methods, “average” might be defined in a weighted sense, where the weights represent the respective subsample sizes."
So here's the run-down:
write in Python, bootstrap random samples from grid to evaluate
execution time to calculate kdeError for two hypotheses compared to the true kde increases as the size of the latents goes up. for comparison, 20120601 has 18,000 timepoints. that means this will take at least 5 mins, and that's just for two hypotheses.
two ideas to make this shorter:
okay, well, doing the 2nd option of the above still takes a while, but it does appear to give significantly different scores:
i split the true data in half, and bootstrapped from the testing half five times. that's bar 1. bar 2 is pruning fit 5 times.
and, actually, the same is true if you instead bootstrap the timepoints (used both to fit the kde and to take from the hypotheses). this one only needed 5 reps of 1000 samples.
i'm gonna conclude this is infeasible.
i have it implemented, but it's going to take a lot of work to know how to interpret the various scoring systems i've come up with, and it doesn't seem very straightforward. so i might suggest we table this option and stick with mean/variance.
read those papers Larry suggested; look into implementation, etc.