shogun-toolbox / shogun

Shōgun
http://shogun-toolbox.org
BSD 3-Clause "New" or "Revised" License
3.03k stars 1.04k forks source link

Allow different numbers of points for P and Q in MMD/HSIC #1918

Closed karlnapf closed 10 years ago

karlnapf commented 10 years ago

and in general! This requires to change some minor bits of the implementation. CQuadraticTimeMMD CHSIC CLinearTimeMMD (last one requires some thought as the expression is not in the reference papers)

lambday commented 10 years ago

@karlnapf for QuadraticTimeMMD should the statistic returned still be m*MMD when m and n are different?

karlnapf commented 10 years ago

Good point! I don't really know to be honest. Have to talk to @sejdino on that.

sejdino commented 10 years ago

Hey @lambday, in the m!=n case, it ends up being (m+n)*MMD that has a nice distribution, so this is what we should be returning. Have a look at this paper: eq (3) is the unbiased MMD, (5) is biased (ignore the square root though--we are always working with the squared MMD) and eq (10) gives the asymptotic distribution. Conveniently, the permutation test will work in exactly the same way: merge everything, split randomly into m and n, recompute. For spectral approximation, you will need to take into account ratios rho_x and rho_y like in eq. (10). Gamma approximation might be messy, so I think we should do that only for m=n for now.

BTW, I would hold off changing anything in CLinearTimeMMD (and in B-test) for now - since it's not clear how best to reconcile m!=n with the streaming facility - we will need to think about this a bit.

Also, no need to do anything in HSIC - independence testing doesn't make sense with different numbers of samples.

lambday commented 10 years ago

Thanks @sejdino. Going through the paper to make sure I get these things right.

lambday commented 10 years ago

@karlnapf @sejdino Hi, I wanted to work on this and gathered some confusion regarding the spectral approximation when m != n for quadratic time MMD. The current implementation uses the formula in this paper for approximating mMMD_b^2 as this 2 * sum_l [1/(2m) * nu_l \ z_l^2] where nu_l are the eigenvalues of centered gram matrix. I thought of using eq. 10 from A kernel two-sample test paper using rho_x and rho_y, but I am not sure what should I use as an estimator of the eigenvalues of the equation when m and n are different. Would that be 1/(m+n) times eigenvalues of the centered matrix?

sejdino commented 10 years ago

Precisely, these are 1/(m+n) times the eigenvalues of the centred kernel matrix on the merged samples. These can then just be plugged into the eq. (10) from "A kernel two-sample test" with rho_x=m/(m+n) and rho_y=n/(m+n).

lambday commented 10 years ago

@sejdino thanks a lot. I am hoping to finish this by tomorrow :)

karlnapf commented 10 years ago

And done! @lambday You will have a long list of resolved issues for your proposal ;)