lejon / TSne.jl

Julia port of L. J. P. van der Maaten and G. E. Hinton's t-SNE visualisation technique.

Use of minD / Doffset when computing perplexities #11

dbeach24 closed this issue 7 years ago

dbeach24 commented 7 years ago

Not necessarily a bug, but I'm a bit confused by your use of the minimum distance (minD) and the Doffset parameter in Hbeta!(). (These do not appear in the author's Python version.) I can't be sure, but I'm wondering whether subtracting the distance to the nearest neighbor has somehow affected the beta values and conditional probabilities, and, ultimately, the preservation of local distances.
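
For reference, here is roughly what I expect the per-point computation to look like based on the author's tsne.py (an illustrative sketch, not this package's exact code); Hbeta!() here additionally subtracts minD and adds Doffset before exponentiating, and I'm unsure how that interacts with the beta search:

```julia
using LinearAlgebra

# Sketch of the reference computation for one point.
# D: squared distances to all other points; beta: precision of the Gaussian.
function hbeta_reference(D::AbstractVector, beta::Real)
    P = exp.(-beta .* D)                      # unnormalized affinities
    sumP = sum(P)
    H = log(sumP) + beta * dot(D, P) / sumP   # entropy; perplexity = exp(H)
    return H, P ./ sumP
end
```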

I'm doing a comparison of t-SNE algorithms for a class project, and I see some differences in the neighborhood spacings produced by the Julia version versus the author's original implementation in Python. When both algorithms are run on a 4000-point MNIST sample, the results appear different, and the Python version seems to preserve the spacing around the clusters better than the Julia version:

[figure: Python vs. Julia t-SNE results on the 4000-point MNIST sample]

I'd be interested to know your thoughts on this.

dbeach24 commented 7 years ago

I created a modified version that drops the minD subtraction and the Doffset value from the Hbeta! computation. With this change, I get a result that much more closely matches the original:

EDIT: Replacing the image with a more thorough comparison. The rows show sample sizes 1000, 2000, and 4000, respectively.

[figure: comparison grid; rows are sample sizes 1000, 2000, and 4000]

lejon commented 7 years ago

Thanks so much for the investigation! That certainly looks suspicious! Could you open a pull request with your fixes?

lejon commented 7 years ago

@alyst I believe the offset was introduced in your commit

Hbeta(): avoid & detect P-value degeneration: b5e4e02e09121fa607f00be1e615e49445413a3f

Any idea what is causing this?

alyst commented 7 years ago

@lejon Yes, I added Doffset to avoid numerical degeneration of log(sum(exp(D))). In theory this should still compute the same formula, but maybe something got screwed up in this or a subsequent commit.
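
(For the record, the shift is the standard log-sum-exp trick; toy numbers below, just to show the failure mode it guards against:)

```julia
# For large beta .* D every term underflows to zero, so log(sum(exp(...)))
# becomes -Inf.  Shifting by the minimum keeps the largest term at
# exp(0) = 1, and the log-sum-exp identity recovers the same value:
D, beta = [1000.0, 1100.0, 1200.0], 1.0
minD = minimum(D)

log(sum(exp.(-beta .* D)))                           # -Inf (underflow)
-beta * minD + log(sum(exp.(-beta .* (D .- minD))))  # ≈ -1000.0 (stable)
```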

alyst commented 7 years ago

@dbeach24 Could you please post the script that you used to generate the plot? I would like to check whether I can fix the bug.

alyst commented 7 years ago

@dbeach24 Or you can test the branch in #12

dbeach24 commented 7 years ago

@alyst The changes you made in #12 are identical to my local modification, and therefore will correspond to the middle column of the graphic I posted earlier.

EDIT: Strike that! This branch is still subtracting minD from the values in the distance matrix. Not sure this is the same anymore, but will try a test soon.

alyst commented 7 years ago

Yes, it's still subtracting minD, but the result should be identical to the original algorithm (excluding the cases where the original algorithm would overflow). In b5e4e02e09121fa607f00be1e615e49445413a3f I forgot to take dot(D, P) into account.
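
To spell out why: under the shift, the log(sumP) term gains +beta*minD and the beta*dot(D, P)/sumP term loses exactly beta*minD, so the entropy is unchanged as long as both terms use the shifted distances. A toy check (illustrative, not the package code):

```julia
using LinearAlgebra

D, beta = rand(100) .* 10, 2.5
minD = minimum(D)

P  = exp.(-beta .* D);            sumP  = sum(P)
Ds = D .- minD
Ps = exp.(-beta .* Ds);           sumPs = sum(Ps)

H_ref   = log(sumP)  + beta * dot(D,  P)  / sumP
H_shift = log(sumPs) + beta * dot(Ds, Ps) / sumPs
H_ref ≈ H_shift  # true -- provided the dot term uses the shifted distances
```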

dbeach24 commented 7 years ago

Okay, here's my plot:

[figure: plots from the three t-SNE versions]

In short, they all look about the same.

alyst commented 7 years ago

The Python results have a lower KL divergence; do you think that has some impact on the layout quality?

dbeach24 commented 7 years ago

No visual difference that I can immediately detect. Do you think the KL difference is significant?

alyst commented 7 years ago

Do you use the same MNIST subset (for a given N) for all three t-SNE versions, or is it a different random subset each time? In my experience KL reaches a near-optimal value rather quickly, and then the improvement rate decreases dramatically, so a 0.95 vs. 0.99 difference could explain why the Python version is 15 times slower. Do you track the number of iterations?

dbeach24 commented 7 years ago

@alyst I subset the data first, separately, and save it to dedicated test files, so all of these runs should be consuming the same subsets of MNIST. I thought I remembered the Python version producing different KL values between runs. (I am not using a fixed random seed in my testing of the algorithms, so I assumed this was expected.)

I could try some repeated runs of the Python version and sample the distribution of final KL values after 1000 iterations. That will take me a bit of time to set up, however.

dbeach24 commented 7 years ago

Also, based on my reading, the Python code is rather inefficient. I assumed that the 15x performance difference was due to the suboptimal way the Python version is coded, as well as Julia's intrinsic performance advantage.

alyst commented 7 years ago

This implementation uses a random layout by default (controlled by the pca_init param), so different KL values between runs are also expected. Also, by default it does max_iter=1000 iterations. Does that match the Python version?

dbeach24 commented 7 years ago

Both algorithms use 1000 iterations, perplexity=20.0, and a PCA reduction to 50 dimensions as a preprocessing step.
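
For concreteness, the Julia call uses the positional form from the TSne.jl README (output dims, PCA-reduced dims, iterations, perplexity); X here stands for the saved MNIST subset:

```julia
using TSne

# X: the saved MNIST subset (rows are observations).
# 2 output dims, PCA to 50 dims, 1000 iterations, perplexity 20.0.
Y = tsne(X, 2, 50, 1000, 20.0)
```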

lejon commented 7 years ago

Thanks for your help with this! :) I have merged your pull request, @alyst!

lejon commented 7 years ago

Unless you protest, I'll consider this closed! Cheers!