pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Update docs #188

Closed: pavlin-policar closed this 3 years ago

pavlin-policar commented 3 years ago
Description of changes
Includes
dkobak commented 3 years ago

A small comment on your (very useful) speed benchmarks: I was reviewing a paper a few days ago that argued, among other things, that UMAP is much faster than t-SNE. The reported runtimes for openTSNE were very long, even though the sample sizes were only ~1000-10000. I was confused at first, but then realized that they did not reduce the dimensionality with PCA and had something like ~20000 dimensions. It turns out that in this setting Annoy gets very slow, and neighbors='exact' in openTSNE actually works much faster than using Annoy.
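The effect is easy to reproduce by timing just the neighbor-search step; a minimal sketch along these lines (random data, with the shape and n_jobs made up for illustration):

```python
# Time only the kNN/affinity step on high-dimensional data (no PCA),
# comparing Annoy against exact search. Sizes are illustrative.
import time

import numpy as np
from openTSNE.affinity import PerplexityBasedNN

rng = np.random.default_rng(0)
X = rng.standard_normal((2_000, 20_000))  # ~20k dims, no PCA reduction

for method in ("annoy", "exact"):
    start = time.time()
    PerplexityBasedNN(X, perplexity=30, method=method, n_jobs=8)
    print(f"method={method!r}: {time.time() - start:.1f}s")
```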

Pynndescent does not seem to suffer from this problem nearly as much, which makes UMAP run much faster. However, setting neighbors='pynndescent' in openTSNE did not really speed things up, probably because openTSNE uses pynndescent to query 90 neighbors, whereas UMAP only needs 15 by default.
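For completeness, this is roughly the setup I tried; openTSNE ties the number of queried neighbors to the perplexity (about 3x, so 90 for the default perplexity of 30), which is presumably why swapping in pynndescent doesn't buy much here (data shape again made up):

```python
# Sketch of running openTSNE with pynndescent as the kNN backend. With the
# default perplexity of 30, openTSNE queries ~3 * 30 = 90 neighbors per
# point, whereas UMAP's n_neighbors defaults to 15.
import numpy as np
import openTSNE

rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 20_000))  # high-dim data, no PCA

tsne = openTSNE.TSNE(
    perplexity=30,            # default; implies ~90 neighbor queries
    neighbors="pynndescent",  # use pynndescent instead of Annoy
    n_jobs=8,
    random_state=0,
)
embedding = tsne.fit(X)
```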

pavlin-policar commented 3 years ago

I have actually noticed that as well. I am running further benchmarks on higher-dimensional data (2k dims) to compare Annoy and pynndescent, but it takes several days to finish all these runs. I'm planning to update the benchmarks once those finish as well. I also have plots where I split the runtime into the NN search and optimization phases, which might be of interest to you.

Plots:

NN search:
![benchmarks-umap-nns](https://user-images.githubusercontent.com/5758119/120921451-0124a300-c6c4-11eb-8455-4f7dd201d13d.png)

Optimization:
![benchmarks-umap-optim](https://user-images.githubusercontent.com/5758119/120921454-0386fd00-c6c4-11eb-92e6-bba481908a59.png)

The openTSNE optimization is clearly a lot faster in the multi-threaded case, but the neighbor search is slower. This is expected, of course, since UMAP only needs to look for 15 neighbors instead of the 90 that openTSNE uses by default. And for larger data sets I like to use even larger perplexity values, e.g. 500, which makes this slower still. So UMAP has a clear advantage here simply because it doesn't need as many neighbors. I've noticed that pynndescent has gotten a lot faster since the last time I ran it, but it's still slower than Annoy on the 50-dimensional data.
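For reference, the two stages can be timed separately with the lower-level API, roughly like this (synthetic stand-in data; perplexity=500 as an example of a large-perplexity run):

```python
# Sketch of timing the NN search / affinity phase separately from the
# optimization phase. With perplexity=500, roughly 3 * 500 = 1500
# neighbors are queried per point.
import time

import numpy as np
from openTSNE import TSNEEmbedding, initialization
from openTSNE.affinity import PerplexityBasedNN

rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 50))  # e.g. data already reduced to 50 dims

start = time.time()
affinities = PerplexityBasedNN(X, perplexity=500, method="annoy", n_jobs=8)
print(f"NN search + affinities: {time.time() - start:.1f}s")

start = time.time()
init = initialization.pca(X, random_state=0)
embedding = TSNEEmbedding(init, affinities, n_jobs=8)
embedding.optimize(n_iter=250, exaggeration=12, inplace=True)  # early exaggeration
embedding.optimize(n_iter=500, inplace=True)                   # standard phase
print(f"optimization: {time.time() - start:.1f}s")
```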

I didn't include this in the docs, since most people wouldn't care about the separate stages of the algorithms.

However, using that many dimensions just never seems like a good idea. Most commonly, people use Euclidean distance, and PCA, being an orthogonal projection, approximately preserves Euclidean distances, so in that case running PCA first seems like a no-brainer. I'm not sure about other distance measures, but I often use cosine distance on top of a PCA representation, and that seems to work pretty well too.
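Concretely, the kind of pipeline I mean, as a sketch (shapes made up; sklearn's PCA is just one option):

```python
# Sketch: reduce to ~50 PCA components first, then run t-SNE on the
# reduced representation, here with cosine distance.
import numpy as np
import openTSNE
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 20_000))  # stand-in for raw high-dim data

X_pca = PCA(n_components=50).fit_transform(X)  # keeps most Euclidean structure
embedding = openTSNE.TSNE(metric="cosine", n_jobs=8, random_state=0).fit(X_pca)
```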