[FEA] t-SNE initialization, learning rate, and exaggeration

dkobak commented 4 years ago

I was in contact with Victor Lafargue who suggested I ping @danielhanchen for all cuML t-SNE-related questions. So here it goes. It's three suggestions, including some questions.

1) Recent research https://www.nature.com/articles/s41467-019-13056-x (disclaimer: mine) suggests that smart non-random initialization can be very important for good embeddings. For example, UMAP uses Laplacian Eigenmaps initialization, but t-SNE is often initialized randomly, which leads to an unfair comparison (see here for more details on UMAP vs t-SNE initialization https://www.biorxiv.org/content/10.1101/2019.12.19.877522v1). Many t-SNE implementations such as FIt-SNE and openTSNE have recently transitioned to using PCA initialization as the default. This should be very simple to implement, and is more sensible than using random initialization. Does cuML use random init for t-SNE? If so, it would be great to default to PCA (or some other non-random) init.

2) Another issue discussed in the same paper is the learning rate: the traditional default learning rate (200) can be WAY too small for large datasets. We recommend using the learning_rate = N/12 heuristic as the default for sample size N (where 12 is the early exaggeration factor). I saw that cuML uses some adaptive method by default ("Uses a special adaptive method that tunes the learning rate, early exaggeration and perplexity automatically based on input size") but could not find a description anywhere. How exactly are these parameters chosen?

3) We also show (in that paper and also elsewhere in upcoming work) that t-SNE with exaggeration>1 can be helpful for large data sets (and exaggeration=4 produces results very similar to UMAP). It would be great to have this parameter adjustable in the API (not early exaggeration, but the exaggeration that is used after early exaggeration stops).

danielhanchen commented 4 years ago

Hi @dkobak.

Anyway, fantastic research paper! Good work! Yes, you are correct that PCA init or, say, Laplacian Eigenmaps etc. will generate much better TSNE outputs. Currently, TSNE does support random or PCA init. The reason why random is the default is that the default in sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) is random. However, I agree with your suggestion that PCA should be the default.
Yes, LRs of 200 are way too low. I noticed this as well. However, currently a heuristic (as you mentioned) is used. Fascinatingly, by pure coincidence,
```
self.pre_learning_rate = max(n / 3.0, 1)
self.post_learning_rate = self.pre_learning_rate
self.early_exaggeration = 24.0 if n > 10000 else 12.0
```

So, currently n/3 is used (I just guessed this from empirical runs on MNIST etc.). Changing to n/12 or n/early_exaggeration sounds much more reasonable since there's now real research to back this (i.e. yours).
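To make the difference concrete, here is a minimal sketch comparing the heuristic quoted above with the n/early_exaggeration rule. The function names are hypothetical and not part of cuML; only the formulas are taken from this thread.

```python
def current_cuml_params(n):
    """Heuristic quoted above: learning rate n/3, early exaggeration 12 or 24."""
    learning_rate = max(n / 3.0, 1)
    early_exaggeration = 24.0 if n > 10000 else 12.0
    return learning_rate, early_exaggeration


def proposed_learning_rate(n, early_exaggeration=12.0):
    """Rule suggested in this thread: learning_rate = n / early_exaggeration."""
    return n / early_exaggeration


# Sample sizes are arbitrary illustrations, not benchmarks from the thread.
for n in (2_400, 70_000, 1_000_000):
    lr_now, ee = current_cuml_params(n)
    lr_new = proposed_learning_rate(n, ee)
    print(f"n={n}: current lr={lr_now:.1f}, proposed lr={lr_new:.1f}, "
          f"ratio={lr_now / lr_new:.0f}x (early_exaggeration={ee:.0f})")
```

Since the ratio of the two rules is early_exaggeration / 3, the current heuristic is 4x larger than the proposed one when early exaggeration is 12 and 8x larger when it is 24.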

So currently, it's disabled, but post_learning_rate is the one you're talking about. It's currently set to self.pre_learning_rate, but this can be changed. No algorithmic changes necessary.

dkobak commented 4 years ago

@danielhanchen Thanks for a quick reply!

Yes! Strongly agree with using PCA init by default even though sklearn does not. The documentation here https://docs.rapids.ai/api/cuml/stable/api.html?highlight=tsne#cuml.TSNE only mentions random init as supported. If PCA init is already implemented, that's great. By the way, as described in our paper, it's better to scale the initialization to have a small radius. The default random Gaussian init has std=0.0001, so we scale the PCA init to make PC1 have std=0.0001.
Oh I see. There is another reference for learning_rate = n/early_exaggeration: https://www.nature.com/articles/s41467-019-13055-y (published back to back with ours). In fact, using n/3 together with early exaggeration 12 or 24 could lead to divergences in some cases.
It's not post_learning_rate that I meant here, but post_exaggeration.
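
For reference, the scaled PCA initialization described above could look roughly like the following. This is a NumPy sketch, not cuML code: the function name `pca_init`, its parameters, and the toy data are assumptions made for illustration.

```python
import numpy as np


def pca_init(X, n_components=2, target_std=1e-4):
    """PCA initialization rescaled so that PC1 has standard deviation target_std."""
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0)  # center each feature
    # Thin SVD: columns of U scaled by S give the principal component scores,
    # ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = U[:, :n_components] * S[:n_components]
    Y /= Y[:, 0].std()       # PC1 now has std 1
    return Y * target_std    # PC1 now has std target_std


# Toy data (hypothetical): 500 points, 20 features with decreasing scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) * np.linspace(3.0, 0.5, 20)
Y0 = pca_init(X)
print(Y0.shape, Y0[:, 0].std())
```

Only PC1 is normalized; the remaining components keep their size relative to PC1, so the shape of the PCA layout is preserved and only its overall radius shrinks.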

danielhanchen commented 4 years ago

@dkobak Oh yep. PCA scaling is done. Although slightly differently. I think (can't remember 100%) it was uniform but within [-0.0001f, 0.0001f].

Interesting! n/early_exaggeration sounds much better then.

Oh yes, OK, I misread that. So essentially, after 250 iterations or so, instead of VAL *= (1 / early_exaggeration) you would like VAL *= (post_exaggeration / early_exaggeration). VAL holds the values of the CSR sparse format.
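
A minimal sketch of that change, assuming VAL is a NumPy array holding the nonzero values of the affinity matrix already multiplied by early_exaggeration. The function name and the toy values are hypothetical; post_exaggeration=4 is the example value mentioned earlier in the thread.

```python
import numpy as np


def end_early_exaggeration(val, early_exaggeration, post_exaggeration=1.0):
    """Rescale CSR values in place when the early-exaggeration phase ends.

    `val` is assumed to already hold P * early_exaggeration.
    post_exaggeration=1.0 reproduces the usual VAL *= 1 / early_exaggeration.
    """
    val *= post_exaggeration / early_exaggeration
    return val


p = np.array([0.1, 0.2, 0.3, 0.4])  # toy affinities, hypothetical values
val = p * 12.0                      # values during early exaggeration
end_early_exaggeration(val, early_exaggeration=12.0, post_exaggeration=4.0)
print(np.allclose(val, 4.0 * p))
```

With the default post_exaggeration=1.0 the behavior is unchanged, which is why no algorithmic change beyond exposing the parameter would be needed under these assumptions.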

dkobak commented 4 years ago

Yes, exactly!

resnant commented 4 years ago

Hi @danielhanchen, it would be great if we could use PCA init in TSNE! You mentioned that PCA init is already implemented in cuML TSNE, but I could not find it in the latest code: https://github.com/rapidsai/cuml/blob/branch-0.15/python/cuml/manifold/t_sne.pyx#L239 Is it in another repo or branch?

danielhanchen commented 4 years ago

@resnant Oh, I just checked. Seems like I was wrong. I was working on https://github.com/rapidsai/cuml/pull/1383 [TSNE PCA Init + 50% Memory Reductions]. It included PCA init, 50% memory reductions and some stability fixes. I just forgot that it hasn't been merged yet.

I'm not so sure when I'll get back to it, but I'll see what I can do!

resnant commented 4 years ago

@danielhanchen Got it. Your work is awesome! It will be very helpful to me if it gets merged. Thanks anyway.

danielhanchen commented 4 years ago

@resnant Thanks a lot!!! I'll see what I can do.

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

sup3rgiu commented 6 months ago

Any plans to work on PCA initialization again?

In scikit-learn, it's been the default for almost 2 years: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

rapidsai / cuml

[FEA] t-SNE initialization, learning rate, and exaggeration #2375