Barnes-Hut optimization with the default learning rate collapses on small datasets

dkobak commented 1 year ago

This is unexpected. I wanted to take another look at why the pickling test fails in my pending PR, and suddenly noticed that the following happens in the current master branch:

from openTSNE import TSNE

x = np.random.seed(42)
x = np.random.randn(100,10)
e = TSNE(verbose=True).fit(x)

Output:

--------------------------------------------------------------------------------
TSNE(early_exaggeration=12, verbose=True)
--------------------------------------------------------------------------------
===> Finding 90 nearest neighbors using exact search using euclidean distance...
   --> Time elapsed: 0.00 seconds
===> Calculating affinity matrix...
   --> Time elapsed: 0.01 seconds
===> Calculating PCA-based initialization...
   --> Time elapsed: 0.00 seconds
===> Running optimization with exaggeration=12.00, lr=8.33 for 250 iterations...
Iteration   50, KL divergence -0.3941, 50 iterations in 0.0226 sec
Iteration  100, KL divergence -44.2007, 50 iterations in 0.0136 sec
Iteration  150, KL divergence -44.2007, 50 iterations in 0.0125 sec
Iteration  200, KL divergence -44.2007, 50 iterations in 0.0113 sec
Iteration  250, KL divergence -44.2007, 50 iterations in 0.0131 sec
   --> Time elapsed: 0.07 seconds
===> Running optimization with exaggeration=1.00, lr=100.00 for 500 iterations...
Iteration   50, KL divergence -44.2007, 50 iterations in 0.0126 sec
Iteration  100, KL divergence -44.2007, 50 iterations in 0.0121 sec
Iteration  150, KL divergence -44.2007, 50 iterations in 0.0114 sec
Iteration  200, KL divergence -44.2007, 50 iterations in 0.0124 sec
Iteration  250, KL divergence -44.2007, 50 iterations in 0.0136 sec
Iteration  300, KL divergence -44.2007, 50 iterations in 0.0136 sec
Iteration  350, KL divergence -44.2007, 50 iterations in 0.0119 sec
Iteration  400, KL divergence -44.2007, 50 iterations in 0.0130 sec
Iteration  450, KL divergence -44.2007, 50 iterations in 0.0118 sec
Iteration  500, KL divergence -44.2007, 50 iterations in 0.0134 sec
   --> Time elapsed: 0.13 seconds

So the KL divergence is negative, and the embedding coordinates are close to machine zero:

TSNEEmbedding([[-7.84121109e-31, -2.26367418e-29],
               [-4.10631512e-31,  5.68397401e-30],
               [ 1.52936635e-32,  2.00352502e-31],

This does not happen if I use

e = TSNE(verbose=True, negative_gradient_method='fft').fit(x)

so it has to be something with Barnes-Hut.

Moreover, it seems it broke only recently, because using our old default learning rate

e = TSNE(verbose=True, learning_rate=200).fit(x)

converges correctly! To the same loss as using FFT.

Weirdly, current default learning rate for is 8.3, and then 100, both smaller than 200, so I don't understand how it can lead to problems :-/

pavlin-policar commented 1 year ago

Hmm, this is a bit worrying and definitely looks like something strange that we need to investigate. I'm getting the same results. Perhaps it has something to do with how we handle learning rate now, where we incorporate the learning rate into the . The collapse seems to stem from the early exaggeration phase.

For instance,

e = TSNE(verbose=True, early_exaggeration=1, learning_rate="auto").fit(x)

works fine, but

e = TSNE(verbose=True, early_exaggeration=12, learning_rate="auto").fit(x)

produces what you reported.

I thought it may have something to do with the learning rate, which, for ee=12, turns out to be around 8, but

e = TSNE(verbose=True, early_exaggeration=1, learning_rate=8).fit(x)

works fine. It seems that a combination of low learning rate and higher exaggeration values produce this collapse.

Perhaps the simplest fix might be to keep a minimum learning rate, e.g. around 100. This fixes the problem in this case. However, this feels more like a band-aid than a real fix. There must be something in the BH implementation, perhaps due to rounding. Definitely worth investigating, as this doesn't seem like such an unusual use-case.

dkobak commented 1 year ago

Yeah that's interesting. I have a suspicion that the reason it works fine with learning rate 100 is almost accidental. My feeling is that the embedding does not converge! The learning rate is too big, so the points jump around 0 without converging anywhere but also without collapsing.

bh-200

This kinda works for subsequent optimization but actually it's not what one wants to have!

When we set learning rate to 8, it does not fluctuate anymore, but steadily collapses to machine precision zero. I suspect the true size of this embedding should be really small, and maybe BH goes haywire and makes everything collapse at some point? Or perhaps the true size of the embedding really is 0? In a sense that this somehow minimizes the loss function? I'm not sure about it.

But the main issue maybe that early exaggeration factor 12 is too large for n=1000, and we just did not see it before because this was masked by the learning rate being too large and so everything was jumping around.

I want to explore this a bit more.

dkobak commented 1 year ago

Also, it only happens if the data are unstructured. If you generate the data like this:

x = np.random.seed(42)
x = np.random.randn(100,10)
x[:50, 0] += 10

then using the defaults works fine and does not collapse to a single point.

dkobak commented 1 year ago

I think there may be two distinct issues here.

The first issue is some intricate problem with the BH implementation that leads to negative KL values. We saw this before in #180 when the points were overlapping in the initialization. Here the same thing happens whenever the points get very close to each other during optimization.

The second issue is that early exaggeration 12 is "too strong" for some datasets, in particular small ones. This results in the embedding collapsing either to a single point, or sometimes to a 1-dimensional line. This may or may not lead to problems in subsequent optimization, and also seems highly dependent on the dataset and on the perplexity. I now played a bit around with my simulation and this happens only for some particular values of the sample size and dimensionality (and perplexity).

Note that the original 2008 t-SNE paper used early exaggeration factor 4. It was set to 12 in the Barnes-Hut t-SNE paper from 2012. I think this agrees with the idea that smaller datasets need smaller early exaggeration. Also, I recall that in https://www.nature.com/articles/s41586-020-2907-3 I wrote the following:

We used the t-SNE implementation from scikit-learn Python library with the default perplexity (30), early exaggeration 4 (the default value 12 can be too large for small data sets), and scaled PCA initialization[23].

(the sample size was ~1000), but never really explored this systematically, and can't remember what exact issues I had with the factor 12 back then.

My thinking now is that it may actually make sense to use early_exaggeration="auto", defaulting to 4 for small enough datasets and to 12 otherwise. Not sure where to set the threshold, but maybe it could be the same as what we use for BH/FFT choice: auto means BH for n<10,000.

pavlin-policar / openTSNE

Barnes-Hut optimization with the default learning rate collapses on small datasets #233