pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Memory collapses with precomputed block matrix #180

Closed fsvbach closed 1 year ago

fsvbach commented 3 years ago
Expected behaviour

When I run t-SNE on a symmetric 200x200 block matrix such as the attached one (distancematrix0), I expect TSNE to return 4 distinct clusters (actually 4 points only). Sklearn yields this.

Actual behaviour

Using openTSNE, the terminal crashes with full memory (about 50% of the time). If it survives, the clusters are visible; however, the result is not as satisfying.

Steps to reproduce the behavior

matrix = ...  # the attached 200x200 block matrix
tsne = TSNE(metric='precomputed', initialization='spectral', negative_gradient_method='bh')
embedding = tsne.fit(matrix)

NOTE: I am using a direct installation from GitHub as of this morning.

pavlin-policar commented 3 years ago

Please provide a minimal code example where this fails so we can reproduce and debug this. There really isn't anything obvious that would cause the memory to balloon.

Does the same thing happen when using negative_gradient_method="fft"?

fsvbach commented 3 years ago

Okay, with negative_gradient_method="fft" it works, but I sometimes get this warning:

RuntimeWarning: invalid value encountered in log
  kldivergence += sum_P * np.log(sum_Q + EPSILON)

A minimal example would be:

from openTSNE import TSNE
import numpy as np
matrix = np.genfromtxt('matrix.csv', delimiter=',')
tsne = TSNE(metric='precomputed', initialization='spectral', negative_gradient_method='bh')
embedding = tsne.fit(matrix)

with the matrix in the attached matrix.csv.

Note: It doesn't always crash, only every third or fourth time, so I think it's due to some initialization?

pavlin-policar commented 3 years ago

What I suspect may be happening is that if the points are too close together in the embedding, the BH tree balloons. However, I should have some safeguards in place so that this doesn't happen. Maybe those safeguards aren't enough. This might be plausible since you're trying to force all the points super close together. But I'd need to investigate it further. I suggest you use fft for the time being.
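For reference, a minimal sketch of that workaround, reusing the reporter's call from above with the FFT-based gradients (everything else unchanged):

from openTSNE import TSNE
import numpy as np

matrix = np.genfromtxt('matrix.csv', delimiter=',')
# same settings as in the report, only the negative gradient method differs
tsne = TSNE(metric='precomputed', initialization='spectral', negative_gradient_method='fft')
embedding = tsne.fit(matrix)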

dkobak commented 3 years ago

I expect TSNE to return 4 distinct clusters (actually 4 points only). Sklearn yields this.

Actually, the points should not be on top of each other in the embedding. When some points have zero distance in the original space, they should get finite affinities and should end up close to, but not overlapping with, each other in the embedding.

If you get overlapping points in sklearn, that would be weird.

Can you post a picture of sklearn embedding, and also of what you get from openTSNE when it does not crash?

fsvbach commented 3 years ago

Yes.

[image: openTSNE_bh] This is the result of openTSNE with the bh setting when it doesn't crash.

[image: openTSNE_fft] This is the result of openTSNE with the fft setting (where I get the above warning).

[image: sklearn] And this (similar) plot is what I get with sklearn every time, without any warning.

dkobak commented 3 years ago

Wow. None of that makes any sense to me! Unless I am missing something, the points should not overlap at all, so we should see 200 separate points. BH-openTSNE makes the most sense, but even there far fewer than 200 points are visible. In FFT-openTSNE there is 1e11 on the y-axis?! And sklearn simply shows 4 points... Very odd.

pavlin-policar commented 3 years ago

I agree, none of these plots seem correct. The repulsive forces should push the points apart at least a little bit. However, the spans of the embeddings are quite big, so maybe it could be a visual plotting thing that they look like a single point. Are the final coordinates really just a single point, or are they just really close to one another and the large point sizes and large spans just make it look that way?
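One quick way to check (a sketch, assuming embedding is the result of tsne.fit(matrix) from the example above):

import numpy as np

coords = np.asarray(embedding)
# number of distinct embedding positions, after rounding away float noise
print(np.unique(coords.round(decimals=6), axis=0).shape[0])
# span of the embedding along each axis, to compare against the plotted point size
print(coords.max(axis=0) - coords.min(axis=0))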

dkobak commented 3 years ago

It seems this has nothing to do with precomputed distances specifically, but with how we treat identical points.

from openTSNE import TSNE
import numpy as np

np.random.seed(42)

# four groups of 50 identical 10-dimensional points
X = np.concatenate((
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50)
    ))

Z = TSNE(negative_gradient_method='bh').fit(X)

print(np.unique(np.sum(Z,axis=1)).size)

This prints 6 🤯

Update: with sklearn I get points that are almost overlapping but are not identical. However, this result seems fishy to me too.

Update2: with FFT I get 7 distinct points. So points are exactly overlapping.

Update3: I checked and the same thing happens with Uniform affinities, so it's not a perplexity calibration problem.

Also note that adding a minuscule amount of noise like X += np.random.randn(200, 10) / 10000 makes the problem go away.
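Spelled out as a sketch on the same X as above (the duplicated 200x10 data), that workaround would look like:

X_noisy = X + np.random.randn(200, 10) / 10000  # tiny jitter on the duplicated rows
Z = TSNE(negative_gradient_method='bh').fit(X_noisy)
print(np.unique(np.sum(Z, axis=1)).size)  # presumably close to 200 distinct values now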

fsvbach commented 3 years ago

The points in FFT-openTSNE for example are something like

      ...      [  1.76457933,  13.81758939],
               [  2.02415051,  13.85560307],
               [  1.94253379,  13.86327763],
               [  2.01328632,  13.71940575],
               [  2.04201515,  13.7215927 ],
               [  2.17067327,  13.8131649], ...

It makes sense to me that they are very close, so that matplotlib overlaps them even with point size 1. What doesn't make sense to me: why doesn't BH-openTSNE show this behaviour, i.e. why are some points very close and others fairly far apart (and why does it crash, of course)?

Note: In this run I didn't get the RuntimeWarning, and the plot (with smaller point size) looks like [image: openTSNE_fft]. Now the y-axis is fine, too.

dkobak commented 3 years ago

I realized that I was wrong when I said that the optimal embedding should not have overlapping points. On reflection, it should be possible to reduce KL to zero by making all points from each class overlap exactly. All pairs of points overlapping in high dimensions have identical affinities, and if they overlap in 2D they will have identical low-dimensional similarities too.

If so, this means that sklearnTSNE returns the correct result.

But I see a lot of things going weird when optimizing it with openTSNE...

Using the same data as above

np.random.seed(42)
X = np.concatenate((
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50),
        np.array(list(np.random.randn(1,10))*50)
    ))

It works fine when starting from random initialization:

TSNE(negative_gradient_method='bh', initialization='random', verbose=1).fit(X)
TSNE(negative_gradient_method='fft', initialization='random', verbose=1).fit(X)

both reduce KL to near-zero and make the points roughly (but not exactly) overlap.

But using PCA initialization (where points in the initialization overlap exactly), the optimization goes wild.

TSNE(negative_gradient_method='bh', initialization='pca', verbose=1).fit(X)

BH yields negative KL values, which is mathematically impossible!
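(As a reminder of why negative values cannot legitimately occur: since both P and Q are probability distributions over the pairs,

\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \;\ge\; 0,

with equality only when q_ij = p_ij for all pairs. A negative reported value therefore has to be an artifact of the approximation or a bug.)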

===> Running optimization with exaggeration=12.00, lr=200.00 for 250 iterations...
Iteration   50, KL divergence -5.1860, 50 iterations in 0.0159 sec
Iteration  100, KL divergence -6.2852, 50 iterations in 0.0147 sec
Iteration  150, KL divergence -7.0769, 50 iterations in 0.0148 sec
Iteration  200, KL divergence -7.6468, 50 iterations in 0.0149 sec
Iteration  250, KL divergence -8.0897, 50 iterations in 0.0147 sec
   --> Time elapsed: 0.08 seconds
===> Running optimization with exaggeration=1.00, lr=200.00 for 500 iterations...
Iteration   50, KL divergence -9.2070, 50 iterations in 0.0163 sec
Iteration  100, KL divergence -9.7631, 50 iterations in 0.0144 sec
Iteration  150, KL divergence -10.1641, 50 iterations in 0.0152 sec
Iteration  200, KL divergence -10.4828, 50 iterations in 0.0149 sec
Iteration  250, KL divergence -10.7496, 50 iterations in 0.0162 sec
Iteration  300, KL divergence -10.9802, 50 iterations in 0.0150 sec
Iteration  350, KL divergence -11.1838, 50 iterations in 0.0162 sec
Iteration  400, KL divergence -11.3666, 50 iterations in 0.0149 sec
Iteration  450, KL divergence -11.5325, 50 iterations in 0.0143 sec
Iteration  500, KL divergence -11.6847, 50 iterations in 0.0151 sec

While FFT...

TSNE(negative_gradient_method='fft', initialization='pca', verbose=1).fit(X)

...diverges!

===> Running optimization with exaggeration=12.00, lr=200.00 for 250 iterations...
Iteration   50, KL divergence 0.0446, 50 iterations in 0.6821 sec
Iteration  100, KL divergence 0.0292, 50 iterations in 0.7156 sec
Iteration  150, KL divergence 0.0227, 50 iterations in 0.7206 sec
Iteration  200, KL divergence 0.0240, 50 iterations in 0.7323 sec
Iteration  250, KL divergence 0.0195, 50 iterations in 0.6954 sec
   --> Time elapsed: 3.55 seconds
===> Running optimization with exaggeration=1.00, lr=200.00 for 500 iterations...
Iteration   50, KL divergence 0.0321, 50 iterations in 0.6926 sec
Iteration  100, KL divergence 0.0815, 50 iterations in 0.6904 sec
Iteration  150, KL divergence 0.1199, 50 iterations in 0.6936 sec
Iteration  200, KL divergence 0.1419, 50 iterations in 0.6955 sec
Iteration  250, KL divergence 8.1304, 50 iterations in 0.8760 sec
Iteration  300, KL divergence 13.9849, 50 iterations in 0.6924 sec
Iteration  350, KL divergence 14.7327, 50 iterations in 0.6963 sec
Iteration  400, KL divergence 15.2104, 50 iterations in 0.6898 sec
Iteration  450, KL divergence 15.4548, 50 iterations in 0.6824 sec
Iteration  500, KL divergence 15.5808, 50 iterations in 0.7088 sec

dkobak commented 3 years ago

Update: sklearn also shows negative KL values when using PCA initialization (but not with random initialization)!

Z = SklearnTSNE(verbose=2, init='pca').fit_transform(X)

results in

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 200 samples in 0.001s...
[t-SNE] Computed neighbors for 200 samples in 0.006s...
[t-SNE] Computed conditional probabilities for sample 200 / 200
[t-SNE] Mean sigma: 0.000000
[t-SNE] Computed conditional probabilities in 0.034s
[t-SNE] Iteration 50: error = -59.8060608, gradient norm = 0.0029903 (50 iterations in 0.040s)
[t-SNE] Iteration 100: error = -75.7929688, gradient norm = 0.0015364 (50 iterations in 0.018s)
[t-SNE] Iteration 150: error = -85.3008423, gradient norm = 0.0010339 (50 iterations in 0.017s)
[t-SNE] Iteration 200: error = -92.0923309, gradient norm = 0.0007792 (50 iterations in 0.018s)
[t-SNE] Iteration 250: error = -97.3766937, gradient norm = 0.0006251 (50 iterations in 0.017s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: -97.376694
[t-SNE] Iteration 300: error = -10.6954212, gradient norm = 0.0005959 (50 iterations in 0.032s)
[t-SNE] Iteration 350: error = -10.9293900, gradient norm = 0.0005301 (50 iterations in 0.018s)
[t-SNE] Iteration 400: error = -11.2307148, gradient norm = 0.0004560 (50 iterations in 0.020s)
[t-SNE] Iteration 450: error = -11.5423441, gradient norm = 0.0003902 (50 iterations in 0.018s)
[t-SNE] Iteration 500: error = -11.8387394, gradient norm = 0.0003365 (50 iterations in 0.017s)
[t-SNE] Iteration 550: error = -12.1112652, gradient norm = 0.0002936 (50 iterations in 0.017s)
[t-SNE] Iteration 600: error = -12.3597937, gradient norm = 0.0002593 (50 iterations in 0.017s)
[t-SNE] Iteration 650: error = -12.5856562, gradient norm = 0.0002316 (50 iterations in 0.017s)
[t-SNE] Iteration 700: error = -12.7922802, gradient norm = 0.0002089 (50 iterations in 0.016s)
[t-SNE] Iteration 750: error = -12.9815378, gradient norm = 0.0001900 (50 iterations in 0.017s)
[t-SNE] Iteration 800: error = -13.1556988, gradient norm = 0.0001741 (50 iterations in 0.016s)
[t-SNE] Iteration 850: error = -13.3171625, gradient norm = 0.0001606 (50 iterations in 0.023s)
[t-SNE] Iteration 900: error = -13.4675283, gradient norm = 0.0001490 (50 iterations in 0.024s)
[t-SNE] Iteration 950: error = -13.6077213, gradient norm = 0.0001389 (50 iterations in 0.017s)
[t-SNE] Iteration 1000: error = -13.7389517, gradient norm = 0.0001301 (50 iterations in 0.027s)
[t-SNE] KL divergence after 1000 iterations: -13.738952

so it seems that sklearnTSNE and openTSNE have the same problem here.

pavlin-policar commented 3 years ago

Hmm, this is all very strange, and I'll need to look into it some more. But I think that in my implementation of BH, one of the assumptions I made was that the embedding points would never actually overlap like that. So the BH one doesn't surprise me that much. And also, such things are pretty difficult to solve with BH, if memory serves. I am a bit surprised about FIt-SNE not working, since it doesn't make any assumptions about how the points are spaced out.
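A toy illustration of why exactly coincident points are hard for a space-partitioning tree (this is not openTSNE's actual BH code, just a sketch of the general mechanism): splitting a cell can never separate two identical points, so without a depth cap the tree keeps subdividing.

import numpy as np

def split_depth(points, depth=0, max_depth=50):
    """Depth reached while trying to separate `points` by recursive
    quadrant splits around the cell's centre of mass (quadtree-style)."""
    if len(points) <= 1 or depth >= max_depth:
        return depth
    mid = points.mean(axis=0)
    child_depths = []
    for in_x in (points[:, 0] <= mid[0], points[:, 0] > mid[0]):
        for in_y in (points[:, 1] <= mid[1], points[:, 1] > mid[1]):
            child = points[in_x & in_y]
            if 0 < len(child) < len(points):
                child_depths.append(split_depth(child, depth + 1, max_depth))
    if not child_depths:  # no split made progress: the remaining points coincide exactly
        return max_depth
    return max(child_depths)

print(split_depth(np.random.randn(100, 2)))           # distinct points: shallow tree
print(split_depth(np.tile([[1.0, 2.0]], (100, 1))))   # 100 identical points: hits the cap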

Perhaps the negative KL values are the most puzzling of all.

I definitely consider this an edge case, and a very unusual usage. Adding just a little bit of noise to the distance matrix makes everything run smoothly.

dkobak commented 3 years ago

Perhaps the negative KL values are the most puzzling of all.

Aren't the KL values during BH optimization computed using expressions that are themselves approximated via BH? If so, then it's not surprising that the KL values end up being very wrong.

And also, such things are pretty difficult to solve with BH, if memory serves.

I have no idea about this. But if this is really true, then a dirty workaround would be to add some tiny noise to the initialization (with a warning) if the repulsion method is BH and if there are overlapping points... I'm just not sure what's the best way to detect that there are overlapping points in the initialization.

I am a bit surprised about FIt-SNE not working, since it doesn't make any assumptions about how the points are spaced out.

In fact, we may do the same for FFT as well if we cannot otherwise find out what's going on there.

dkobak commented 3 years ago

I'm just not sure what's the best way to detect that there are overlapping points in the initialization.

Maybe something like

if initialization.shape[0] != np.unique(initialization.round(decimals=6), axis=0).shape[0]

can work to detect if there are duplicates in the initialization array...

dkobak commented 3 years ago

Somewhat fleshing out this workaround:

import numpy as np

X = np.random.randn(1_000_000,2)
X[1000] = X[2000]

def jitter(X, precision=6):
    _, ind_unique = np.unique(X.round(decimals=precision), axis=0, return_index=True)
    ind_duplicates = np.setdiff1d(np.arange(X.shape[0]), ind_unique)
    if ind_duplicates.size > 0:
        print(f'Found {ind_duplicates.size} near-duplicate(s). Adding some jitter to them.')
        X[ind_duplicates] += np.random.randn(ind_duplicates.size, X.shape[1]) * 10**-precision
    return X

%time X = jitter(X)

This works in under a second for n = 1 million:

Found 1 near-duplicate(s). Adding some jitter to them.
CPU times: user 872 ms, sys: 12 ms, total: 884 ms
Wall time: 884 ms

So my suggestion would be to run this on any initialization (explicitly provided by the user or computed using PCA) and print a warning when adding jitter.
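For illustration, a sketch of how a user could already apply this manually on data with duplicated rows (such as the 200x10 X from the earlier example), assuming openTSNE's initialization.pca helper and the jitter() function defined above; openTSNE accepts a precomputed initialization array:

from openTSNE import TSNE
from openTSNE import initialization as tsne_init

init = tsne_init.pca(X)  # duplicated rows of X give exactly overlapping initial points
init = jitter(init)      # de-duplicate the initialization before optimizing
Z = TSNE(negative_gradient_method='bh', initialization=init).fit(X)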

pavlin-policar commented 3 years ago

Thanks for this solution, I'm actually amazed it works this fast!

At first I was kind of on the fence about this, thinking: "How often would this actually happen?" A block distance matrix is something you'd very rarely see in practice. However, duplicate entries aren't that rare at all; e.g. even in the iris data set, there are two identical rows. For iris this isn't a problem, and the two identical data points get embedded into the exact same position, so that seems to work okay. Then the question becomes: "Why doesn't it work when there are a lot of duplicate entries?"
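A quick sketch to check the iris claim (counting exactly duplicated rows in scikit-learn's copy of the data set):

import numpy as np
from sklearn.datasets import load_iris

X_iris = load_iris().data
# number of rows that exactly duplicate an earlier row (expected: 1 duplicated pair)
print(X_iris.shape[0] - np.unique(X_iris, axis=0).shape[0])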

For iris, I tried turning a bunch of rows into duplicates and seeing where it failed. Interestingly enough, it failed once I had 30 duplicate points. This doesn't seem tied to the perplexity: with 30 duplicate rows, increasing the perplexity to e.g. 70 still results in negative KL values. So this is definitely very strange. It seems like the problem has always been there, but the KL values sometimes just don't reach negative territory, and things work as long as the number of duplicate entries isn't too big.

I'm fine with using your fix as a temporary solution. Clearly, there is a bug somewhere, and it needs to be found. However, I don't think this is something that would happen very often. If you want to open a PR with this temporary fix, I'll be happy to merge it.

dkobak commented 3 years ago

I'm fine with using your fix as a temporary solution. Clearly, there is a bug somewhere, and it needs to be found. However, I don't think this is something that would happen very often. If you want to open a PR with this temporary fix, I'll be happy to merge it.

Just want to comment here briefly that I implemented this in a branch, but then observed some odd behaviour (some "near-duplicates" were detected in the initialization when there should not have been any) and haven't gotten around to investigating it further since then...