Auto early exaggeration to prevent collapse for small datasets

pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE

https://opentsne.rtfd.io

BSD 3-Clause "New" or "Revised" License

1.42k stars 157 forks source link

Auto early exaggeration to prevent collapse for small datasets #234

Closed dkobak closed 1 year ago

dkobak commented 1 year ago

Should fix #233 (in a way). The idea is to set the default early exaggeration to 4 for small-ish (n<10k) datasets, following the original t-SNE paper. I am adding a test that checks if the early exaggeration leads to collapse for small structure-less datasets.

(Am going to push only the test at first, to see if it fails on the current master branch.)

dkobak commented 1 year ago

I can confirm that the implemented test fails on the current master brunch, and passes with the suggested fix.

pavlin-policar commented 1 year ago

It seems to me that since this only happens for BH and not FFT, then there must be an issue with the BH implementation, and this is more of a band-aid than an actual fix to the problem. But, this problem does seem serious enough to warrant a band-aid, since running BH on datasets with N<10k is our default. However, if I understand, this only happens when the datasets don't have any real structure, so this shouldn't be to common. Either way, is this ready to merge?

dkobak commented 1 year ago

Yes, there is definitely some issue with the BH implementation which is triggered when embedding points get too close to each other or overlap exactly (either in the initialization or during the optimization). I have no idea how to fix that, and haven't even looked into it.

At the same time, early exaggeration should not really lead to points collapsing on each other, and it seems that using smaller early exaggeration value for small datasets prevents that. So my thinking is, it's a good idea anyway -- irrespective of the issues with BH. Of course using exag 4 for N<10k and 12 for N>10k seems a bit arbitrary, but then any of those values is a bit arbitrary, and it's all just heuristics. What I like about this specific suggestion, is that both values (4 and 12) are taken from the literature.

However, if I understand, this only happens when the datasets don't have any real structure, so this shouldn't be to common.

Yes, I think triggering this bug is actually difficult. The current master probably works fine for any data with some structure in it.

I would say it's ready to merge... But maybe we should first try it out on some real datasets, just to stress test it a bit. I mean, comparing early exag 12 vs 4 on some real datasets, maybe Iris, maybe some small RNAseq data with <10k points, maybe MNIST subsets. I could try to run some tests.

dkobak commented 1 year ago

I ran some checks using the following code:

def compare(X, exaggerations=[4, 12]):
    fig, axs = plt.subplots(ncols=2, nrows=2, figsize=(8,6), layout='constrained')

    def savekl(it, kldiv, emb):
        global kl
        kl = kldiv

    for num, ex in enumerate(exaggerations):
        Z = TSNE(early_exaggeration=ex, n_iter=0, callbacks=savekl).fit(X['data'])
        axs[0,num].scatter(*Z.T, c=X['target'], s=2)
        axs[0,num].set_title(f'early_exaggeration={ex}, n_iter=0')

        Z = Z.optimize(n_iter=500)
        axs[1,num].scatter(*Z.T, c=X['target'], s=2)
        axs[1,num].set_title(f'n_iter=500')
        axs[1,num].text(.99, .99, f'KL: {kl:.4f}', transform=axs[1,num].transAxes, ha='right', va='top')

    return fig

sklearn.datasets.load_iris

test-early-exaggeration-12-or-4-iris

No difference. Both ex=4 and ex=12 lead to collapse of individual clusters. But this does not affect downstream results.

sklearn.datasets.load_digits

test-early-exaggeration-12-or-4-digits

No difference, no collapse.

sklearn.datasets.load_breast_cancer (standardized)

test-early-exaggeration-12-or-4-breast

No difference for the final result, but ex=12 leads to collapse to 1-dim structure, whereas ex=4 does not.

sklearn.datasets.load_wine (standardized)

test-early-exaggeration-12-or-4-wine

Both values lead to collapse to 1-dim structure, final result not affected.

np.random.randn(100,10)

test-early-exaggeration-12-or-4-toy

Reproduces #233 issue. Strong collapse using ex=12, final results strongly affected due to some problem with the BH optimization.

Random data with some structure

X['data'] = np.random.randn(100,10)
X['data'][50:,0] += 3
X['target'] = np.zeros(100)
X['target'][50:] += 1

test-early-exaggeration-12-or-4-toy2

The same problem as above despite some structure in the data.

My conclusion is that the PR makes sense.

pavlin-policar commented 1 year ago

Wonderful! Thanks for looking into this. Although, I can't say I'm altogether convinced.

I've always considered the rationale for early-exaggeration phase as "it's hard to untangle the data if we use no exaggeration; everything starts off mixed up, and clusters fall apart. Instead, let's squish neighboring points close together, and untangle the clusters from one another. Then, we can run with no exaggeration to decompress the clusters". So conceptually, having clusters collapse into a small ball during EE seems completely fine, as long as they are "decompressed during the standard phase.

If EE=4 did this for all the small data sets, that would be fine, but it seems that it doesn't. In the sklearn.datasets.load_breast_cancer (standardized) example, the top-left panel EE=4 doesn't really compress the data like it does in other examples. It seems strange that it doesn't though, since it compresses the data in all other examples. Could this be a bug? But this could indicate that there might be real-world data sets where EE=4 wouldn't "untangle" the clusters during the EE phase, which is problematic.

I am also a bit reluctant to make this change because this will lead to different embeddings for small data sets than from what we produce now. Yes, the embeddings will show the same things, but they won't be identical, and there will inevitably be questions from users as to what's going on.

There doesn't really seem to be anything really wrong with using EE=12 everywhere, other than that we trigger the BH bug on some rare, synthetic datasets. EE=4 just avoids triggering that particular bug. I would much rather fix the BH bug and keep things as they are, to avoid unnecessary downstream changes in the embeddings. I'd rather have a crack at fixing the BH implementation first to see if we can't fix the real bug there. If we can't figure it out, we'll merge this PR.

dkobak commented 1 year ago

I agree. If the bug with the BH implementation is fixed, then there is no need for this PR. I was thinking that EE=4 is objectively better for small datasets than EE=12, but after looking at these examples I see that sometimes it's better, sometimes not, and sometimes both values are clearly both not ideal. Ideally I think one would have a more sophisticated heuristic for setting the EE, but we don't have any, so then one should rather keep the current EE=12 even if in some cases it's not great.

However, I don't think I will have time to look into the BH optimization any time soon.

pavlin-policar commented 1 year ago

I think I've found the bug. All I had to do was change the threshold for duplicate detection in the BH implementation and the problem seems to go away. @dkobak dould you please confirm this fixes the issue on your end too? Note that you'll have to recompile the binaries with python setup.py build_ext --inplace for the change to take effect.

dkobak commented 1 year ago

Oh this is really cool! If test_early_exaggeration_does_not_collapse passes, then I don't have any other examples where the bug is triggered, so there is nothing to check on my end.

pavlin-policar commented 1 year ago

I did end up changing the test a bit. Your version checked for collapse after the EE phase, but I check for collapse only after the entire optimization process. It seems to me that it doesn't really matter if the embedding becomes really small at some point, as long as the final embedding is fine.

dkobak commented 1 year ago

Yes I agree.