pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Adding tiny amount of noise to PCA/spectral init to prevent points from overlapping #224

Closed dkobak closed 1 year ago

dkobak commented 1 year ago

A student encountered an unusual situation with spectral initialization: due to the specifics of the dataset, some points were EXACTLY overlapping in the initialization. These points then stayed "stuck" to each other forever: they felt the same repulsive force from all other points and so could not separate, even though there should actually be repulsion between them. Adding a tiny amount of random noise to the initialization solved the problem and made the points spread over the embedding, as expected.
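The stuck-points effect can be reproduced in a few lines of NumPy. This is only a minimal sketch, not openTSNE code: `repulsive_forces` is a hypothetical helper that evaluates the standard t-SNE repulsive term directly, and the coordinates are made up. Each pairwise term is scaled by the difference `y_i - y_j`, so two coincident points exert no force on each other and receive identical forces from everything else.

```python
import numpy as np

def repulsive_forces(y):
    """Repulsive part of the t-SNE gradient (hypothetical helper, O(n^2))."""
    diff = y[:, None, :] - y[None, :, :]           # pairwise differences, shape (n, n, d)
    w = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))    # Student-t kernel weights
    np.fill_diagonal(w, 0.0)                       # no self-interaction
    q = w / w.sum()                                # normalized similarities q_ij
    # Each pair pushes point i away from point j along (y_i - y_j).
    return 4.0 * np.sum((q * w)[:, :, None] * diff, axis=1)

# Made-up embedding in which the first two points coincide exactly.
y = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 0.5], [-0.5, 1.0], [0.3, -0.8]])
forces = repulsive_forces(y)

# Same embedding with a tiny amount of Gaussian noise added.
rng = np.random.default_rng(0)
y_jittered = y + rng.standard_normal(y.shape) * 1e-5
forces_jittered = repulsive_forces(y_jittered)

print(forces[0], forces[1])
print(forces_jittered[0], forces_jittered[1])
```

Without noise, the first two rows of `forces` match, so the repulsive term can never move the two points apart. With the jitter they no longer coincide, and the term between them becomes nonzero.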

This reminded me of another issue we discussed a while ago, https://github.com/pavlin-policar/openTSNE/issues/180 (still open), where points exactly overlapping in the initialization were causing some problems.

My suggestion is to always add a tiny amount of noise to all initializations that we compute. Specifically, in the rescale() function here https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/initialization.py#L9 I would replace

x /= np.std(x[:, 0]) * 10000

with

x /= np.std(x[:, 0]) * 10000
x += np.random.randn(*x.shape) * 1e-5

This would affect PCA and spectral init, but would not affect a user-provided init.
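For illustration, the proposed change could look roughly like the sketch below. It is not the library's actual `rescale()`: the name `rescale_with_jitter` and the `jitter` and `random_state` parameters are assumptions made for this example, and it uses NumPy's `default_rng` in place of `np.random.randn` so that the noise is reproducible.

```python
import numpy as np

def rescale_with_jitter(x, jitter=1e-5, random_state=None):
    """Rescale an initialization and add tiny Gaussian noise (hypothetical sketch)."""
    rng = np.random.default_rng(random_state)
    x = np.array(x, dtype=float, copy=True)        # work on a float copy of the input
    x /= np.std(x[:, 0]) * 10000                   # first column gets std 1e-4
    x += rng.standard_normal(x.shape) * jitter     # break exact overlaps
    return x

# Made-up initialization in which the first two points overlap exactly.
init = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]])
out = rescale_with_jitter(init, random_state=0)
print(out)
```

After the call, the first two rows of `out` are no longer identical, while the overall scale of the initialization stays close to what the rescaling alone would give.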