pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License
1.48k stars 165 forks source link

Improve PCA and spectral inits #225

Closed dkobak closed 1 year ago

dkobak commented 1 year ago

A tiny PR that would fix #224, #223, and #180. I am not insisting on any of these changes, but thought I make the PR to make it easier to discuss.

dkobak commented 1 year ago

A lot of tests fail because we would need to fix random seeds over there. I'm not doing it for now -- let's discuss first.

pavlin-policar commented 1 year ago

I think adding a tiny amount of noise to each initialization is fine if it solves all these issues. I'm used to calling this jitter, so that's what I'll call it. However, I think the user should be able to turn this off. For instance, we could add a general parameter TSNE(..., jitter_initialization=True) so that the user can still turn this off. I would also put this into its own separate function e.g.

def jitter(x, scale=0.01, random_state):
    std = np.std(x[:, 0])
    target_std = std * scale
    return x + random_state.normal(0, std, x.shape)

The we can call this pca, spectral, and any other initializations we have. I definitely wouldn't add this to the rescale function as that could be very confusing having rescale also actually change our initialization. We can turn this off by default for precomputed embeddings, but this might be a pain, because then we'd again need to implement something like jitter_initialization="auto" so we can figure out if we need to jitter or not. Maybe it would be easier to just enable jittering by default for all initializations, and if the user wants, they can turn this off manually. We'd need to document this behaviour in the initialization parameters.

What do you think?

dkobak commented 1 year ago

Sure, let me move jitter out of rescale() into jitter().

I also agree that there should be a way to switch it off. What about adding add_jitter=True parameter to pca() and to spectral()? Then the user would be able to switch jitter off by manually calling e.g. pca(add_jitter=False).

I am not so sure about having the global parameter like TSNE(..., jitter_initialization=True), because indeed I think it would make more sense not to add jitter to a user-provided initialization.

pavlin-policar commented 1 year ago

What about adding add_jitter=True parameter to pca() and to spectral()? Then the user would be able to switch jitter off by manually calling e.g. pca(add_jitter=False).

Yes, this is exactly what I had in mind. We should add this to both pca and spectral. The other two -- weighted_mean and median can only be used when adding new points to an embedding, and they ignore one another, so it doesn't make any difference if we jitter those, so I wouldn't add the jitter parameter to those.

I am not so sure about having the global parameter like TSNE(..., jitter_initialization=True), because indeed I think it would make more sense not to add jitter to a user-provided initialization.

So you'd want to have jitter on by default for pca and spectral, but have it off for precomputed? The issue I see with that is that the user then can't use standard pca without jitter without having to precompute the initialization, then passing it through the constructor.

It might be fine though. The TSNE constructor has enough parameters as it is, and nobody will notice this in practice, since the added noise should be very small. Okay, I think this is fine for now then, and we'll add the parameter later, if we find any need for it.

dkobak commented 1 year ago

Agree. I refactored the jitter code, and also added a test that checks that our spectral init coincides with sklearn spectral embedding. Luckily, it passes :-)

I also had to fix random seeds in some existing tests that used PCA/spectral initializations.

pavlin-policar commented 1 year ago

Great, thanks! I'll close the related issues.