Closed — dkobak closed this issue 1 year ago
@dkobak don't want to hijack this issue so feel free to contact me directly, but I am curious what the example is. Also, with shift-invert solvers (e.g. `eigsh` with a `sigma` shift), if the shift value matches an eigenvalue then it can be skipped. I can't say that ARPACK's version suffers from that, but it might be worth checking.

Hi James, thanks for the interest. This is a tiny graph with very pronounced geometric structure: https://sparse.tamu.edu/HB/can_96. The structure is a ring: https://sparse-files-images.engr.tamu.edu/HB/can_96_graph.gif
Here is the entire code to reproduce the issue (download and unpack the Matrix Market data from the link above):
%matplotlib notebook
import numpy as np
import pylab as plt
from scipy.io import mmread
adjacency = mmread('/home/dmitry/Downloads/can_96.mtx').toarray()
affinity = adjacency / np.sum(adjacency, axis=1, keepdims=True)
affinity = (affinity + affinity.T) / 2
affinity /= np.sum(affinity)
from openTSNE.initialization import spectral
Z1 = spectral(affinity, tol=0)
from sklearn.manifold import SpectralEmbedding
Z2 = SpectralEmbedding(affinity='precomputed').fit_transform(affinity)
plt.figure(figsize=(8, 4), layout='constrained')
plt.subplot(121)
plt.title('openTSNE')
plt.scatter(Z1[:,0], Z1[:,1])
plt.subplot(122)
plt.title('sklearn')
plt.scatter(Z2[:,0], Z2[:,1])
plt.savefig('spectral.png', dpi=200)
`SpectralEmbedding` is run with default parameters, so I assume it's using the ARPACK solver via `eigsh`. I have not checked the eigenvalues or anything else yet.
Wow, right after posting the above comment, I found out that openTSNE's result only arises due to the particular `v0`:
v0 = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
eigvals, eigvecs = sp.linalg.eigsh(
A, M=D, k=k, tol=tol, maxiter=max_iter, which="LM", v0=v0
)
If I simply remove this `v0`, then the result looks identical to sklearn's! So it seems that this `v0` choice not only leads to non-reproducible results, but also screws up the eigenvector calculation!!
Sklearn uses this: https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef5ee2a8aea80498388690e2213118efd/sklearn/utils/_arpack.py#L4
v0 = random_state.uniform(-1, 1, size)
so openTSNE should probably do the same.
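For illustration, here is a minimal sketch of that sklearn-style start vector (the helper name and toy matrix are mine, not openTSNE or sklearn code): seeding the uniform draw makes `eigsh` fully deterministic across runs without biasing it toward any particular eigenvector.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def make_v0(size, seed=0):
    # Mirrors the idea of sklearn's _init_arpack_v0: a seeded
    # uniform(-1, 1) start vector, deterministic but "generic".
    rng = np.random.RandomState(seed)
    return rng.uniform(-1, 1, size)

# Toy symmetric matrix; same seed -> same v0 -> identical eigsh output.
A = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
vals1, _ = eigsh(A, k=2, which="LM", v0=make_v0(5))
vals2, _ = eigsh(A, k=2, which="LM", v0=make_v0(5))
```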
Or we could simply call `SpectralEmbedding(affinity='precomputed', eigen_tol=tol).fit_transform(A)`.
More details on this:
D = np.diag(np.ravel(np.sum(affinity, axis=1)))
from scipy.sparse.linalg import eigsh
v0 = np.ones(affinity.shape[0]) / np.sqrt(affinity.shape[0])
eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM", v0=v0)
print(eigvals)
eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM")
print(eigvals)
results in
[0.90591439 0.94925302 1. ]
[0.94925302 0.94925302 1. ]
Last comment, promise: using `sigma=1.0` in the `eigsh` call (following sklearn) speeds up the calculation by 3x (for this particular graph):
%time eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM")
%time eigvals, _ = eigsh(affinity, M=D, k=3, tol=0, which="LM", sigma=1.0)
So if we keep our own implementation, then using `sigma=1.0` may be a good idea.
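For intuition on what `sigma` does, here is a toy example of scipy's shift-invert mode (synthetic matrix, not the openTSNE code): with `sigma` set, ARPACK iterates with the factorized `(A - sigma*I)^-1`, so `which="LM"` then targets the eigenvalues of `A` closest to the shift, which typically converges in far fewer iterations when the wanted eigenvalues cluster near it.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

# Symmetric matrix with known spectrum {0.1, 0.5, 0.9, 0.95, 0.99}.
A = np.diag([0.1, 0.5, 0.9, 0.95, 0.99])

# Shift-invert around sigma=1.0: with which="LM", ARPACK looks for the
# largest eigenvalues of (A - sigma*I)^{-1}, i.e. the eigenvalues of A
# closest to 1.0. (sigma should not coincide exactly with an eigenvalue,
# or the factorization becomes singular.)
vals, _ = eigsh(A, k=3, sigma=1.0, which="LM")
```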
Thanks for all those details Dmitry. I had a feeling it would be something weird with a very structured graph. It is indeed skipping an eigenvector for some choices of `v0`, presumably because the eigenvalues are degenerate?
In this case, the result seems to be exceptionally sensitive to `v0` in a way I don't understand. The `init.spectral` `v0` should be a good guess, because the vector of 1s is a (generalized) eigenvector (the one that gets discarded). Using `np.ones` directly rather than scaling that vector to length 1 gives the correct result -- I'd be curious to know if that also works for you, Dmitry. I am seeing similarly odd behavior when using `eigsh` with the symmetrized graph Laplacian directly as well.
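One way to see why starting exactly on an eigenvector is delicate (a toy sketch, not the openTSNE code): if `v0` is an eigenvector, every Krylov vector `A @ v0`, `A @ A @ v0`, ... stays on the same line, so the matrix-vector products alone give ARPACK no information about the rest of the spectrum; it can only escape via numerical noise and restarts.

```python
import numpy as np

# The all-ones matrix: the constant vector is an eigenvector
# (eigenvalue n); all other eigenvalues are 0.
n = 4
A = np.ones((n, n))

v0 = np.ones(n) / np.sqrt(n)   # unit-norm constant start vector
w = A @ v0
cos_const = abs(w @ v0) / (np.linalg.norm(w) * np.linalg.norm(v0))
# cos_const == 1: the Krylov space {v0, A v0} collapses to one dimension.

rng = np.random.RandomState(0)
r = rng.uniform(-1, 1, n)      # a random start vector instead
wr = A @ r
cos_rand = abs(wr @ r) / (np.linalg.norm(wr) * np.linalg.norm(r))
# cos_rand < 1: a random start immediately explores a second direction.
```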
I have never had good luck with setting `sigma`. In my experience with it, in combination with sparse data, it seems to cause the routine to hang forever. Have you tried it with something like MNIST?
Using np.ones directly rather than scaling that vector to length 1 gives the correct result -- I'd be curious to know if that also works for you Dmitry.
Yes it does! Weird. Sklearn does mention here https://github.com/scikit-learn/scikit-learn/blob/dc580a8ef5ee2a8aea80498388690e2213118efd/sklearn/utils/_arpack.py#L7 that the scale of the initialization may be important... However, at this point I do not trust this approach and would rather have `v0` initialized randomly.
I have never had good luck with setting sigma. In my experience with it, in combination with sparse data, it seems to cause the routine to hang forever. Have you tried it with something like MNIST?
Just tried now, by generating the standard t-SNE affinity matrix for MNIST (so 90 neighbors for each point).
eigsh(affinity, M=D, k=3, tol=0, which="LM")
eigsh(affinity, M=D, k=3, tol=0, which="LM", sigma=1.0)
SpectralEmbedding(affinity='precomputed').fit_transform(affinity)
The first call took 11 s. The other two did indeed hang forever and I stopped both of them after a few minutes.
However, using the pyamg solver
SpectralEmbedding(affinity='precomputed', eigen_solver='amg').fit_transform(affinity)
also took 11 s. This may be due to a looser tolerance though, as the docs mention that the `amg` solver sets its own tolerance above 0: https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html.
Just noticed that setting the tolerance for the ARPACK solver in `SpectralEmbedding` is only possible since version 1.2.
Okay, so if we don't want to depend on sklearn 1.2, then it seems that maybe the best course of action is simply to kick out `v0` and keep everything else as is, including NOT using `sigma=1.0`. Thoughts?
On a technical basis, I suspect that in most real-world cases the current choice of `v0` is good, and it will probably be a small performance pessimization to use a random `v0`. I am not sure many people are going to hit this problem. If they do, they are probably researchers like yourself, so the current workaround (initialize manually and hence have control over the spectral settings) might be ok?
Do you have any evidence that the current choice of `v0` is helpful and that a random `v0` would lead to slower convergence? I don't see any difference on MNIST affinities.
With `initialization.spectral` specifically, no. But with `eigsh` on the shifted symmetric graph Laplacian, I have generally found that guessing the lowest eigenvalue usually helps. However, it's not consistent.
It seems to me that if sklearn doesn't have these pitfalls, and produces the same results as our implementation, it would make more sense to just switch over to their implementation. I don't really know any of the details of the numerical solvers and it could be a real hassle to maintain if we find similar bugs to this in the future. What do you think?
That was my original thinking here; however, as @jlmelville pointed out, the sklearn implementation can be VERY SLOW unless one either sets some non-zero `tol` (which is only available from sklearn 1.2, and I have not tested the runtime of this yet), or uses the `amg` solver (which is only available if `pyamg` is installed, and apparently does not allow `tol=0` at all, as it sets its own tolerance).
So the question is if you want to depend on either sklearn 1.2 or pyamg.
I also like the simplicity of simply running `eigsh(affinity, M=D, k=3, tol=tol, which="LM")`, which seems to be much faster than sklearn, allows using `tol=0` or `tol>0`, and is only one line of code... It was a bit of a pain to figure out what exactly sklearn is computing, mathematically, as it partially goes through scipy etc.
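As a toy illustration of that one-liner (synthetic affinities, not the can_96 data): for a symmetric affinity matrix `W` with degree matrix `D`, the generalized problem `W v = lambda D v` always has `lambda = 1` with a constant eigenvector (the one that gets discarded), and the next eigenvectors provide the embedding.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

rng = np.random.RandomState(0)
# Symmetric toy affinity matrix with positive entries.
W = rng.uniform(0.1, 1.0, size=(20, 20))
W = (W + W.T) / 2
D = np.diag(W.sum(axis=1))

# Generalized eigenproblem W v = lambda D v (the same one-liner,
# just on synthetic data).
eigvals, eigvecs = eigsh(W, M=D, k=3, which="LM")

top = eigvecs[:, np.argmax(eigvals)]  # eigenvector for lambda = 1
```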
Okay, that's fine by me. I'm not going to pretend I know what this does, but I'm sure you've tested it out thoroughly and I'll leave it to your judgement :) Maybe it would be best to open up a PR with this change so we can get this merged ASAP.
It is part of https://github.com/pavlin-policar/openTSNE/pull/225.
But just to clarify, the only edit is that it replaces
v0 = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
with
# v0 initialization is taken from sklearn.utils._arpack._init_arpack_v0()
random_state = check_random_state(random_state)
v0 = random_state.uniform(-1, 1, A.shape[0])
before running
eigvals, eigvecs = sp.linalg.eigsh(
A, M=D, k=k, tol=tol, maxiter=max_iter, which="LM", v0=v0
)
So the only edit is in `v0`. The way we had it before led to non-reproducible results and other weird behaviour.
A student in our lab is currently looking into spectral initialization, and she found out that `openTSNE.init.spectral(tol=0)` in some cases does not agree with `sklearn.manifold.SpectralEmbedding(affinity='precomputed')`. In some cases it does agree perfectly or near-perfectly, but we have an example where the result is very different, and `SpectralEmbedding` gives what seems like a more meaningful result.

I looked at the math, and it seems that they should conceptually be computing the same thing (`SpectralEmbedding` finds eigenvectors of L_sym, whereas `init.spectral` finds generalized eigenvectors of W and D, but that should be equivalent, as per https://jlmelville.github.io/smallvis/spectral.html @jlmelville). We don't know what the difference is due to. It may be numerical. However, conceptually, it seems sensible if openTSNE would simply outsource the computation to sklearn.
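That equivalence is straightforward to check numerically (a self-contained sketch on a synthetic matrix, not code from either library): if `W v = lambda D v`, then `u = D^{1/2} v` is an eigenvector of `D^{-1/2} W D^{-1/2} = I - L_sym` with the same `lambda`.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.RandomState(0)
W = rng.uniform(0.1, 1.0, size=(10, 10))
W = (W + W.T) / 2          # symmetric toy affinity matrix
d = W.sum(axis=1)
D = np.diag(d)

# Generalized eigenproblem W v = lambda D v (what init.spectral solves).
lam_gen, V = eigh(W, D)

# Normalized matrix D^{-1/2} W D^{-1/2} = I - L_sym (what
# SpectralEmbedding effectively diagonalizes).
S = W / np.sqrt(np.outer(d, d))
lam_sym, U = np.linalg.eigh(S)

# Mapping between the two eigenvector sets: u = D^{1/2} v.
u = np.sqrt(d) * V[:, -1]
u /= np.linalg.norm(u)
```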
A related issue is that `init.spectral` is not reproducible and gives different results with each run. Apparently the way we initialize `v0` makes ARPACK still have some randomness. Sklearn gets around this by initializing `v0` differently. I guess openTSNE should do the same -- but of course if we end up simply calling `SpectralEmbedding` then it won't matter.