Spectral initialization seems to get stuck sometimes

jlmelville commented 5 years ago

Sometimes the spectral initialization takes a long time; on some occasions, it's got so stuck I've had to terminate the calculation (which can sometimes require terminating the R session).

Probably the input matrix is very poorly conditioned, so that finding the smallest eigenvalues is an exercise in numerical futility.

Recent versions of UMAP detect connected components and initialize them separately, see e.g. https://github.com/lmcinnes/umap/blob/43cf1a820cea8d5b3218627d047dd78e4a152dd4/umap/spectral.py

This might solve the problem.

lmcinnes commented 5 years ago

Note that I also had some luck using different algorithms (LOBPCG) for some cases. I've managed to have some more detailed advice as well (on building suitable preconditioning matrices etc.) but mostly this has not been a big enough problem/bottleneck to go down that particular rabbit hole.

On a different front I am also looking into a bootstrapping style initialisation, or something similar to PAGA. That's an issue for further down the road however.

jlmelville commented 5 years ago

I had a look around for LOBPCG routines, but could not find an R implementation (although it seem to have been implemented everywhere else). PAGA does indeed look interesting, but I suspect I will be lucky if there's an R-accessible version of that too.

lmcinnes commented 5 years ago

I was expecting to have to implement my own version of PAGA style routines. That is among the reasons it is low on my priority list -- it's a lot of work.

jlmelville commented 5 years ago

@lmcinnes, I haven't had time to dig into the meta-embedding routines fully, but it looks like one of the code paths involves creating the Laplacian Eigenmap based on applying a Gaussian kernel to the distance matrix between the centroids of the connected components. The centroids seem to be taken from the original data itself: what happens with non-continuous features (e.g. 0/1 binary data)? Also, how do you see this playing into mixed data types: I assume you would use just the subset of X that applies to that type when calculating the centroid.

Do you consider this a key part of UMAP initialization and would feel strongly that UMAP was not being properly represented if I didn't implement this in uwot; or could I just bail out to random initialization when confronted with disconnected components?

I could try to justify my lazy approach by arguing that if the graph isn't fully connected, then a global view of the data can't be provided, and the user should increase n_neighbors, or use a different form of initialization, like the potential PAGA-style initialization of the future.

jlmelville commented 5 years ago

I've made a change so that disconnected components are initialized separately and the results merged, but with no attempt to stop them from overlapping. I squinted at the PAGA code in scanpy and it seems that it uses ForceAtlas2 or igraph to deal with positioning disconnected components. In turn, there doesn't seem to be anything particularly magic about their way of dealing with disconnections, so I am not going to purse that further for now.

For some problem cases I identified, the recursive initialization does help, but there are still some results where the spectral initialization takes way too long (several hours), so I need to take a closer look at RSpectra's parameters.

jlmelville commented 5 years ago

Although the shift-and-invert mode (which = "LM", sigma = 0) in RSpectra is "more stable" for finding small eigenvalues (according to the help text), for the default UMAP settings with MNIST, it takes about 24 minutes to complete one iteration, compared to less than one second with the more straight-forward which = "SM" setting.

However, under the circumstances where it doesn't misbehave, it can converge where using which = "SM", doesn't. So the new strategy for the spectral initialization is to try the "SM" method, and if that fails to converge, fall back to the "LM". In all my testing, this, combined with initializing connected components separately, prevents the hours-long initializations that I have previously come across.

That'll do for now.

lmcinnes commented 5 years ago

Sorry for being late getting back to you. The meta-embedding approach is not really viable for mixed data and other games (such as non-continuous data). A medoid rather than centroid based approach would be the best option there, although it certainly gets tricky with mixed data types.

In an ideal world I would like an approach something like this: Use graph clusters (could be connected components, could be Louvain, or something similar) an generate representatives for each cluster (centroids, medoids, something else again) and then embed the representatives using UMAP (which requires an exact UMAP designed for small data like you have in smallvis). Generate an initialisation from there based on an approximate UMAP or spetral for each component, scaled appropriately to avoid overlaps, and then run the optimization from there. Details to be worked out of course, and implementation-wise this can get non-trivial.

jlmelville commented 5 years ago

A fill UMAP on a reduced dataset of medioids is interesting. The only difficulty I see with it is deciding on a value for n_neighbors.

OuNao commented 5 years ago

Hi! Can you revisit this issue?

I´m using this wonderfull package to do dimension reduction on flow cytometry data. The data has about 90000 rows (cells) and 30 columns. The 31st column is a "label" column.

Using unsupervised uwot::umap to do dimension reduction on this data with init="spectral" I get the best separation of the populations. But, using the "label" column as y to do a supervised embedding the R get stuck on initialization. I can do supervised embedding with the other init options except spectral and normlaplacian (laplacian do works).

What do you think about this issue?

Thanks,

jlmelville commented 5 years ago

I will take another look, although the options are limited.

I can add a new init option that does what init = "spectral" does but using the irlba package instead of RSpectra. irlba describes the feature as "somewhat experimental, and may produce poor estimates for ill-conditioned matrices", but it might be good enough.

I'll update the issue when/if this feature is added.

jlmelville commented 5 years ago

@OuNao, as of commit https://github.com/jlmelville/uwot/commit/2336a9012c91d9663efc5d2537ce2c9960a8357a, I have added (an undocumented) option for init = "ispectral", which does the same as init = "spectral" but uses irlba instead of RSpectra.

It will almost certainly run a lot slower than RSpectra does under normal circumstances, but it might complete under conditions where RSpectra hangs, which might be an acceptable trade-off. On my machine, for MNIST and FMNIST (which have dimensions of N = 70,000 and p = 768), the current settings for init = "ispectra" take about 4 minutes to complete (compared to a few seconds with RSpectra).

If you are able to, can you let me know if installing the latest version of uwot from master and using init = "ispectra" with your dataset at least doesn't hang? To detect it's still doing stuff, it will output several lines to the console (even if you have set verbose = FALSE), that look like:

iter= 488, mprod= 6838, sv[3]=9.70e-03, %change=-2.46e-04, k=3

I've set it to try up to 1000 iterations. For some environments you might not see the output until it's entirely finished, but for example in RStudio, you should see each line appear as it's calculated, which should at least let you distinguish between whether it's hanging or just still grinding away.

OuNao commented 5 years ago

Thanks for the prompt response.

Yes, with the ispectral init option the initialization occurs with no crash, but slower.

On my old i3 notebook I get: ispectral 13.67415 mins laplacian 6.450384 mins

I will try some different training sets to know if the better population separation with the ispectral option is worth the loss of performance.

Thank you again.

jlmelville commented 5 years ago

Thank you for checking on that.

Possibly I can find a set of parameters with irlba that would make it acceptably fast, but as RSpectra takes only a few seconds (at least when it's working) and irlba is taking several minutes that could be a tough proposition.

Some ways forward would be to implement LOBPCG in R, or to do some digging in what might be causing RSpectra to be so slow. Both are long-term projects.

jlmelville commented 5 years ago

@OuNao, if you are still having problems with excessively long spectral initializations, I have pushed a new initialization option to master, init = "agspectral", which might help.

This approach is the same as init = "spectral", except all the entries in the graph are set to 1, and then a small number of zero entries in the graph are set to 0.1, to represent long-range affinities. Then the spectral layout is carried out as before.

The advantages are:

For all the datasets I've tried it with, the initialization completes in 2-3 seconds at most, including scenarios where the usual "spectral" initialization seems to hang.
For datasets where default settings lead to disconnected components, this approaches leads to a single fully connected graph, due to the addition of the random, lower-weighted non-neighbor edges. This should therefore include some global structure. It's always hard to tell if imaginary global structure is being added, but at least with the iris dataset, the layout was similar to that obtained from PCA.
Unlike the "spca" initialization, the initialization takes into account information from the graph, so some influence of the mixed metrics (and any y data in the supervised case) on the initial layout should be retained.

Results in many cases seemed to be quite similar to the "spectral" results (when they work), so it might be worth a shot. But bear in mind that here is no theoretical justification for this approach, although it was very loosely inspired by reading Randomized Near Neighbor Graphs, Giant Components, and Applications in Data Science and 2-D Embedding of Large and High-dimensional Data with Minimal Memory and Computational Time Requirements.

"agspectral" stands for "approximate global" spectral initialization. Suggestions for a better name would be welcome.

OuNao commented 5 years ago

@jlmelville ,

The new "agspectral" seems as fast and good for me as "spectral" (but with no crash in supervised embedding!!!).

Thank you again!

jlmelville / uwot

Spectral initialization seems to get stuck sometimes #16