lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

spectral_layout gets stuck on eigsh #76

Open andreas-kopecky opened 6 years ago

andreas-kopecky commented 6 years ago

On some datasets (in this case sentence embeddings, 25-dimensional vectors), spectral_layout gets stuck in eigsh.

Dataset shape in my case: (50000, 25)
Metric: cosine

I have been hunting this for a while now, and if you do not mind I will try to implement this differently, using truncated SVD instead of eigsh, and make a pull request. In any case, eigsh does not seem to raise "ArpackNoConvergence", which it should if the optimization gets stuck (I will raise this with the scipy community).

As a side note: this is not reproducible for all datasets; some work, some don't. Changing n_neighbors (i.e. altering the graph) solves the issue for some datasets but not for others. Random init works just fine on all datasets I have tried.

In any event, I really love UMAP and use it as a drop-in replacement for t-SNE and as the dimensionality reduction step for clustering, anomaly detection, and topological data analysis (i.e. as a lens for the MAPPER algorithm).
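
For readers hitting the same hang, the combination described above (a bounded eigensolver call, with the random initialization that reportedly always works as a fallback) might be sketched as follows. This is a minimal illustration, not UMAP's actual spectral_layout: the helper names `normalized_laplacian` and `guarded_spectral_init`, the iteration cap, and the range of the random fallback are all assumptions made for the example. It also relies on eigsh honoring `maxiter` and raising ArpackNoConvergence, which the report above says did not happen in the affected runs, so treat it as a guard rather than a fix.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh, ArpackNoConvergence


def normalized_laplacian(graph):
    """Symmetric normalized Laplacian I - D^-1/2 A D^-1/2 of a sparse graph."""
    degrees = np.asarray(graph.sum(axis=0)).ravel()
    inv_sqrt = np.zeros_like(degrees, dtype=float)
    nonzero = degrees > 0
    inv_sqrt[nonzero] = 1.0 / np.sqrt(degrees[nonzero])
    d = sp.diags(inv_sqrt)
    return sp.identity(graph.shape[0], format="csr") - d @ graph @ d


def guarded_spectral_init(graph, dim=2, maxiter=None, seed=0):
    """Try a spectral initialization; fall back to a random one on failure.

    Hypothetical helper: returns (coordinates, kind), where kind is
    "spectral" or "random".
    """
    n = graph.shape[0]
    laplacian = normalized_laplacian(graph)
    k = dim + 1  # one extra vector for the trivial (near-zero) eigenvalue
    if maxiter is None:
        maxiter = 5 * n  # assumed cap, chosen only for illustration
    try:
        vals, vecs = eigsh(laplacian, k=k, which="SM", tol=1e-4, maxiter=maxiter)
        order = np.argsort(vals)[1:k]  # skip the smallest eigenvalue
        return vecs[:, order], "spectral"
    except ArpackNoConvergence:
        # ARPACK gave up within maxiter: use a random layout instead.
        rng = np.random.default_rng(seed)
        return rng.uniform(-10.0, 10.0, size=(n, dim)), "random"


# Usage on a tiny path graph (0-1-2-...-39), purely for illustration.
n = 40
rows = np.arange(n - 1)
half = sp.coo_matrix((np.ones(n - 1), (rows, rows + 1)), shape=(n, n))
demo_graph = (half + half.T).tocsr()
init, kind = guarded_spectral_init(demo_graph, dim=2)
print(kind, init.shape)
```

In UMAP itself, the simpler route reported above is to skip the spectral step entirely by requesting a random initialization.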

lmcinnes commented 6 years ago

I have run into this very intermittently as well, and have struggled to track down the root cause (too small an eigengap, obviously, but why?). If you have an alternative approach that works, I will gladly accept a pull request. One thing to note is that the randomized approach to truncated SVD in sklearn can be ... problematic in some cases: it can produce very poor or unstable results. Using ARPACK as the solver fixes this, but then we may be back in the same situation.

Thanks for any help you have to offer on this!
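
The truncated-SVD idea discussed above could be sketched roughly as below. The sketch rests on an assumption that is not stated in the thread: for the symmetric normalized adjacency N = D^-1/2 A D^-1/2, the shifted matrix (I + N)/2 is positive semidefinite, so its largest singular vectors coincide with the smallest eigenvectors of the Laplacian I - N. It uses scipy's svds, which is itself ARPACK-backed by default, so it may well stall on the same inputs; the function name `svd_spectral_init` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import svds


def svd_spectral_init(graph, dim=2):
    """Return (embedding, laplacian_eigenvalues) from a truncated SVD.

    `graph` is assumed to be a symmetric, non-negative sparse adjacency
    matrix of a connected graph (so every degree is positive).
    """
    n = graph.shape[0]
    degrees = np.asarray(graph.sum(axis=0)).ravel()
    d = sp.diags(1.0 / np.sqrt(degrees))
    normalized = d @ graph @ d
    # Shift so all eigenvalues lie in [0, 1]; singular values then equal
    # eigenvalues, and the largest ones match the smallest Laplacian ones.
    shifted = (0.5 * (sp.identity(n, format="csr") + normalized)).tocsc()
    k = dim + 1
    u, s, _ = svds(shifted, k=k)
    order = np.argsort(s)[::-1]  # sort explicitly, largest singular value first
    u, s = u[:, order], s[order]
    eigenvalues = 2.0 * (1.0 - s)  # map back to Laplacian eigenvalues
    return u[:, 1:k], eigenvalues  # drop the trivial leading vector


# Usage on a tiny path graph, purely for illustration.
n = 40
rows = np.arange(n - 1)
half = sp.coo_matrix((np.ones(n - 1), (rows, rows + 1)), shape=(n, n))
demo_graph = (half + half.T).tocsr()
embedding, eigenvalues = svd_spectral_init(demo_graph, dim=2)
print(embedding.shape, eigenvalues)
```

Swapping svds for a randomized solver would be the variant the caution above applies to; this sketch does not evaluate it.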

andreas-kopecky commented 6 years ago

Thanks for the tip about sklearn; that saves me testing this one :) One thing I suspect and want to evaluate is whether the rank of the input data has anything to do with it. In my samples I definitely have collinearity (since these are sentence embeddings and some sentences occur multiple times), and I seem to have fewer problems if I use "unique" entries only. I will investigate this further, however. Thanks a lot for the fast reply and the confirmation that I am not crazy...
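
The "unique entries only" experiment mentioned above can be sketched with plain NumPy. The helper names `deduplicate_rows` and `expand_embedding` are hypothetical, the toy array stands in for real sentence embeddings, and only exact duplicates are removed; whether this actually avoids the eigsh hang is not established in this thread.

```python
import numpy as np


def deduplicate_rows(data):
    """Drop exact duplicate rows; `inverse` maps each original row to its unique row."""
    unique_rows, inverse = np.unique(data, axis=0, return_inverse=True)
    # ravel() guards against NumPy versions that return a 2-D inverse here.
    return unique_rows, np.asarray(inverse).ravel()


def expand_embedding(unique_embedding, inverse):
    """Copy each unique row's embedding back to every original (duplicate) row."""
    return unique_embedding[inverse]


# Toy data: four rows, one of them repeated, standing in for repeated sentences.
data = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
unique_rows, inverse = deduplicate_rows(data)
print(data.shape[0] - unique_rows.shape[0], "duplicate rows removed")
print("rank of data:", np.linalg.matrix_rank(data))
```

The embedding would then be fitted on `unique_rows` only, and `expand_embedding` would map the result back so that every original row, duplicates included, gets coordinates.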

lmcinnes commented 6 years ago

You are definitely not crazy; the SVD approach may still be fine even with the randomized solver. I'm just wary, as I have had issues with it before in some cases.
