lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

UserWarning: WARNING: spectral initialisation failed! The eigenvector solver failed. This is likely due to too small an eigengap. Consider adding some noise or jitter to your data. Falling back to random initialisation! warn( #895

Closed AnantshreeChandola closed 2 years ago

AnantshreeChandola commented 2 years ago

While creating UMAP embeddings for HDBSCAN clustering, I am getting this user warning:

`UserWarning: WARNING: spectral initialisation failed! The eigenvector solver failed. This is likely due to too small an eigengap. Consider adding some noise or jitter to your data. Falling back to random initialisation!`

The clusters created from these embeddings include multiple duplicate clusters. Why could this be happening? I need clarification.

lmcinnes commented 2 years ago

This is a warning that there is something odd about your data (possibly a lot of duplicates?) and part of the processing failed. It will fall back to other methods, so it will still run, but is not ideal. It is likely worth looking at your data via other means, checking for duplicates, using unique=True in UMAP, etc. to see if there is anything strange in your data.


AnantshreeChandola commented 2 years ago

When I ran UMAP with the parameter `unique` set to True, I got no duplicate clusters from those embeddings. Also, if I run my clustering algorithm directly on the initial vectors (without any dimension reduction using UMAP), I get no duplicate clusters. Duplicate clusters only appear when I use UMAP with n_neighbors = 15, min_dist = 0.0, and no other parameters. I would like to understand what the fallback method does when this user warning is generated, so I can understand what is happening in the backend that causes the duplicate clusters.

lmcinnes commented 2 years ago

UMAP generates a graph with weighted edges where the edge weights relate to relative distances of neighboring points. You can see the UMAP documentation for more detail on this process. The initialization that is failing is an attempt to find eigenvectors of the (symmetric) Laplacian of that graph. The usual technique for this is to use a power method approach as used by scipy. Scipy, in turn, relies on ARPACK for this. In practical terms, all UMAP sees is that ARPACK returns an error when attempting to calculate the eigenvectors of the Laplacian. Usually this is related to poor convergence due to a very small gap between eigenvalues, so the power method fails to separate out the eigenvectors easily. That may or may not be the actual cause in your case; you would have to interrogate the specifics of the ArpackError to be sure.

UMAP does save the graph in the graph_ attribute, so you can actually walk through the code of the spectral layout and catch the actual error and see if that provides more information if you wish.
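
As a rough illustration of that suggestion, the sketch below attempts the eigen-decomposition on a symmetric weighted graph and reports the ARPACK error instead of swallowing it. It is not UMAP's own spectral layout code: the helper name `diagnose_spectral_init`, the solver settings (`which="SM"`, `tol`, `maxiter`), and the returned dictionary are assumptions chosen for illustration. It also counts connected components, since a graph that splits into many pieces is one plausible reason the smallest eigenvalues sit very close together.

```python
import numpy as np
import scipy.sparse
from scipy.sparse.csgraph import connected_components, laplacian
from scipy.sparse.linalg import ArpackError, eigsh


def diagnose_spectral_init(graph, dim=2):
    """Hypothetical helper: try the Laplacian eigen-solve and report the outcome."""
    graph = scipy.sparse.csr_matrix(graph, dtype=np.float64)

    # Number of disconnected pieces in the (undirected) graph.
    n_components, _ = connected_components(graph, directed=False)

    # Symmetric normalized Laplacian of the weighted graph.
    lap = laplacian(graph, normed=True)

    try:
        # Ask for the dim + 1 smallest eigenvalues.
        # Solver settings here are assumptions, not necessarily UMAP's.
        eigenvalues, _ = eigsh(
            lap, k=dim + 1, which="SM", tol=1e-4, maxiter=graph.shape[0] * 5
        )
    except ArpackError as err:
        # ArpackNoConvergence is a subclass, so it is caught here as well.
        return {"n_components": n_components, "error": repr(err), "eigenvalues": None}

    return {
        "n_components": n_components,
        "error": None,
        "eigenvalues": np.sort(eigenvalues),
    }
```

With a fitted model `reducer`, calling `diagnose_spectral_init(reducer.graph_, dim=2)` would show either the smallest eigenvalues (so the gaps between them can be inspected) or the text of the ARPACK error.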

AnantshreeChandola commented 2 years ago

Found the issue. My dataset has a lot of identical (duplicate) messages, which is why I am getting this warning, and this is causing major issues in creating embeddings. It is bad enough that identical messages ended up in different clusters (around 77 extra clusters were formed for one single message). With the other embeddings I used, I assume all these messages were grouped into a single cluster (and because I store them in a set, I did not see any duplication there). Now, my question: I want to use UMAP embeddings, because the performance is really great on other datasets, but I also don't want bad results on data with this many duplicates. Is there a way to improve this, given that I do not want to preprocess the data and delete all duplicates (I only want to remove duplicates from the final clusters)? Also, thanks a lot for replying so quickly to this query.

lmcinnes commented 2 years ago

I think the unique=True option is the best approach; it will find duplicates, remove them for the purpose of learning the embedding, and then place them down exactly as duplicates of their corresponding point in the embedding. So you can run UMAP with duplicates, but not break things -- that was the intended use case.
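
To make the idea concrete, here is a minimal sketch of the deduplicate, embed, then re-expand pattern described above, written with NumPy only. It is an illustration of the concept, not UMAP's internal implementation; `embed_with_duplicates` and `embed_fn` are hypothetical names, and `embed_fn` stands in for any function that maps rows to low-dimensional coordinates.

```python
import numpy as np


def embed_with_duplicates(X, embed_fn):
    """Hypothetical helper: embed unique rows only, then copy results to duplicates."""
    X = np.asarray(X)

    # unique_rows holds each distinct row once; inverse maps every
    # original row to the index of its representative in unique_rows.
    unique_rows, inverse = np.unique(X, axis=0, return_inverse=True)
    inverse = np.ravel(inverse)  # guard against shape differences across NumPy versions

    # Learn the embedding on the de-duplicated data only.
    unique_embedding = np.asarray(embed_fn(unique_rows))

    # Place every duplicate exactly on top of its representative.
    return unique_embedding[inverse]
```

In practice none of this is needed with UMAP itself: passing `unique=True` to `umap.UMAP(...)` and calling `fit_transform` on the full dataset, duplicates included, is the intended route.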

Of course, if you want the fact that you have a lot of duplicates to matter for the learned embedding, the other simple approach is to add a small amount of noise to the whole dataset (such that the scale of the noise is smaller than most variation among non-duplicated samples).
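
One way the noise approach could look is sketched below. The scaling rule (Gaussian noise at a small fraction of each feature's standard deviation) and the names `add_jitter` and `relative_scale` are assumptions for illustration, not a recommendation from the UMAP documentation; the right scale depends on how much genuine variation the non-duplicated samples have.

```python
import numpy as np


def add_jitter(X, relative_scale=1e-3, seed=0):
    """Hypothetical helper: add small per-feature Gaussian noise to break exact ties."""
    X = np.asarray(X, dtype=np.float64)
    rng = np.random.default_rng(seed)

    # Noise scale per feature: a small fraction of that feature's spread.
    # Features with zero spread get zero noise under this heuristic.
    feature_std = X.std(axis=0)

    noise = rng.normal(loc=0.0, scale=relative_scale * feature_std, size=X.shape)
    return X + noise
```

The jittered array can then be passed to UMAP in place of the original data, so duplicated samples still contribute their weight to the embedding without being exactly identical.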

AnantshreeChandola commented 2 years ago

Okay, Thanks a lot for all the insights.