lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Inquiry on Utilizing UMAP for Text Similarity and Clustering #1113

Open dsdanielpark opened 7 months ago

dsdanielpark commented 7 months ago

Hello,

I would like to express my sincere appreciation for your passionate communication and efficient package management. I have reviewed the documentation and code related to the use of UMAP, but find myself in need of expert advice.

My intention is to use UMAP for clustering and for measuring the similarity between arrays of sentence embeddings. The data has no labels, and I have several questions about this process. Additionally, I would like to reason about how the text-similarity results compare to the outcomes UMAP produces.

  1. I am curious whether UMAP could guarantee better results compared to DBSCAN.
  2. In the absence of labels, do you have any advice on efficiently setting the dimensions and point distances (args) in UMAP?

Any keywords, references, or preliminary answers you could provide would be greatly appreciated.

Thank you once again for your wonderful project.

lmcinnes commented 7 months ago

Thank you for the kind words.

Mostly what UMAP will buy you over using DBSCAN directly on the embedding vectors is a lot more of your data clustered while still having reasonably fine-grained clusters. Can I guarantee better results? I think there are no guarantees, especially in unsupervised learning. Would I expect better results if you use UMAP first and then DBSCAN or HDBSCAN? Yes, I definitely would.

Choosing parameters is always going to come down to the data you have, the kinds of results you want to get, and what you are going to use the clustering for from there. Some rules of thumb:

  - n_components=5 is a good starting point for clustering. It is enough dimensions that UMAP has a much easier time resolving tangles etc. in the optimization, but still pretty low. I would not choose n_components larger than n_neighbors (or really larger than 20, even if you have a very large n_neighbors).
  - The choice of n_neighbors is going to strongly influence the granularity of the clustering. The smaller the value, the more fine-grained the resolution of clusters you'll tend to get out (assuming DBSCAN or HDBSCAN for clustering the UMAP output).
  - As for metric, the usual choice for sentence embeddings is "cosine". If you want to try something a little different, then import pynndescent and use pynndescent.distances.alternative_cosine, which is a small tweak on cosine distance that may work better for your use case with UMAP.

dsdanielpark commented 7 months ago

Thank you for your kind response. I'll start as you suggested!

Can UMAP be updated in batches? Is it possible to create a UMAP model on a large image dataset and then train it further? It seems impossible given UMAP's mechanics, but I wonder how difficult it would be to implement such a feature.

lmcinnes commented 7 months ago

I think for that use case you might want to look into ParametricUMAP. UMAP does have an update method, but it is definitely not the same as training on the full dataset.

dsdanielpark commented 7 months ago

Thank you for your response! I will check out Parametric UMAP!