Open dsdanielpark opened 7 months ago
Thank you for the kind words.
Mostly what UMAP will buy you over using DBSCAN directly on the embedding vectors is a lot more of your data clustered while still having reasonably fine-grained clusters. Can I guarantee better results? I think there are no guarantees, especially in unsupervised learning. Would I expect better results if you use UMAP first and then DBSCAN or HDBSCAN? Yes, I definitely would.
Choosing parameters is always going to come down to the data you have, the kinds of results you want to get, and what you are going to use the clustering for from there. Some rules of thumb: n_components=5 is a good starting point for clustering. It is enough dimensions that UMAP has a much easier time resolving tangles etc. in the optimization, but still pretty low. I would not choose n_components larger than n_neighbors (or really larger than 20 even if you have a very large n_neighbors). The choice of n_neighbors is going to strongly influence the granularity of the clustering. The smaller the value the more fine-grained the resolution of clusters you'll tend to get out (assuming DBSCAN or HDBSCAN for clustering the UMAP output). As for metric, the usual choice for sentence embeddings is "cosine"; if you want to try something a little different then import pynndescent and use pynndescent.distances.alternative_cosine which is a small tweak on cosine distance that may work better for your use case with UMAP.
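As a rough illustration of the rules of thumb above, here is a minimal sketch of UMAP followed by DBSCAN. The helper names (umap_clustering_params, reduce_and_cluster) are hypothetical, clamping n_components is just one way to encode "not larger than n_neighbors, and not larger than 20", and the eps and min_samples values are placeholders you would need to tune for your data.

```python
def umap_clustering_params(n_neighbors=15, n_components=5):
    """Build UMAP keyword arguments following the rules of thumb above.

    Hypothetical helper: it clamps n_components so that it is never larger
    than n_neighbors and never larger than 20.
    """
    n_components = min(n_components, n_neighbors, 20)
    return {
        "n_neighbors": n_neighbors,    # smaller values give finer-grained clusters
        "n_components": n_components,  # 5 is the suggested starting point
        "metric": "cosine",            # usual choice for sentence embeddings
    }


def reduce_and_cluster(embeddings, n_neighbors=15, eps=0.5, min_samples=5):
    """Reduce embeddings with UMAP, then cluster the output with DBSCAN.

    eps and min_samples are placeholder values (assumptions), not
    recommendations; tune them for your data.
    """
    # Imported inside the function so the parameter helper above can be
    # used without umap-learn and scikit-learn installed.
    import umap
    from sklearn.cluster import DBSCAN

    reducer = umap.UMAP(**umap_clustering_params(n_neighbors))
    reduced = reducer.fit_transform(embeddings)
    # DBSCAN labels points it considers noise as -1.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(reduced)
    return reduced, labels
```

Calling reduce_and_cluster(embeddings) on an array of shape (n_sentences, embedding_dim) returns the reduced coordinates and one cluster label per sentence. Swapping DBSCAN for HDBSCAN should only require changing the clustering line.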
Thank you for your kind response. I'll start as you suggested!
Can UMAP be updated in batches? Is it possible to create a UMAP model for large images and further train it? It seems impossible due to UMAP's mechanics, but I wonder if implementing this feature would be difficult.
I think for that use case you might want to look into ParametricUMAP. UMAP does have an update method, but it is definitely not the same as training on the full dataset.
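For reference, a minimal sketch of what batch-wise use of the update method might look like. The helper names (iter_batches, fit_incrementally) are hypothetical, and as noted above the result should not be expected to match a model fit on the full dataset at once.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of `data` with at most `batch_size` rows."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def fit_incrementally(data, batch_size, **umap_kwargs):
    """Fit UMAP on the first batch, then feed the remaining batches to update().

    Hypothetical helper; `data` is assumed to be a non-empty array with one
    row per sample.
    """
    # Imported inside the function so iter_batches can be used on its own.
    import umap

    batches = iter_batches(data, batch_size)
    reducer = umap.UMAP(**umap_kwargs)
    reducer.fit(next(batches))   # initial fit on the first batch
    for batch in batches:
        reducer.update(batch)    # incorporate each later batch
    return reducer
```

If the goal is a model that can keep being trained on new data, ParametricUMAP is probably the better place to start.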
Thank you for your response! I will check out ParametricUMAP!
Hello,
I would like to express my sincere appreciation for your passionate communication and efficient package management. I have reviewed the documentation and code related to the use of UMAP, but find myself in need of expert advice.
My intention is to use UMAP for clustering and measuring the similarity between arrays of sentence embeddings. There are no labels associated with this data, and I have several questions about this process. Additionally, I wish to logically discuss how the text similarity results compare to the outcomes provided by UMAP.
Any keywords, references, or preliminary answers you could provide would be greatly appreciated.
Thank you once again for your wonderful project.