joaorulff opened 2 years ago
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
Also interested in this 👍
I'm trying to use cuML's UMAP implementation with BERTopic (which makes use of the semi-supervised feature from the original umap-learn implementation), but the results I'm getting suggest that this implementation doesn't work the same way. It even seems like the target labels are ignored.
I'm trying to use cuML's UMAP implementation with BERTopic (which makes use of the semi-supervised feature from the original umap-learn implementation)
You may be interested in this RAPIDS BERTopic project: https://github.com/rapidsai/rapids-examples/tree/main/cuBERT_topic_modelling cc @VibhuJawa
We are currently not doing guided modelling with seeds, so we don't use semi-supervised UMAP at the moment. That said, you should still be able to pass y as a label array, similar to how BERTopic does here (code linked below). Please let us know if you can't get it to work.
but the results I'm getting suggest that this implementation doesn't work the same way.
I think the delta that you saw might have been due to BERTopic using cosine distance, while UMAP's default is Euclidean. To do the same with RAPIDS, follow the example below (or see cuBERT_topic_modelling).
An example of using cuML's guided UMAP could look like the following:
import cupy as cp
import torch
from cuml.neighbors import NearestNeighbors
from cuml.manifold import UMAP

# Extract embeddings (can be done via your favorite transformer)
embeddings = create_embeddings(....)
embeddings = cp.fromDlpack(torch.utils.dlpack.to_dlpack(embeddings))

# Build a cosine-distance kNN graph and hand it to UMAP,
# since UMAP's default metric is Euclidean
m_cos = NearestNeighbors(n_neighbors=15, metric="cosine")
m_cos.fit(embeddings)
knn_graph_cos = m_cos.kneighbors_graph(embeddings, mode="distance")

rapids_umap = UMAP(n_neighbors=15, n_components=5, min_dist=0.0)

# Setting random labels to verify that y is used; -1 marks unlabeled points
y = cp.random.randint(low=-1, high=100, size=len(embeddings))
umap_embeddings = rapids_umap.fit_transform(X=embeddings, y=y, knn_graph=knn_graph_cos)
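To illustrate what the precomputed graph above contains, here is a hedged, NumPy-only sketch of a brute-force cosine-distance kNN graph (dense, for small data; the `cosine_knn_graph` helper and the toy points are assumptions for illustration, not cuML API). The real `kneighbors_graph(..., mode="distance")` returns a sparse matrix, and the query point itself is typically counted among its own neighbors:

```python
import numpy as np

def cosine_knn_graph(X, n_neighbors):
    """Brute-force sketch of a cosine-distance kNN graph (illustrative only).

    Each row holds cosine distances to that point's n_neighbors nearest
    points (self included, at distance 0); all other entries stay 0.
    """
    # cosine distance = 1 - cosine similarity
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    dists = 1.0 - normed @ normed.T
    graph = np.zeros_like(dists)
    for i, row in enumerate(dists):
        idx = np.argsort(row)[:n_neighbors]  # n_neighbors smallest distances
        graph[i, idx] = row[idx]
    return graph

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
g = cosine_knn_graph(X, n_neighbors=2)
```

Passing such a graph via `knn_graph` is what lets UMAP work on cosine neighborhoods even though its internal default metric is Euclidean.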
Thank you so much for the details @VibhuJawa and @beckernick. I'll certainly try all these suggested approaches. I had already given up on using cosine distance with cuML's UMAP, and I'm so happy to realize I can do it this way 😊
You're right about that delta, though I'm starting to think what I experienced had to do with something else.
To clarify, does cuML's UMAP support -1 as a label for semi-supervised learning?
@danielperezr88,
Yes, UMAP should support -1 as a label for (categorical) semi-supervised learning. Here's where it's used in the code.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
For semi-supervised dimensionality reduction using UMAP, should I follow the same guidelines described here: Using Partial Labelling (Semi-Supervised UMAP)?
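For reference, the convention in umap-learn's partial-labelling guide is simply a label array where unlabeled points carry -1, the same sentinel discussed above. A minimal NumPy sketch of building such a partially-labelled y (the class count, masking fraction, and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Suppose we have full labels for 100 points across 5 classes
y_true = rng.integers(low=0, high=5, size=100)

# Keep labels for roughly 20% of the points; mark the rest as
# unlabeled with -1, the sentinel used for semi-supervised UMAP
y_semi = y_true.copy()
mask = rng.random(100) > 0.2
y_semi[mask] = -1

# y_semi can then be passed as `y` to UMAP's fit/fit_transform
```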