joaorulff opened 2 years ago
This issue has been labeled inactive-30d
due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d
if there is no activity in the next 60 days.
Also interested in this 👍
I'm trying to use cuML's UMAP implementation with BERTopic (which makes use of the semi-supervised feature from the original umap-learn implementation), but the results I'm getting suggest that this implementation doesn't work the same way. It even seems like the target labels are ignored.
I'm trying to use cuML's UMAP implementation with BERTopic (which makes use of the semi-supervised feature from the original umap-learn implementation)
You may be interested in this RAPIDS BERTopic project: https://github.com/rapidsai/rapids-examples/tree/main/cuBERT_topic_modelling cc @VibhuJawa
We are currently not doing guided modelling with seeds, so we don't use semi-supervised UMAP at the moment. That said, you should still be able to pass y as a label array, similar to how BERTopic does here (code linked below). Please let us know if you can't get it to work.
but the results I'm getting suggest that this implementation doesn't work the same way.
I think the delta that you saw might have been due to BERTopic using cosine distance, while UMAP's default is Euclidean. To do the same with RAPIDS, follow the example below (or see cuBERT_topic_modelling).
An example of using cuML's guided UMAP could look like the following:
import cupy as cp
import torch
from cuml.neighbors import NearestNeighbors
from cuml.manifold import UMAP

# Extract embeddings (can be done via your favorite transformer)
embeddings = create_embeddings(....)
embeddings = cp.fromDlpack(torch.utils.dlpack.to_dlpack(embeddings))

# Build a cosine-distance kNN graph and hand it to UMAP,
# since UMAP's default metric is Euclidean
m_cos = NearestNeighbors(n_neighbors=15, metric="cosine")
m_cos.fit(embeddings)
knn_graph_cos = m_cos.kneighbors_graph(embeddings, mode="distance")

rapids_umap = UMAP(n_neighbors=15, n_components=5, min_dist=0.0)

# Setting random labels to verify that y is used; -1 marks unlabeled points
y = cp.random.randint(low=-1, high=100, size=len(embeddings))
umap_embeddings = rapids_umap.fit_transform(X=embeddings, y=y, knn_graph=knn_graph_cos)
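To illustrate what the precomputed graph above contains, here is a hedged, NumPy-only sketch of a brute-force cosine-distance kNN graph (dense, for small data; the `cosine_knn_graph` helper and the toy points are assumptions for illustration, not cuML API). The real `kneighbors_graph(..., mode="distance")` returns a sparse matrix, and the query point itself is typically counted among its own neighbors:

```python
import numpy as np

def cosine_knn_graph(X, n_neighbors):
    """Brute-force sketch of a cosine-distance kNN graph (illustrative only).

    Each row holds cosine distances to that point's n_neighbors nearest
    points (self included, at distance 0); all other entries stay 0.
    """
    # cosine distance = 1 - cosine similarity
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    dists = 1.0 - normed @ normed.T
    graph = np.zeros_like(dists)
    for i, row in enumerate(dists):
        idx = np.argsort(row)[:n_neighbors]  # n_neighbors smallest distances
        graph[i, idx] = row[idx]
    return graph

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
g = cosine_knn_graph(X, n_neighbors=2)
```

Passing such a graph via `knn_graph` is what lets UMAP work on cosine neighborhoods even though its internal default metric is Euclidean.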
Thank you so much for the details @VibhuJawa and @beckernick. I'll certainly try all these suggested approaches. I had already given up on using cosine distance with cuML's UMAP, and I'm so happy to realize I can do it this way 😊
You're right about that delta, though I'm starting to think what I experienced had to do with something else.
To clarify, does cuML's UMAP support -1 as a label for semi-supervised learning?
@danielperezr88,
Yes, UMAP should support -1 as a label for (categorical) semi-supervised learning. Here's where it's used in the code.
This issue has been labeled inactive-90d
due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.
For semi-supervised dimensionality reduction using UMAP, should I follow the same guidelines described here: Using Partial Labelling (Semi-Supervised UMAP)?
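For reference, the convention in umap-learn's partial-labelling guide is simply a label array where unlabeled points carry -1, the same sentinel discussed above. A minimal NumPy sketch of building such a partially-labelled y (the class count, masking fraction, and seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Suppose we have full labels for 100 points across 5 classes
y_true = rng.integers(low=0, high=5, size=100)

# Keep labels for roughly 20% of the points; mark the rest as
# unlabeled with -1, the sentinel used for semi-supervised UMAP
y_semi = y_true.copy()
mask = rng.random(100) > 0.2
y_semi[mask] = -1

# y_semi can then be passed as `y` to UMAP's fit/fit_transform
```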