lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Tried Parametric UMAP, but its performance does not seem to be as good as the non-parametric one, even after training on large data of around 20k examples in a supervised fashion. #646

Open mayankgoyal1993 opened 3 years ago

mayankgoyal1993 commented 3 years ago

I tried Parametric UMAP, but its performance does not seem to be as good as the non-parametric one, even after training on a large dataset of around 20k examples in a supervised fashion. I am fully convinced that the parametric approach is the future, because it learns from previous data and therefore learns globally, much like pre-trained models enable transfer learning.

Can anyone help me figure out whether I am making any basic mistakes?

Problem: I am trying to perform text clustering using Sentence Transformers embeddings of 748 dimensions.

Method: I have supervised data of around 20,000 samples belonging to 1,200 labels. I perform PCA, then fit a supervised parametric UMAP (with the default encoder), and save the trained model.

But on test data of around 1,500 samples, using the saved embedder and calling fit and transform, the cluster assignments are not as good as with non-parametric UMAP.
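Roughly, the pipeline looks like this (a minimal sketch with synthetic stand-in data; the PCA size, save path, and variable names are placeholders, not my exact code):

```python
import numpy as np
from sklearn.decomposition import PCA
from umap.parametric_umap import ParametricUMAP, load_ParametricUMAP

# Synthetic stand-ins for the real sentence-transformer embeddings
rng = np.random.default_rng(42)
X_train = rng.normal(size=(20000, 748)).astype("float32")
y_train = rng.integers(0, 1200, size=20000)
X_test = rng.normal(size=(1500, 748)).astype("float32")

# Reduce dimensionality first (component count is a placeholder)
pca = PCA(n_components=100)
X_train_pca = pca.fit_transform(X_train)

# Supervised parametric UMAP with the default encoder, then save
embedder = ParametricUMAP()
embedder.fit_transform(X_train_pca, y=y_train)
embedder.save("umap_parametric_saved")

# Test time: same PCA, then the saved embedder with fit_transform
X_test_pca = pca.transform(X_test)
embedder = load_ParametricUMAP("umap_parametric_saved")
test_embeddings = embedder.fit_transform(X_test_pca)
```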

Can someone tell me if I am doing anything wrong?

mayankgoyal1993 commented 3 years ago

Should I not call fit_transform again, and instead call only transform, since it is test data?
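That is, something like this instead (reusing the names from the sketch above):

```python
from umap.parametric_umap import ParametricUMAP

# Learn the encoder on the training data only (labels give the supervision)
embedder = ParametricUMAP()
embedder.fit(X_train_pca, y=y_train)

# Test samples only pass through the trained encoder; nothing is refit
test_embeddings = embedder.transform(X_test_pca)
```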

timsainb commented 3 years ago

It might be worth trying a bigger encoder network, or a different architecture; the default encoder is pretty small. If the data is sequential, an RNN encoder might also be worth trying. There are several other ways the embeddings could end up looking different as well. If you want to post a Colab notebook, I'd be happy to take a look too.
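For example, ParametricUMAP accepts a custom Keras encoder, so something along these lines (layer sizes are only illustrative, and X/y stand in for your data and labels):

```python
import numpy as np
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Synthetic stand-in data
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 748)).astype("float32")
y = rng.integers(0, 50, size=2000)

dims = (748,)     # input dimensionality
n_components = 2

# A deeper/wider MLP than the small default encoder
encoder = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=dims),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(n_components),
])

embedder = ParametricUMAP(encoder=encoder, dims=dims)
embedding = embedder.fit_transform(X, y=y)
```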

mayankgoyal1993 commented 3 years ago

Actually, the 748-dimensional embeddings are already fine-tuned using Sentence Transformers. Do I still need a better encoder on top of that? I can share a Colab, but I cannot share the data. Does that work?

timsainb commented 3 years ago

That probably won't help, but if you can reproduce the issue on an open dataset of embeddings, maybe that would. I would try a better encoder. The parametric UMAP embedding is trained with the same loss as non-parametric UMAP, so the issue is that the encoder network isn't able to learn the mapping from data to embedding well enough. Using a bigger encoder will probably help.

mayankgoyal1993 commented 3 years ago

Okay, I will check with the company whether I can share the data anyway. Thanks for the clarification. Meanwhile, I also see that the parametric embeddings are different every time I run the Colab, even after setting the seed:

```python
import tensorflow as tf
from umap.parametric_umap import load_ParametricUMAP

tf.random.set_seed(42)

# Load the previously saved parametric embedder
embedder = load_ParametricUMAP(UMAP_PARAMETRIC_SAVED_PATH)
embedder.batch_size = 10
embedder.random_state = 42

# Refit on the new data and embed it
umap_embeddings = embedder.fit_transform(X=final_outputs_transformed_new)
print(umap_embeddings)
```

mayankgoyal1993 commented 3 years ago

I guess it is because of the GPU; on CPU it is reproducible. Sorry to bother you with that.
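In case anyone else hits this: my understanding is that many TensorFlow GPU kernels are nondeterministic by default. On TF 2.8+ there is a switch for this (a sketch; on older versions the TF_DETERMINISTIC_OPS environment variable plays a similar role):

```python
import tensorflow as tf

tf.keras.utils.set_random_seed(42)              # seeds Python, NumPy, and TensorFlow together
tf.config.experimental.enable_op_determinism()  # request deterministic GPU kernels
```

Determinism usually costs some training speed, and a few ops have no deterministic GPU implementation at all.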