Clustering Data With Embeddings

Information

The question or comment is about chapter:

[ ] Introduction
[ ] Text Classification
[ ] Transformer Anatomy
[ ] Multilingual Named Entity Recognition
[ ] Text Generation
[ ] Summarization
[ ] Question Answering
[ ] Making Transformers Efficient in Production
[x] Dealing with Few to No Labels
[ ] Training Transformers from Scratch
[ ] Future Directions

Question or comment

Hello, I noticed that the book doesn't have much information about clustering unlabeled data. I'm aware that some resources out there address this question. However, it would be nice to know which techniques work best for clustering text, especially ones that don't rely on API calls that might be rate limited. I have been pondering these issues lately, and the best method I have found so far is:

1. Generate embeddings for the texts.
2. Apply a MinMax scaler to the embedding features.
3. Run an algorithm like K-means for a range of cluster counts and plot the number of clusters versus the silhouette score.
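
To make the steps above concrete, here is a minimal sketch using scikit-learn. It assumes the embeddings are already computed locally (for example with a sentence embedding model) and stored as a 2-D NumPy array; the synthetic `embeddings` array, the helper name `silhouette_by_k`, and the range of cluster counts are placeholders I made up for illustration, not anything from the book.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def silhouette_by_k(embeddings, k_values, seed=0):
    """Return {k: silhouette score} for K-means on min-max-scaled embeddings."""
    # Step 2: rescale every embedding dimension to the [0, 1] range.
    scaled = MinMaxScaler().fit_transform(embeddings)
    scores = {}
    for k in k_values:
        # Step 3: cluster with K-means and score the resulting partition.
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scaled)
        scores[k] = float(silhouette_score(scaled, labels))
    return scores


# Step 1 stand-in (assumption): three synthetic, well-separated groups of
# 8-dimensional vectors take the place of real text embeddings.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 8, [10.0] * 8, [10.0, -10.0] * 4])
embeddings = np.vstack([c + rng.normal(size=(50, 8)) for c in centers])

scores = silhouette_by_k(embeddings, range(2, 7))
best_k = max(scores, key=scores.get)  # cluster count with the highest score
print(scores, best_k)
```

Plotting the keys of `scores` against its values (for example with matplotlib) gives the curve of number of clusters versus silhouette score, and `best_k` is simply the cluster count at its peak.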

I would appreciate your thoughts on this.

Best, Nadim

nlp-with-transformers / notebooks

Clustering Data With Embeddings #136
