nlp-with-transformers / notebooks

Jupyter notebooks for the Natural Language Processing with Transformers book
https://transformersbook.com/
Apache License 2.0

Clustering Data With Embeddings #136

Open NadimKawwa opened 2 months ago

Information

The question or comment is about chapter:

Question or comment

Hello, I noticed that the book doesn't have much information about clustering unlabeled data. I'm aware that some resources out there address this question. However, it would be nice to know which techniques work best for clustering text, especially ones that don't rely on API calls that might be rate limited. I have been pondering these issues lately, and the winning method so far is:

  1. Generate embeddings.
  2. Apply a MinMax scaler to the features.
  3. Use an algorithm like K-means and plot the number of clusters versus the silhouette score.
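
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: it uses scikit-learn, the helper name `silhouette_by_k` is hypothetical, and step 1 is replaced by synthetic vectors from `make_blobs` so the example runs without downloading a model or calling an API. In practice the `embeddings` array would come from whatever local embedding model is in use.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def silhouette_by_k(embeddings, k_values, seed=0):
    """Hypothetical helper: scale features to [0, 1], then fit K-means
    for each candidate k and record the silhouette score."""
    # Step 2: MinMax scaling of each feature (embedding dimension).
    scaled = MinMaxScaler().fit_transform(embeddings)
    scores = {}
    for k in k_values:
        # Step 3: cluster with K-means and score the resulting labels.
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(scaled)
        scores[k] = float(silhouette_score(scaled, labels))
    return scores


# Step 1 (assumption): synthetic 32-dimensional vectors stand in for
# real text embeddings, which would have shape (n_texts, embedding_dim).
embeddings, _ = make_blobs(
    n_samples=300, n_features=32, centers=4, cluster_std=1.0, random_state=0
)

scores = silhouette_by_k(embeddings, range(2, 9))
best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("k with highest silhouette score:", best_k)
```

The `scores` dictionary maps each candidate number of clusters to its silhouette score, so it can be passed straight to a plotting library to produce the clusters-versus-silhouette curve described in step 3.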

I would appreciate hearing your thoughts on this.

Best, Nadim