manisnesan / fastchai

Repository capturing deep learning & nlp experiments using fastai & pytorch
Apache License 2.0

Alternative to Static Tagging Text Classification #71

Open manisnesan opened 3 months ago

manisnesan commented 3 months ago

Treat it as an unsupervised problem.

Approach (idea inspired by the topic modelling of user prompts in the Chatbot Arena paper):

To study the prompt diversity, we build a topic modeling pipeline with BERTopic (Grootendorst, 2022). We start with transforming user prompts into representation vectors using OpenAI’s text embedding model (text-embedding-3-small). To mitigate the curse of dimensionality for data clustering, we employ UMAP (Uniform Manifold Approximation and Projection) (McInnes et al., 2020) to reduce the embedding dimension from 1,536 to 5. We then use the hierarchical density-based clustering algorithm, HDBSCAN, to identify topic clusters with minimum cluster size 32. Finally, to obtain topic labels, we sample 10 prompts from each topic cluster and feed into GPT-4-Turbo for topic summarization.
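
A rough sketch of how the clustering and sampling steps from that paragraph could look for our tagging data. It is not the paper's code: it assumes the `umap-learn` and `hdbscan` packages, assumes the embeddings were already computed elsewhere (for example with text-embedding-3-small), and the function names `reduce_and_cluster`, `sample_per_topic` and `build_label_request` are made up for illustration. The call to the LLM is left out; the last helper only builds the text that would be sent.

```python
import random
from collections import defaultdict


def reduce_and_cluster(embeddings, n_components=5, min_cluster_size=32, seed=0):
    """Reduce embeddings with UMAP, then cluster with HDBSCAN.

    Returns one integer label per row; HDBSCAN uses -1 for noise points.
    """
    # Imported lazily so the helpers below work without these packages installed.
    import umap      # from the umap-learn package (assumed dependency)
    import hdbscan   # from the hdbscan package (assumed dependency)

    reduced = umap.UMAP(n_components=n_components, random_state=seed).fit_transform(embeddings)
    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)


def sample_per_topic(texts, labels, k=10, seed=0):
    """Pick up to k texts from each cluster, skipping noise (label -1)."""
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for text, label in zip(texts, labels):
        if label != -1:
            by_topic[int(label)].append(text)
    return {topic: rng.sample(items, min(k, len(items))) for topic, items in by_topic.items()}


def build_label_request(samples):
    """Build the text asking an LLM for a short topic label (wording is an assumption)."""
    listed = "\n".join(f"- {s}" for s in samples)
    return f"Summarize the common topic of these texts in a few words:\n{listed}"
```

Usage would be roughly `labels = reduce_and_cluster(embeddings)`, then `sample_per_topic(texts, labels)`, then one `build_label_request(...)` per topic sent to whichever chat model is available. The resulting topic labels would stand in for the static tag list.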

manisnesan commented 3 months ago

Related

https://www.aboutwayfair.com/careers/tech-blog/accelerating-catalog-tagging-automation-with-snorkels-data-centric-ai-platform-wayfairs-success-story