Approach (idea inspired by the topic modeling of user prompts in the Chatbot Arena paper)
To study the prompt diversity, we build a topic modeling pipeline with BERTopic (Grootendorst, 2022). We start by transforming user prompts into representation vectors using OpenAI’s text embedding model (text-embedding-3-small). To mitigate the curse of dimensionality for data clustering, we employ UMAP (Uniform Manifold Approximation and Projection) (McInnes et al., 2020) to reduce the embedding dimension from 1,536 to 5. We then use the hierarchical density-based clustering algorithm, HDBSCAN, to identify topic clusters with a minimum cluster size of 32. Finally, to obtain topic labels, we sample 10 prompts from each topic cluster and feed them into GPT-4-Turbo for topic summarization.
Treat it as an unsupervised problem.
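The middle of that pipeline (UMAP reduction, HDBSCAN clustering, and sampling prompts per cluster) could be sketched roughly as below. This is only a sketch under assumptions, not the paper's code: it calls the umap-learn and hdbscan packages directly instead of going through BERTopic, it expects the prompt embeddings to be computed already, and the function names, the seed argument, and the choice to skip HDBSCAN noise points (label -1) are hypothetical.

```python
import random
from collections import defaultdict


def cluster_embeddings(embeddings, n_components=5, min_cluster_size=32, seed=0):
    """Reduce embeddings with UMAP, then cluster them with HDBSCAN.

    Returns one integer label per row; HDBSCAN marks noise points with -1.
    """
    # Third-party imports are kept local so the sampling helper below
    # can be used without umap-learn and hdbscan installed.
    import hdbscan
    import umap

    # 1,536-dim embeddings -> 5 dims, as described in the quoted passage.
    reducer = umap.UMAP(n_components=n_components, random_state=seed)
    reduced = reducer.fit_transform(embeddings)

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced)


def sample_prompts_per_topic(prompts, labels, k=10, seed=0):
    """Pick up to k prompts from each cluster for later topic summarization.

    Assumption: noise points (label -1) are skipped, and clusters smaller
    than k contribute all of their prompts.
    """
    rng = random.Random(seed)
    by_topic = defaultdict(list)
    for prompt, label in zip(prompts, labels):
        label = int(label)  # also accepts NumPy integer labels
        if label != -1:
            by_topic[label].append(prompt)
    return {
        topic: rng.sample(members, min(k, len(members)))
        for topic, members in sorted(by_topic.items())
    }
```

With embeddings in hand, usage would look something like labels = cluster_embeddings(embeddings) followed by samples = sample_prompts_per_topic(prompts, labels); each list in samples is what would then be passed to the summarization model to get a topic label.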