mclevey / podlm

Probabilistic Opinion Dynamics with Language Models
MIT License

Use Dirichlet Process Bayesian Gaussian Mixture Models (DPBGMM) as clustering and representation models with BERTopic #25

Open mclevey opened 10 months ago

mclevey commented 10 months ago

In "A new method for computational cultural cartography: From neural word embeddings to transformers and Bayesian mixture models," we developed Dirichlet Process Bayesian Gaussian Mixture Models (DPBGMM) for clustering contextual embeddings. I think we can, and should, port those models over here and use them as both the cluster model and a representation model in BERTopic. The benefit of using them in the cluster step is that they are model-based rather than purely data-driven, and their data-generating process is well aligned with the task we are performing (which was the motivation for using them to model latent topics in the CRS project in the first place). We would then also have a probabilistic representation model to sit alongside the c-TF-IDF, KeyBERT, MMR, and Llama 2 representation models at the end of the BERTopic pipeline.
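A minimal sketch of what the clustering half could look like, assuming we start from scikit-learn's `BayesianGaussianMixture` with a Dirichlet process prior rather than the exact models from the paper. The wrapper name `DPGMMClusterer`, its parameters, and the synthetic data are all hypothetical and only illustrate the idea. The wrapper exists because BERTopic expects a cluster model that exposes `.fit`, `.predict`, and a `labels_` attribute, and the scikit-learn mixture class does not set `labels_` on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClusterMixin
from sklearn.mixture import BayesianGaussianMixture


class DPGMMClusterer(BaseEstimator, ClusterMixin):
    """Hypothetical wrapper: a truncated Dirichlet process GMM with a labels_ attribute."""

    def __init__(self, max_components=20, weight_concentration_prior=None,
                 covariance_type="full", random_state=0):
        self.max_components = max_components
        self.weight_concentration_prior = weight_concentration_prior
        self.covariance_type = covariance_type
        self.random_state = random_state

    def fit(self, X, y=None):
        # max_components is only an upper bound (the truncation level);
        # the DP prior can shrink the weights of unneeded components toward zero.
        self.model_ = BayesianGaussianMixture(
            n_components=self.max_components,
            weight_concentration_prior_type="dirichlet_process",
            weight_concentration_prior=self.weight_concentration_prior,
            covariance_type=self.covariance_type,
            random_state=self.random_state,
            max_iter=500,
        ).fit(X)
        # Hard assignments, plus the posterior probability of the assigned component.
        self.labels_ = self.model_.predict(X)
        self.probabilities_ = self.model_.predict_proba(X).max(axis=1)
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Synthetic stand-in for dimensionality-reduced embeddings: three blobs in 5 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 5)) for c in (-10.0, 0.0, 10.0)])

clusterer = DPGMMClusterer(max_components=10).fit(X)
print("components actually used:", len(np.unique(clusterer.labels_)))
```

If this holds up, the wrapper could be passed to BERTopic through its `hdbscan_model` argument, e.g. `BERTopic(hdbscan_model=DPGMMClusterer(max_components=50))`. The representation-model half would need a separate class and is not sketched here.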

@tcrick you could probably do this relatively quickly since you just finished up the replication kit for our German colleagues. Let me know if that is not the case.