michalovadek / top2vecr

An R implementation of top2vec, a topic modelling technique relying on jointly learned document and word embeddings

hierarchically cluster topics to a pre-specified K #3

Open michalovadek opened 3 years ago

michalovadek commented 3 years ago

By default, HDBSCAN determines the number of topics on its own. We should add a function that lets the user apply further hierarchical clustering to the results, so that the "optimal K" topics found by HDBSCAN are merged down to a user-defined K. Besides making top2vec comparable to standard topic modelling techniques such as LDA, this further clustering could be useful for topic models of large corpora, where HDBSCAN might identify dozens or hundreds of topics.
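A rough sketch of what such a function might look like, using only base R (`hclust` and `cutree` from `stats`). The function name `reduce_topics` and its arguments are hypothetical and not part of top2vecr. It assumes the topic vectors are available as a numeric matrix with one row per topic, alongside a vector of topic sizes (document counts). Average-linkage clustering on cosine distance is just one plausible choice here, not necessarily the merging strategy the package should adopt.

```r
# Hypothetical helper (not part of top2vecr): merge the topics found by
# HDBSCAN down to k topics by hierarchically clustering their vectors.
#   topic_vectors: numeric matrix, one row per topic (assumed input)
#   topic_sizes:   number of documents in each topic (assumed input)
#   k:             user-defined number of topics to keep
reduce_topics <- function(topic_vectors, topic_sizes, k) {
  stopifnot(
    is.matrix(topic_vectors),
    nrow(topic_vectors) == length(topic_sizes),
    nrow(topic_vectors) >= 2,
    k >= 1, k <= nrow(topic_vectors)
  )

  # Cosine distance between topic vectors (rows scaled to unit length)
  unit <- topic_vectors / sqrt(rowSums(topic_vectors^2))
  cos_dist <- as.dist(1 - tcrossprod(unit))

  # Agglomerative clustering of the topics, cut into k groups
  tree <- hclust(cos_dist, method = "average")
  mapping <- cutree(tree, k = k)

  # Size-weighted mean vector for each merged topic
  merged_vectors <- do.call(rbind, lapply(seq_len(k), function(g) {
    idx <- mapping == g
    colSums(topic_vectors[idx, , drop = FALSE] * topic_sizes[idx]) /
      sum(topic_sizes[idx])
  }))

  list(
    mapping = mapping,                                   # old topic -> new topic
    vectors = merged_vectors,                            # k x d matrix
    sizes   = as.vector(tapply(topic_sizes, mapping, sum))
  )
}

# Toy example: four topics in two dimensions, reduced to two
toy_vectors <- rbind(c(1, 0), c(0.9, 0.1), c(0, 1), c(0.1, 0.9))
toy_sizes <- c(10, 30, 5, 15)
reduced <- reduce_topics(toy_vectors, toy_sizes, k = 2)
print(reduced$mapping)
```

The returned `mapping` could then be used to relabel documents from their original HDBSCAN topic to the merged topic, which keeps the original fit intact and lets the user try several values of K cheaply.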