joewandy / hlda

Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
GNU General Public License v3.0

Choosing parameters for a large dataset of short texts #2

Open bwang482 opened 6 years ago

bwang482 commented 6 years ago

Thanks for your great work, Joe!

Following the provided notebook, I have been trying to use hlda to infer topics on a large set of short text documents (~100,000 docs, vocabulary size ~15,000). Sampling is very slow: 10 iterations (`n_samples = 10`) took about 11 hours.

From my results, as well as your demo, it seems level 0 has only one topic, which contains all docs. That makes sense, since level 0 is the root of the hierarchy. But I still want to confirm: if I want 4 levels of topics, with each level containing different topic/cluster assignments, should I set `num_levels = 5`? A rough sketch of my current setup follows.
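For context, here is roughly how I am constructing the sampler (a minimal sketch; the class and argument names are copied from your demo notebook, and `corpus`/`vocab` are placeholders for my own preprocessed data):

```python
from hlda.sampler import HierarchicalLDA

# corpus: list of documents, each a list of indices into vocab
# vocab: list of word strings
# Assumption: level 0 is always the single root topic shared by all
# documents, so num_levels = 5 should give 4 levels of distinct
# topic/cluster assignments below the root.
hlda = HierarchicalLDA(corpus, vocab, num_levels=5)
```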

Finally, may I ask how to choose values for `alpha` and `gamma` (or whether there is any intuition I can use), especially when inferring topics from a large set of short documents?
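To make the question concrete, these are the values I am currently passing, which are just the defaults from the notebook as far as I can tell (and whether they make sense for short texts is exactly what I am unsure about):

```python
# Notebook defaults, if I read them correctly:
#   alpha = 10.0  # smoothing over each document's distribution across levels
#   gamma = 1.0   # nCRP concentration; larger values should encourage more branching
#   eta   = 0.1   # smoothing over the per-topic word distributions
hlda = HierarchicalLDA(corpus, vocab,
                       alpha=10.0, gamma=1.0, eta=0.1,
                       num_levels=5)
hlda.estimate(10)  # n_samples = 10; roughly an hour per iteration on my corpus
```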

Thanks again.