joewandy / hlda

Gibbs sampler for the Hierarchical Latent Dirichlet Allocation topic model
GNU General Public License v3.0

Choosing parameters for a large dataset of short texts #2

Open bwang482 opened 6 years ago

bwang482 commented 6 years ago

Thanks for your great work, Joe!

Following the provided notebook, I have been trying to use hlda to infer topics on a large set of short text documents (~100,000 docs, vocabulary size ~15,000). Sampling is very slow: 10 iterations (`n_samples = 10`) took about 11 hours.

From my results, as well as your demo, it seems level 0 has only one topic, which contains all docs. That makes sense, since level 0 is the root of the hierarchy. But I still want to confirm: if I want 4 levels of topics, with each level containing different topic/cluster assignments, should I set `num_levels = 5`? A rough sketch of my current setup follows.
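For context, here is roughly how I am constructing the sampler (a minimal sketch; the class and argument names are copied from your demo notebook, and `corpus`/`vocab` are placeholders for my own preprocessed data):

```python
from hlda.sampler import HierarchicalLDA

# corpus: list of documents, each a list of indices into vocab
# vocab: list of word strings
# Assumption: level 0 is always the single root topic shared by all
# documents, so num_levels = 5 should give 4 levels of distinct
# topic/cluster assignments below the root.
hlda = HierarchicalLDA(corpus, vocab, num_levels=5)
```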

Finally, may I ask how to choose values for `alpha` and `gamma` (or whether there is any intuition I can use), especially when inferring topics from a large set of short documents?
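To make the question concrete, these are the values I am currently passing, which are just the defaults from the notebook as far as I can tell (and whether they make sense for short texts is exactly what I am unsure about):

```python
# Notebook defaults, if I read them correctly:
#   alpha = 10.0  # smoothing over each document's distribution across levels
#   gamma = 1.0   # nCRP concentration; larger values should encourage more branching
#   eta   = 0.1   # smoothing over the per-topic word distributions
hlda = HierarchicalLDA(corpus, vocab,
                       alpha=10.0, gamma=1.0, eta=0.1,
                       num_levels=5)
hlda.estimate(10)  # n_samples = 10; roughly an hour per iteration on my corpus
```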

Thanks again.