microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License
842 stars, 234 forks

Support asymmetric Dirichlet prior optimization #5

Open feiga opened 8 years ago

feiga commented 8 years ago

The currently released LightLDA doesn't support asymmetric Dirichlet prior optimization. However, our internal practice shows that this feature helps produce a better model (also see this).

If anyone is interested in contributing this feature, please reply or contact us through email. We can collaborate on this.

hiyijian commented 8 years ago

Hi, guys. Thank you for your amazing work on large-scale LDA. I think model quality is as important as scalability, so I am very interested in improving it. It is exciting to know that an asymmetric Dirichlet prior could help. Would you please share some of your experience with this? I will try my best to contribute.

hiyijian commented 8 years ago

Hi, guys. I have made an attempt at adding this feature in PR#22. The PR supports asymmetric alpha through the following steps:

  1. Add two extra tables to Multiverso. One is the topic frequency table, a matrix that counts each topic's frequency. The other is the doc length table, a row that counts how many documents have length k.
  2. Initialize the two extra tables from the randomly initialized documents.
  3. Learn the alpha distribution from the two extra tables every 5 iterations.
  4. Build an alias table for the learned alpha distribution.
  5. Sample topics with the learned alpha distribution and the alias table. Meanwhile, update the counts in the topic frequency table if necessary.
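
Steps 4 and 5 can be illustrated with a small, standalone sketch of the alias method (Vose's construction) in Python. This is not the code from PR#22. The function names `build_alias_table` and `sample_alias` are hypothetical, and the alpha values are made up for the example. It only shows how an alias table lets you draw a topic from an asymmetric alpha in constant time per draw.

```python
import random


def build_alias_table(weights):
    """Build an alias table (Vose's method) for unnormalized positive weights."""
    n = len(weights)
    total = float(sum(weights))
    # Scale so that the average bucket mass is exactly 1.
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # bucket s keeps its own mass ...
        alias[s] = l          # ... and is topped up by entry l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    # Leftover buckets are full up to rounding error.
    for i in large + small:
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def sample_alias(prob, alias, rng=random):
    """Draw one index: pick a bucket uniformly, then keep it or take its alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


# Hypothetical asymmetric alpha over four topics.
alpha = [0.5, 0.1, 0.3, 0.1]
prob, alias = build_alias_table(alpha)
topic = sample_alias(prob, alias)
```

Building the table takes time linear in the number of topics, so it only needs to be rebuilt when alpha is re-learned, not on every draw.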

To use this new feature, just run with the extra option "-num_alpha_iterations".
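
For readers wondering how alpha can be learned from the two tables in step 3, here is a minimal Python sketch. It makes two assumptions. First, it uses Minka's fixed-point update for the Dirichlet-multinomial, and the thread does not say which estimator PR#22 implements. Second, it interprets the topic frequency table as a per-topic histogram: `topic_hist[k][n]` is the number of documents in which topic k occurs n times, and `length_hist[n]` is the number of documents of length n. All names here are hypothetical.

```python
def build_histograms(doc_topic_counts):
    """Turn per-document topic counts into the two histograms (assumed layout)."""
    num_topics = len(doc_topic_counts[0])
    max_len = max(sum(row) for row in doc_topic_counts)
    topic_hist = [[0] * (max_len + 1) for _ in range(num_topics)]
    length_hist = [0] * (max_len + 1)
    for row in doc_topic_counts:
        length_hist[sum(row)] += 1
        for k, n in enumerate(row):
            topic_hist[k][n] += 1
    return topic_hist, length_hist


def learn_alpha(topic_hist, length_hist, alpha, num_iterations=50):
    """Minka-style fixed-point iteration for an asymmetric Dirichlet prior.

    Uses digamma(a + n) - digamma(a) = sum_{i<n} 1 / (a + i), so no special
    functions are needed and each histogram is scanned once per iteration.
    """
    alpha = list(alpha)
    for _ in range(num_iterations):
        alpha_sum = sum(alpha)
        # Denominator: sum over documents of digamma(len + sum) - digamma(sum).
        denom = 0.0
        diff = 0.0
        for n in range(1, len(length_hist)):
            diff += 1.0 / (alpha_sum + n - 1)
            denom += length_hist[n] * diff
        for k, hist in enumerate(topic_hist):
            # Numerator: sum over documents of digamma(n_dk + a_k) - digamma(a_k).
            numer = 0.0
            diff = 0.0
            for n in range(1, len(hist)):
                diff += 1.0 / (alpha[k] + n - 1)
                numer += hist[n] * diff
            # Floor is a guard added for this sketch so alpha never reaches zero.
            alpha[k] = max(alpha[k] * numer / denom, 1e-12)
    return alpha


# Toy corpus (made up): 5 documents, 3 topics, topic 0 dominates.
docs = [[8, 1, 1], [9, 1, 0], [7, 2, 1], [10, 0, 0], [6, 2, 2]]
topic_hist, length_hist = build_histograms(docs)
alpha = learn_alpha(topic_hist, length_hist, [0.1, 0.1, 0.1])
```

Working from histograms rather than per-document counts keeps the cost of each update independent of the number of documents, which is presumably why the two tables are kept.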

Please note that there are two TODOs: evaluation in asymmetric prior mode, and inference with an asymmetric prior.

feiga commented 8 years ago

Thanks, Jianyi! I will review the code.

hiyijian commented 8 years ago

@feiga, I am sorry that I made a mistake when updating the topic frequency table. I have fixed it and committed the fix to PR#22.