microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

The topics don't match across inference runs @feiga #83

Open RyanPeking opened 3 years ago

RyanPeking commented 3 years ago

Running inference on the same sentence gives different results each time:

first infer:
0 11:1 18:1 32:1 63:1 69:1 75:1 91:2 110:1 172:1 174:2 218:1 269:1 347:2 359:2

next infer:
0 13:1 28:2 66:1 110:1 135:2 151:1 181:1 235:1 240:1 261:1 284:1 317:1 353:1 355:1 360:2

The topics are not the same on any two inference runs.

But when the number of topics is small, there is no problem:

0 1:8 8:6 15:1
0 1:9 7:1 8:4 15:1

This confuses me.

@feiga
Thank you very much. You are right! I fixed two things in my latest commit:
1) To keep doc_topic_counter intact, inference now runs slice by slice per iteration, as you mentioned above.
2) When sampling in the inference phase, the word-related terms of Pi, i.e., n_sw_beta, n_s_beta_sum, n_tw_beta and n_t_beta_sum, SHOULD BE FIXED, which our previous discussion overlooked.
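To illustrate point 2, here is a minimal sketch of inference with frozen word-topic counts. It uses a plain collapsed Gibbs sampler, not LightLDA's own sampler, and every name in it (`infer_doc_topics`, `n_tw`, `n_t`, `n_dt`, the toy counts) is hypothetical and chosen for illustration only. The point it shows is that the trained counts are read but never updated, while only the per-document counts change.

```python
import random


def infer_doc_topics(words, n_tw, n_t, vocab_size, alpha=0.1, beta=0.01,
                     iters=50, seed=0):
    """Sample topic counts for one unseen document.

    n_tw[t] maps word id -> trained count for topic t, and n_t[t] is the
    total trained count for topic t. Both are read-only here, so the
    word-related term stays fixed; only the document counts n_dt change.
    """
    num_topics = len(n_t)
    rng = random.Random(seed)
    z = [rng.randrange(num_topics) for _ in words]  # random initial topics
    n_dt = [0] * num_topics
    for t in z:
        n_dt[t] += 1

    for _ in range(iters):
        for i, w in enumerate(words):
            n_dt[z[i]] -= 1  # remove the current assignment of token i
            weights = [
                (n_dt[t] + alpha)
                * (n_tw[t].get(w, 0) + beta) / (n_t[t] + vocab_size * beta)
                for t in range(num_topics)
            ]
            z[i] = rng.choices(range(num_topics), weights=weights)[0]
            n_dt[z[i]] += 1
    return n_dt


# Toy trained model (made-up numbers): 2 topics over a 4-word vocabulary.
n_tw = [{0: 50, 1: 50}, {2: 50, 3: 50}]
n_t = [100, 100]
doc = [0, 1, 0, 1, 2]

run_a = infer_doc_topics(doc, n_tw, n_t, vocab_size=4, seed=1)
run_b = infer_doc_topics(doc, n_tw, n_t, vocab_size=4, seed=1)
run_c = infer_doc_topics(doc, n_tw, n_t, vocab_size=4, seed=2)
print(run_a, run_b, run_c)
```

In this sketch, two runs with the same seed return identical counts, while a different seed may not, because the sampler is stochastic.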

After doing so, the results get much better. Here are the first 2 documents:

============training phase=============
0 260:1 549:2 778:1 1178:2 1309:1 1789:1 1843:2 2131:2 2390:3 2886:1
1 93:1 140:1 204:1 278:4 320:2 404:1 814:1 856:1 1164:2 1496:1 1627:4 1629:1 2059:1 2122:1 2177:1 2430:1 2686:1 2818:1 2880:1

==============inference phase=========
0 47:1 559:1 778:1 1178:2 1345:2 1843:1 2131:4 2390:3 2886:1
1 93:1 204:1 278:4 320:2 404:2 600:1 711:1 856:1 1164:2 1461:1 1496:1 1627:4 2059:1 2122:1 2144:1 2430:1 2518:1 2818:1

I think it is almost correct
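One rough way to put a number on "almost correct" is to measure how many topic assignments the two phases share for the same document. The sketch below assumes each output line has the form `doc_id topic:count topic:count ...`, which is how the results above appear to be laid out. The helper names `parse_doc_topics` and `overlap` are hypothetical, not part of LightLDA.

```python
def parse_doc_topics(line):
    """Parse 'doc_id topic:count topic:count ...' into (doc_id, {topic: count})."""
    fields = line.split()
    counts = {}
    for field in fields[1:]:
        topic, count = field.split(":")
        counts[int(topic)] = int(count)
    return int(fields[0]), counts


def overlap(a, b):
    """Fraction of token assignments shared by two topic-count dicts."""
    shared = sum(min(c, b.get(t, 0)) for t, c in a.items())
    return shared / max(sum(a.values()), sum(b.values()))


# Document 0 from the training and inference output quoted above.
train_line = "0 260:1 549:2 778:1 1178:2 1309:1 1789:1 1843:2 2131:2 2390:3 2886:1"
infer_line = "0 47:1 559:1 778:1 1178:2 1345:2 1843:1 2131:4 2390:3 2886:1"

_, train = parse_doc_topics(train_line)
_, infer = parse_doc_topics(infer_line)
print(overlap(train, infer))  # 0.625 (10 of 16 assignments shared)
```

This overlap measure is only one possible choice; it ignores how similar two different topics might be to each other.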

However, I think there are some defects in the current logic. First, it is unnecessary to rebuild the alias table for every slice/block/iteration. Second, it is unnecessary to build an alias table for every word in the large training vocabulary. Maybe it would be better to limit the user's input to just one block, and generate just one slice per block without vocab splitting. What do you think?
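The suggestion above can be sketched as follows: since the trained counts are fixed during inference, the per-word alias tables can be built once, and only for words that actually occur in the documents being inferred. This is a generic alias-method (Vose) sketch under that assumption, not LightLDA's implementation; `build_alias`, `sample_alias`, `build_tables_for_docs`, and the toy counts are all hypothetical.

```python
import random


def build_alias(weights):
    """Build alias-method tables (prob, alias) for O(1) sampling."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] = scaled[l] + scaled[s] - 1.0  # l donates mass to fill s
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias  # leftover entries keep prob 1.0


def sample_alias(prob, alias, rng):
    """Draw one index from the distribution encoded by (prob, alias)."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def build_tables_for_docs(docs, n_tw, n_t, beta, vocab_size):
    """Build one alias table per distinct word in docs, and nothing else."""
    tables = {}
    for w in {w for doc in docs for w in doc}:
        weights = [
            (n_tw[t].get(w, 0) + beta) / (n_t[t] + vocab_size * beta)
            for t in range(len(n_t))
        ]
        tables[w] = build_alias(weights)
    return tables


# Toy trained model (made-up numbers): 2 topics over a 4-word vocabulary.
n_tw = [{0: 50, 1: 50}, {2: 50, 3: 50}]
n_t = [100, 100]
docs = [[0, 2, 0], [2]]

tables = build_tables_for_docs(docs, n_tw, n_t, beta=0.01, vocab_size=4)
rng = random.Random(0)
print(sorted(tables), sample_alias(*tables[0], rng))
```

Here only words 0 and 2 get tables, even though the vocabulary has four words, and the tables are reused across all iterations because the trained counts never change.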

Thanks

Originally posted by @hiyijian in https://github.com/microsoft/LightLDA/issues/14#issuecomment-167526773