The same sentence gives results like this:
first infer: 0 11:1 18:1 32:1 63:1 69:1 75:1 91:2 110:1 172:1 174:2 218:1 269:1 347:2 359:2
the next infer:
0 13:1 28:2 66:1 110:1 135:2 151:1 181:1 235:1 240:1 261:1 284:1 317:1 353:1 355:1 360:2
The result is not the same for every inference run.
However, when the number of topics is small, there is no problem:
0 1:8 8:6 15:1
0 1:9 7:1 8:4 15:1
That makes me confused.
@feiga
Thank you very much. You are right! I fixed two things in my latest commit:
1) To keep doc_topic_counter intact, inference is done slice by slice per iteration, as you mentioned above.
2) When sampling in the inference phase, the word-related terms of Pi, i.e., n_sw_beta, n_s_beta_sum, n_tw_beta and n_t_beta_sum, SHOULD BE FIXED (kept frozen at their trained values), which our previous discussion ignored.
After doing so, the results get much better. Here are the first 2 documents:
============training phase=============
0 260:1 549:2 778:1 1178:2 1309:1 1789:1 1843:2 2131:2 2390:3 2886:1
1 93:1 140:1 204:1 278:4 320:2 404:1 814:1 856:1 1164:2 1496:1 1627:4 1629:1 2059:1 2122:1 2177:1 2430:1 2686:1 2818:1 2880:1
==============inference phase=========
0 47:1 559:1 778:1 1178:2 1345:2 1843:1 2131:4 2390:3 2886:1
1 93:1 204:1 278:4 320:2 404:2 600:1 711:1 856:1 1164:2 1461:1 1496:1 1627:4 2059:1 2122:1 2144:1 2430:1 2518:1 2818:1
I think it is almost correct now.
However, I think there are some defects in the current logic. First, it is unnecessary to rebuild the alias table per slice/block/iteration. Second, it is unnecessary to build an alias table for every word in the large training vocabulary. Maybe it's better to limit the user's input to just one block, and generate just one slice per block without vocabulary splitting. What do you think?
Thanks
Originally posted by @hiyijian in https://github.com/microsoft/LightLDA/issues/14#issuecomment-167526773