microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

Inferencing of new/unseen documents #14

Open bensigma opened 8 years ago

bensigma commented 8 years ago

Hi there

As far as I can see, there is currently no implementation for inferring new/unseen documents once the model is trained. Is that correct? If so, are you planning to add that? If not, do you have any pointers on how to accomplish it so that I can contribute?

Thanks Ben

feiga commented 8 years ago

Hi, Ben,

Yes, currently there is no implementation for inferring new documents. Contributions are warmly welcomed! :)

The procedure of inference is very similar to training. When training a model, we sample each token in the training corpus and then update all the statistics (both the word-topic table, lines 46-49, and the doc-topic table, lines 44-45) based on the newly sampled topic assignment. When inferring a new document, we sample each token in this document and modify only the doc-topic table. After a burn-in stage, you will get the result. Note that a distributed implementation is unnecessary for inference; a single-machine program is enough.

To implement it, you need to (a rough sketch follows the list):

  1. Load the trained model.
  2. Build the alias table (like lines 54-63)
  3. Infer your data and output after convergence.
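A minimal single-machine sketch of that loop, assuming hypothetical names (`Model`, `SampleTopic`) rather than the actual LightLDA classes:

```cpp
#include <vector>

struct Model;  // hypothetical: the frozen word-topic stats loaded from the trained model

// Hypothetical sampler: draws a topic for `word` from the fixed model
// combined with the current doc-topic counts (e.g., via the alias table).
int SampleTopic(const Model& model, int word, const std::vector<int>& doc_topic);

// Inference touches only the doc-topic counts; the word-topic table
// stays frozen at its trained values the whole time.
void InferDocument(const Model& model,
                   const std::vector<int>& tokens,
                   std::vector<int>& doc_topic,    // size = num_topics
                   std::vector<int>& assignment,   // current topic of each token
                   int burn_in_iterations) {
    for (int iter = 0; iter < burn_in_iterations; ++iter) {
        for (size_t i = 0; i < tokens.size(); ++i) {
            --doc_topic[assignment[i]];  // remove the old assignment (doc side only)
            int new_topic = SampleTopic(model, tokens[i], doc_topic);
            ++doc_topic[new_topic];      // record the new one
            assignment[i] = new_topic;
        }
    }
}
```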

-Fei

bensigma commented 8 years ago

Hi Fei, OK, thanks, I'll give it a shot ;)

hiyijian commented 8 years ago

Hi guys, I am working on this. Following Fei's reply, I implemented it, but the result seems wrong. For example, for the first document of nytimes (doc#0), the inferred doc-topic vector is:

    0 0:186 19619:1 4459:1 13271:1 14785:1 6836:1 15266:1

but the output from the training phase has 288 topics. Additionally, topic#0 dominates the others in every document. It's obviously wrong. I will open a pull request after some debugging and hope for help. Thank you.

hiyijian commented 8 years ago

Hi guys, I opened pull request pull#17. It is obviously buggy and I have no clue what goes wrong. Please help.

chivee commented 8 years ago

Hi @hiyijian, thanks for your contribution. Fei and I will look into your pull request and try to give you some advice.

hiyijian commented 8 years ago

@chivee That's cool, thank you.

hiyijian commented 8 years ago

For the bug “every document's topic#0 dominates others”: I fixed it in the latest commit by randomly initializing the doc-topic vector (shame on me), and the result turns out to be fairly "normal". However, the results for the exact same doc from the training phase and the inference phase are very different from each other.
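For reference, a random initialization along these lines (a sketch with assumed names, not the actual patch) is enough to break the all-topic-0 degeneracy:

```cpp
#include <random>
#include <vector>

// Give each token a uniformly random initial topic and accumulate
// the doc-topic counts accordingly, before the burn-in starts.
void RandomInitDoc(int num_topics,
                   std::vector<int>& assignment,   // one entry per token
                   std::vector<int>& doc_topic) {  // size = num_topics, zero-filled
    std::mt19937 gen(std::random_device{}());
    std::uniform_int_distribution<int> dist(0, num_topics - 1);
    for (size_t i = 0; i < assignment.size(); ++i) {
        assignment[i] = dist(gen);
        ++doc_topic[assignment[i]];
    }
}
```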

feiga commented 8 years ago

Hi, Hiyijian, thanks!

First of all, I think it's better for the inference algorithm to be a standalone program, because a single-machine program is enough for inference (for a large-scale dataset you can use multiple independent processes). You can then load the trained model into a local buffer and use it more conveniently and efficiently than the storage provided by Multiverso. You can still reuse the current logic, such as the trainer, the alias method, and the sampling algorithm, without the time and memory cost incurred by the communication layer, which is dedicated to a distributed system.
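As an illustration, such a local buffer could be as simple as a dense word-topic count array (a sketch under assumed names; the real model file format dictates what the loader must parse):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A minimal local stand-in for the Multiverso word-topic table:
// dense counts indexed by [word * num_topics + topic].
class LocalModel {
public:
    LocalModel(int num_vocabs, int num_topics)
        : num_topics_(num_topics),
          word_topic_(static_cast<std::size_t>(num_vocabs) * num_topics, 0),
          topic_sum_(num_topics, 0) {}

    int32_t WordTopicCount(int word, int topic) const {
        return word_topic_[static_cast<std::size_t>(word) * num_topics_ + topic];
    }
    int64_t TopicSum(int topic) const { return topic_sum_[topic]; }

    // Called once by the (format-specific) model loader; counts are never
    // modified afterwards, since the model stays fixed during inference.
    void Set(int word, int topic, int32_t count) {
        word_topic_[static_cast<std::size_t>(word) * num_topics_ + topic] = count;
        topic_sum_[topic] += count;
    }

private:
    int num_topics_;
    std::vector<int32_t> word_topic_;
    std::vector<int64_t> topic_sum_;
};
```

Since a converged model is sparse, a per-word sorted vector of (topic, count) pairs would use far less memory; the dense layout above just keeps the sketch short.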

Back to the inference problem: I'm still not sure of the reason. You said the "output from training phase has 288 topics"; a single document is not supposed to have so many different topics. How many iterations do you use in the training and inference phases? I will read your code further.

-Fei

hiyijian commented 8 years ago

hi, @feiga, thank you

I realized the performance issue of using the storage provided by Multiverso before I wrote this code. But for coding convenience, I decided to reuse the existing code as much as possible and minimize modifications until I get reasonable results. After that, I'd like to switch the storage from Multiverso to local RAM. This is partly because I am not confident about how well I understand the codebase. Sorry for that.

As for getting too many topics per document, it is perhaps caused by non-convergence; I don't know, but my setting is 100 iterations. Please set that aside and let me explain more clearly.

I ran a new experiment on a small subset of nytimes with my latest commit (please check it):

######### training phase ###########

  1. Get the first 10000 documents of nytimes.libsvm (`head -10000 nytimes.libsvm > nytimes.libsvm.10000`)
  2. `$bin/dump_binary $dir/nytimes.libsvm.10000 $dir/nytimes.word_id.dict $dir 0`
  3. `$bin/lightlda -num_vocabs 111400 -num_topics 400 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 4 -num_blocks 1 -max_num_document 10000 -input_dir $dir -data_capacity 32`

After these steps, I got: doc likelihood: -1.095947e+07, word likelihood: 1.231914e+08, normalized likelihood: -2.313777e+08, which I think means it has converged, right? Finally, I got doc_topic.0 and the model files.

######### inference phase ###########

  1. Copy the model (server_0_table_0 and server_0_table_1) and nytimes.word_id.dict to another dir
  2. Get the first 5 documents of nytimes.libsvm (`head -5 nytimes.libsvm > nytimes.libsvm.5`)
  3. `$bin/dump_binary $dir/nytimes.libsvm.5 $dir/nytimes.word_id.dict $dir 0`
  4. `$bin/lightlda -infer -num_vocabs 111400 -num_topics 400 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 1 -num_blocks 1 -max_num_document 100 -input_dir $dir -data_capacity 32`

Finally, I got another doc_topic.0.

####### compare doc#0 of training and inference ##############

training:

    0 1:1 26:3 28:2 50:1 84:1 98:1 102:1 112:1 133:51 148:13 153:1 176:1 183:1 187:18 193:3 200:1 201:1 205:4 208:3 213:1 216:1 243:2 254:4 269:1 275:1 311:61 318:1 326:1 339:1 341:1 353:1 354:1 355:1 360:2 373:1 389:3

inference:

    0 1:1 2:2 5:1 10:1 18:1 20:1 25:1 26:1 27:2 28:1 37:1 38:1 40:1 43:1 44:1 45:1 46:1 49:1 51:1 52:1 56:1 58:1 63:3 64:1 68:1 72:1 76:1 80:2 84:1 85:1 86:3 91:1 94:3 95:1 97:1 104:1 106:3 109:3 113:2 115:1 120:2 121:1 123:2 125:1 133:3 138:2 140:1 141:1 142:1 147:2 156:2 163:1 165:1 167:1 169:1 170:3 174:1 175:1 176:1 181:1 182:1 183:1 184:1 187:1 188:1 191:1 192:1 193:2 195:2 196:2 199:1 205:2 208:1 211:1 218:1 219:1 221:1 223:1 225:1 226:1 227:1 229:1 235:1 238:1 242:2 243:2 246:1 247:1 248:1 251:1 256:1 260:2 265:1 266:1 267:1 275:1 277:1 283:2 284:2 285:1 295:1 296:4 302:1 304:1 308:1 309:1 312:1 314:1 317:1 318:2 321:1 325:1 328:1 332:4 334:1 337:1 338:2 339:2 342:1 344:2 348:1 350:1 351:1 353:3 357:1 364:1 367:1 368:5 371:2 373:2 377:1 382:1 384:1 385:1 386:1 392:4 393:1 395:1 396:1

hiyijian commented 8 years ago

Hi guys, following @feiga's latest tips, I spent a bit more time switching the storage from the parameter server to a local buffer for inference in my latest commit of pull#17. However, the same issue (the results for the exact same doc from the training phase and the inference phase are very different from each other) still exists. I have debugged quite a lot in the alias table building and sampling, but unfortunately I still have no clue. I am still working on it; please let me know if you find anything.

hiyijian commented 8 years ago

ping

feiga commented 8 years ago

@hiyijian hi, sorry for the late response.

In line 99 of infer.cpp, the order of the for loops should be:

foreach iteration 
    foreach data block 
        foreach slice
            // Handle

If there is only one slice, it should be OK. But with two or more slices, it would not be correct.

The initial implementation is tightly coupled to the model slicing, which helps handle a big model on a local machine with a modest amount of memory. But this also makes things complicated when dealing with a small model, as in inference (since the model is converged and sparse). I suspect there may be some problems where you hacked the code around this logic.

I haven't seen other potential problems. You can try some tests: 1) make sure your model loader is correct; 2) test with only one document and debug its behavior.

hiyijian commented 8 years ago

@feiga
Thank you very much, you are right! I fixed two things in my latest commit: 1) to keep doc_topic_counter intact, infer slice by slice per iteration, as you mentioned above; 2) when sampling in the inference phase, the word-related terms of Pi, i.e., n_sw_beta, n_s_beta_sum, n_tw_beta and n_t_beta_sum, SHOULD BE FIXED, which we overlooked in our previous discussion.
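To make the second point concrete, here is the shape of the per-topic weight in the standard collapsed-Gibbs conditional (a sketch for illustration; LightLDA's actual sampler is Metropolis-Hastings over alias tables, but the same factors appear in its proposals). `beta_sum` would be `num_vocabs * beta`; only the doc-topic factor varies during inference:

```cpp
#include <cstdint>
#include <vector>

// Unnormalized p(z = k | rest) for one token at inference time.
// doc_topic changes as we sample; word_topic (this word's row) and
// topic_sum come from the trained model and are never decremented here.
double TopicWeight(int k,
                   const std::vector<int>& doc_topic,
                   const std::vector<int>& word_topic,     // frozen
                   const std::vector<int64_t>& topic_sum,  // frozen
                   double alpha, double beta, double beta_sum) {
    double doc_part  = doc_topic[k] + alpha;                               // varies
    double word_part = (word_topic[k] + beta) / (topic_sum[k] + beta_sum); // fixed
    return doc_part * word_part;
}
```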

After doing so, the results get much better. Here are the first 2 documents:

============ training phase =============

    0 260:1 549:2 778:1 1178:2 1309:1 1789:1 1843:2 2131:2 2390:3 2886:1
    1 93:1 140:1 204:1 278:4 320:2 404:1 814:1 856:1 1164:2 1496:1 1627:4 1629:1 2059:1 2122:1 2177:1 2430:1 2686:1 2818:1 2880:1

============== inference phase =========

    0 47:1 559:1 778:1 1178:2 1345:2 1843:1 2131:4 2390:3 2886:1
    1 93:1 204:1 278:4 320:2 404:2 600:1 711:1 856:1 1164:2 1461:1 1496:1 1627:4 2059:1 2122:1 2144:1 2430:1 2518:1 2818:1

I think it is almost correct

However, I think there are some defects in the current logic. First, it is unnecessary to rebuild the alias table per slice/block/iteration. Second, it's unnecessary to build the alias table for every word in the big vocabulary of the training phase. Maybe it's better to limit the user's input to just one block, and generate just one slice per block without vocabulary splitting. What do you think?

Thanks

feiga commented 8 years ago

@hiyijian Great! It should be correct now. I noticed you had removed those "minus one"s in the sampler.

Yes, it's unnecessary to rebuild the alias table. Since the model is fixed in the inference phase, you only need to build it once at the beginning. I agree with you that one block and one slice is better when inferring. But the slice mechanism is needed in training to solve the big-model challenge: with it, we only need to load and store part (a slice) of the whole model, which may be too big to fit in memory. The vocabulary is split automatically based on the memory size.

I'm not clear on what you mean by the following:

> it's unnecessary to build the alias table for every word in the big vocabulary of the training phase

Thanks, Fei

hiyijian commented 8 years ago

Okay, @feiga, thanks for your kind guidance.

Finally, I think I have reached the goal of adding this new feature. Let me explain the main route of what I did:

  1. Load the trained model into a local buffer, instead of the parameter server
  2. Schedule each block/vocab pair without vocabulary splitting, i.e., one slice per pair instead of multiple slices
  3. Build the alias table only once before inferring each pair, instead of rebuilding it in every iteration (see the sketch after this list)
  4. Infer documents just like in training, except that the word-topic related counts, i.e., the word-topic count matrix, n_sw_beta, n_s_beta_sum, n_tw_beta and n_t_beta_sum, are kept fixed
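For context, "build once" here just means running a standard alias-method construction (e.g., Vose's algorithm) over each word's topic weights at startup and reusing the tables for every iteration; a generic sketch, not the repository's actual alias table code:

```cpp
#include <vector>

// Vose's alias method: O(K) construction, O(1) sampling.
// Built once per word from the frozen model, then reused across iterations.
struct AliasTable {
    std::vector<double> prob;  // acceptance probability per bucket
    std::vector<int> alias;    // fallback outcome per bucket

    void Build(const std::vector<double>& weights) {
        const int k = static_cast<int>(weights.size());
        prob.assign(k, 0.0);
        alias.assign(k, 0);
        double sum = 0.0;
        for (double w : weights) sum += w;
        std::vector<double> scaled(k);
        std::vector<int> small, large;
        for (int i = 0; i < k; ++i) {
            scaled[i] = weights[i] * k / sum;
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }
        while (!small.empty() && !large.empty()) {
            int s = small.back(); small.pop_back();
            int l = large.back(); large.pop_back();
            prob[s] = scaled[s];
            alias[s] = l;
            scaled[l] += scaled[s] - 1.0;  // l donated (1 - scaled[s]) mass to s
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (int s : small) prob[s] = 1.0;  // leftovers are exactly full buckets
        for (int l : large) prob[l] = 1.0;
    }

    // u1, u2 are independent uniform random numbers in [0, 1).
    int Sample(double u1, double u2) const {
        int bucket = static_cast<int>(u1 * prob.size());
        return u2 < prob[bucket] ? bucket : alias[bucket];
    }
};
```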

One last tip: although in this implementation you can feed the program multiple block/vocab pairs and get correct answers, I strongly suggest feeding just one merged block and turning on the out-of-core switch if it is too big to fit in memory. This way, you avoid building alias tables for duplicated words and save running time.

One more thing to report: I did some model visualization and compared models trained by lightlda and plda+ under the same settings (alpha, beta, 100 iterations, 2,000,000 Chinese documents, 3000 topics). To my surprise and dismay, I found that the model quality of lightlda is far worse than plda+'s. I don't know why. Maybe it would be better to introduce an asymmetric Dirichlet prior? We could discuss that in another issue.

Thanks

feiga commented 8 years ago

Thanks! @hiyijian I agree with your last tip.

It's natural that lightlda performs worse in terms of "model quality" compared with plda+ (which uses the SparseLDA sampler) within the same number of iterations, because lightlda is a kind of "approximation" of the original sampler. The expectation is that lightlda runs more iterations in much less training time to reach a better-quality model. Still, I'm a little surprised the quality is FAR worse than plda+'s. By "model quality", what's your evaluation metric? Maybe you can share the results with me?

As for the reasons: yes, we observed that the hyperparameters (alpha and beta) affect the lightlda and SparseLDA samplers differently; lightlda is more sensitive. Maybe you can try different alpha and beta values (I suggest relatively small ones). As you point out, introducing an asymmetric Dirichlet prior would help.

hiyijian commented 8 years ago

@feiga Thanks. Do you have any plan to merge the pull request into your master? I would be glad to help if needed. As for the model quality issue, the way I judge model quality is just randomly picking some topics and judging by human experience. For example:

======== plda+'s topic =============

    TOPIC: 7 42094.0 嵌入式 4477.0 操作系统 2378.0 linux 1972.0 嵌入式系统 1706.0 内核 1272.0 处理器 1216.0 硬件 1167.0 arm 1014.0 移植 798.0 windows 787.0 驱动程序 759.0 接口 593.0 设备 588.0 实时操作系统 495.0 应用程序 487.0 微处理器 458.0 驱动 392.0 硬件平台 385.0 模块 375.0

======== lightlda's topic =============

    TOPIC: 5 18379 衰老 4520 半乳糖 1686 打击 1364 黑土 1105 实习生 861 tunel 702 明显增加 612 信息产业 547 延缓 509 渗滤 451 原位 437 变通 335 生产基地 291 pbdes 235 染色 192 居里温度 189 皮下注射 165 酪酸 153 无助 127 ③ 120 上调 110 离退休人员 105

(The plda+ topic is a coherent embedded-systems topic; the lightlda topic mixes unrelated words.)

I will spare some time to upload my training corpus and the models from lightlda and plda+ to pan.baidu.com. I also plan to tune the parameters following your guidance and see what happens.

Lastly, I would like to collaborate with you guys on introducing an asymmetric Dirichlet prior.

feiga commented 8 years ago

Hi, @hiyijian, thanks a lot for your great work! I'll review the code again and run some tests. Besides, you need to sign a contribution license agreement (follow here). After that I will merge the PR into master.

Thanks for your interest in the asymmetric Dirichlet prior; let's discuss it in this issue. @yuanms2 Jinhui can share his experience on this.

Happy new year! Fei

hiyijian commented 8 years ago

@feiga , Really cool! Happy new year and best regards

hiyijian commented 8 years ago

Hi, @feiga, I noticed you have merged the PR. But I am sorry about a small mistake in the Makefile: g++-4.8 should be replaced by g++. So that you can check the model quality issue I observed, the corpus and trained models are uploading now; I guess it will take another two days. Sorry again.

feiga commented 8 years ago

Hi, @hiyijian It's OK. I will fix it.

FYI, I just fixed a bug, which you can see from the latest commit. It should affect the trained model. You can get the latest version and try again.

hiyijian commented 8 years ago

Hi, @feiga, I am very happy to report that model quality gets MUCH MUCH better using the latest lightlda with a lower alpha (50.0 / num_topics). I think there is no need to check the old models now.

I merged your master and removed an unnecessary barrier wait when inferring in pull#19. Please check it.

Thanks.

faizan30 commented 7 years ago

Hi, is there an example or documentation for inferring new/unseen documents?

koustuvsinha commented 7 years ago

@hiyijian @feiga Hi, can you specify which parameters to provide for the inference task? The usage message is not helpful at the moment, so I am getting confused. The infer command loads the pretrained model server_0_table_0.model, but what about the other parameters?

LightLDA Inference usage:
-num_vocabs <arg>        Size of dataset vocabulary 
         [is this the num vocabs for new/unseen documents dumped by dump_block? 
          or is this of the training set?]
-num_topics <arg>        Number of topics. Default: 100
-num_iterations <arg>    Number of iterations. Default: 100
-mh_steps <arg>          Metropolis-hasting steps. Default: 2
-alpha <arg>             Dirichlet prior alpha. Default: 0.1
-beta <arg>              Dirichlet prior beta. Default: 0.01

-num_blocks <arg>        Number of blocks in disk. Default: 1
-max_num_document <arg>  Max number of document in a data block 
       [again, is this of the new/unseen documents? or of the training set?]
-input_dir <arg>         Directory of input data, containing
                         files generated by dump_block 
      [this directory should contain the models and dump files from new/unseen 
      data, right?]

Please clarify the above questions! Thanks!