microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

[BUG] Infer: Fail to build alias row, capacity of row = 0 #23

Open rudaoshi opened 8 years ago

rudaoshi commented 8 years ago

Hi, I've trained a model with LightLDA and want to infer topics for new documents. However, when I run the infer program, it fails with the error: Fail to build alias row, capacity of row = 0. The details are as follows:

  1. I ran lightlda with the following command: lightlda/bin/lightlda -num_vocabs 279164 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 4 -num_blocks 1 -max_num_document 100000 -input_dir data -data_capacity 6200 &
  2. After the command completed, I found three files: server_0_table_0.model, server_0_table_1.model, and doc_topic.0. The file server_0_table_0.model has 279163 lines and server_0_table_1.model has 1 line.
  3. I ran infer on the exact training data with the following command: lightlda/bin/infer -num_vocabs 279164 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 4 -num_blocks 1 -max_num_document 100000 -input_dir data -data_capacity 6200
  4. The program exits with the following log:
     [INFO] [2016-02-05 11:19:02] Actual Alias capacity: 111 MB
     [INFO] [2016-02-05 11:19:02] loading model
     [ERROR] [2016-02-05 11:19:02] Fail to build alias row, capacity of row = 0
     [ERROR] [2016-02-05 11:19:02] Fail to build alias row, capacity of row = 0

Can anyone tell me what is going wrong here?

hiyijian commented 8 years ago

Please check that all model files are in the same directory as the training data. The infer program looks for the model files in the path given by -input_dir, so if they were written somewhere else during training, you need to move them into input_dir before running inference.
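A quick way to run that check before starting infer is sketched below in Python. The file names are the ones mentioned in this thread (the two model tables plus block.0 and vocab.0). Whether infer needs exactly this set is an assumption, and `missing_inference_files` is a hypothetical helper, not part of LightLDA.

```python
from pathlib import Path

# File names as they appear in this thread. Treat the list as an assumption
# and adjust it to whatever your own training run actually produced.
EXPECTED_FILES = [
    "server_0_table_0.model",
    "server_0_table_1.model",
    "block.0",
    "vocab.0",
]


def missing_inference_files(input_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in input_dir."""
    root = Path(input_dir)
    missing = []
    for name in expected:
        path = root / name
        # An empty file is reported too, since it cannot hold a usable model.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_inference_files("data")` with the -input_dir used above returns an empty list when every listed file is present and non-empty.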

cldellow commented 8 years ago

I've seen something similar when:

(1) the testing vocabulary has some words with non-zero tf, and
(2) those words aren't associated with any topics in the trained model.

This can happen even when the test dataset is the same as the training dataset.

Omitting those words from the corpus resolves the issue, and the inference results still look OK. On a small data set of ~1000 documents these words were only 0.2% of the vocabulary. I now do this as a preprocessing step in my data pipeline. I'm not sure if this is a legitimate thing to do or not :)

HTH!
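The preprocessing step described above could look roughly like the Python sketch below. It rests on two unverified assumptions about file layout: each line of server_0_table_0.model starts with an integer word id followed by topic:count pairs, and the corpus is a libsvm-style text file whose lines are a label followed by word_id:count tokens. All function names are hypothetical. In the original report, server_0_table_0.model has 279163 lines for -num_vocabs 279164, which might point to one such word, though nothing in the thread confirms it.

```python
def load_model_word_ids(model_path):
    """Collect the word ids that have topic counts in the word-topic model.

    Assumes each line is '<word_id> <topic>:<count> ...'; check this against
    your own model file before relying on it.
    """
    ids = set()
    with open(model_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) > 1:  # a bare id with no topic counts is skipped
                ids.add(int(fields[0]))
    return ids


def filter_libsvm_line(line, known_ids):
    """Drop word_id:count tokens whose word id is not in known_ids."""
    fields = line.split()
    if not fields:
        return ""
    label, tokens = fields[0], fields[1:]
    kept = [t for t in tokens if int(t.split(":", 1)[0]) in known_ids]
    # A tab after the label is an assumption about the expected layout.
    return label + "\t" + " ".join(kept)


def filter_corpus(src_path, dst_path, known_ids):
    """Write a filtered copy of the corpus; return how many tokens were dropped."""
    dropped = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            cleaned = filter_libsvm_line(line, known_ids)
            if not cleaned:
                continue  # blank input line
            dropped += len(line.split()) - len(cleaned.split())
            dst.write(cleaned + "\n")
    return dropped
```

The filtered file would then need to go through the same binary conversion as the original corpus before infer is run on it.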

shcup commented 8 years ago

I have the same problem. I have moved the model to the input dir and the log shows that the model was loaded, but it still does not work.

Abigale001 commented 5 years ago

I have run into the same problem: Fail to build alias row, capacity of row = 0. I have moved block.0, vocab.0, vocab.0.txt, and the trained model into the input_dir. Could anyone help?