microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

Fail to build alias row, capacity of row = 0 Floating point exception #57

Open landesire opened 7 years ago

landesire commented 7 years ago

I am using LightLDA to run inference on new documents. I converted the new (unseen) documents to a libsvm file using the old vocabulary dictionary and generated the data block. I then loaded the model files server_0_table_0.model and server_0_table_1.model and ran bin/infer on the new documents, but got this:

[INFO] [2017-09-07 12:15:29] Actual Alias capacity: 50 MB
[INFO] [2017-09-07 12:15:29] loading model
[INFO] [2017-09-07 12:15:29] loading word topic table[server_0_table_0.model]
[INFO] [2017-09-07 12:15:31] loading summary table[server_0_table_1.model]
[ERROR] [2017-09-07 12:15:31] Fail to build alias row, capacity of row = 0
Floating point exception

Can someone help me? Is it because there are new words in my new documents? I assumed that once the documents are converted to libsvm, the word dictionary is no longer involved in the inference process.

fengyachao commented 6 years ago

I also ran into this problem. How can it be resolved?

landesire commented 6 years ago

Yes, I found the fix in an earlier answer:

I've seen something similar when:

(1) the testing vocabulary has some words with non-zero tf
(2) those words aren't associated with any topics in the trained model

This can happen even when the test dataset is the same as the training dataset.

Omitting those words from the corpus resolves the issue and the inferencing results still look OK. On a small data set of ~1000 documents these words were only 0.2% of the vocabulary. I now do this as a preprocessing step in my data pipeline. I'm not sure if this is a legitimate thing to do or not :)
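The preprocessing step described above could look roughly like the following. This is only a sketch, not part of LightLDA. It assumes that each line of server_0_table_0.model has the form `word_id topic:count topic:count ...` and that each libsvm line is a document label followed by `word_id:count` pairs; check both against your own files first. The function names `load_model_word_ids` and `filter_libsvm_line` are made up for illustration.

```python
def load_model_word_ids(model_lines):
    """Return ids of words that have at least one positive topic count.

    Assumes each model line looks like: "word_id topic:count topic:count ...".
    """
    known = set()
    for line in model_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line, or a word with no topic entries at all
        if any(int(p.split(":")[1]) > 0 for p in parts[1:]):
            known.add(int(parts[0]))
    return known


def filter_libsvm_line(line, known):
    """Drop word_id:count pairs whose word id is not in `known`.

    Assumes "label<TAB or space>id:count id:count ..."; the separator
    after the label is kept as it was in the input.
    """
    line = line.rstrip("\n")
    head, sep, rest = line.partition("\t")
    if not sep:
        head, sep, rest = line.partition(" ")
    kept = [f for f in rest.split() if int(f.split(":")[0]) in known]
    return head + sep + " ".join(kept)


if __name__ == "__main__":
    # Toy data, not real LightLDA output.
    model = ["0 1:3 2:1", "2 0:5", "3"]
    known = load_model_word_ids(model)
    print(filter_libsvm_line("7\t0:2 1:4 2:1 3:9", known))
```

In practice you would read the model lines from server_0_table_0.model, write the filtered lines to a new libsvm file, and regenerate the data block from that file before running bin/infer. A document can end up with no words after filtering; whether that causes its own problem is something I have not checked.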