microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

solve issue 56 #84

Open Jack2313 opened 3 years ago

Jack2313 commented 3 years ago

When LightLDA dumps a binary file using a .dict built from only part of the corpus, the word_num parameter will be smaller than the true maximum word ID in the corpus. As a result, dump_binary won't write words whose ID is bigger than word_num to the output file. The ignored words are probably treated as topic 0, which causes issue 56.
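To illustrate the mismatch described above, here is a minimal, self-contained sketch. It is not LightLDA's actual code and not the change in this PR: `Document`, `VocabSizeFromCorpus`, `SafeWordNum`, and `CountDroppedTokens` are hypothetical names, and the sketch assumes word IDs are 0-based so that valid IDs lie in `[0, word_num)`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical in-memory document: a flat list of 0-based word IDs.
using Document = std::vector<int32_t>;

// Vocabulary size implied by the corpus itself: (largest word ID) + 1.
// Returns 0 for an empty corpus.
int32_t VocabSizeFromCorpus(const std::vector<Document>& corpus) {
    int32_t max_id = -1;
    for (const auto& doc : corpus) {
        for (int32_t id : doc) {
            max_id = std::max(max_id, id);
        }
    }
    return max_id + 1;
}

// Take the larger of the dictionary's word count and the corpus-derived
// size, so a .dict built from a partial corpus cannot shrink the range.
int32_t SafeWordNum(int32_t dict_word_num, const std::vector<Document>& corpus) {
    return std::max(dict_word_num, VocabSizeFromCorpus(corpus));
}

// Count the tokens that a dump bounded by word_num would skip
// (assumption: a token is skipped when its ID is >= word_num).
int64_t CountDroppedTokens(int32_t word_num, const std::vector<Document>& corpus) {
    int64_t dropped = 0;
    for (const auto& doc : corpus) {
        for (int32_t id : doc) {
            if (id >= word_num) {
                ++dropped;
            }
        }
    }
    return dropped;
}
```

With a dictionary that lists 4 words and a corpus whose largest ID is 7, `CountDroppedTokens(4, corpus)` reports the tokens that would be lost, while `CountDroppedTokens(SafeWordNum(4, corpus), corpus)` is zero.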

I changed the following code in my local environment; it works fine and solves the issue.