microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

doc topic distributed #4

Closed: wanggaohang closed this issue 8 years ago

wanggaohang commented 8 years ago

I see the results are server_0_table_0.model and server_0_table_1.model. server_0_table_0.model is the distribution of topics over terms, but server_0_table_1.model has only one line. Can I get the distribution of topics for every docid?

feiga commented 8 years ago

@wanggaohang Sorry for the ambiguous output file names. The dumped model is what is stored in the parameter server. Table 0 is the word-topic table, and table 1 is the summary row: a [# of topics]-dimensional vector containing the total occurrence count of each topic.
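To illustrate how the two tables fit together, here is a minimal sketch of turning a word's raw topic counts (table 0) plus the summary row (table 1) into a smoothed topic-word probability. The counts, vocabulary size, and `beta` prior below are made-up toy values, and the on-disk file format is not parsed here; this only shows the standard LDA normalization.

```python
# Sketch (assumed values, not LightLDA's actual output): compute
# P(word | topic) from raw counts using Dirichlet smoothing:
#   P(w | t) = (n_wt + beta) / (n_t + V * beta)
# where n_wt comes from the word-topic table (table 0) and n_t from
# the summary row (table 1).

def topic_word_prob(word_topic_count, summary_row, vocab_size, beta=0.01):
    """Smoothed P(word | topic) for one word across all topics."""
    return [
        (word_topic_count[t] + beta) / (summary_row[t] + vocab_size * beta)
        for t in range(len(summary_row))
    ]

# Toy example: 3 topics, vocabulary of 5 words.
summary_row = [100, 40, 60]    # total token count per topic (table 1)
counts_for_word = [30, 0, 5]   # this word's count in each topic (table 0)
phi = topic_word_prob(counts_for_word, summary_row, vocab_size=5)
```

The summary row is what makes the normalization cheap: the sampler keeps per-topic totals up to date so it never has to sum a column of the word-topic table.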

As for the doc-topic distribution, sorry, the current version doesn't provide it. Our usual workflow is to train the LDA model to get the word-topic table, and then use that model to infer topics for other documents in downstream applications.

It would be easy to output the doc-topic distribution if users find it useful. Contributions are also warmly welcomed.
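For anyone who needs it before such an output exists, a doc-topic distribution is straightforward to derive from a document's per-token topic assignments. The sketch below uses toy assignments and an assumed `alpha` prior; it is not LightLDA code, just the standard smoothed estimate.

```python
# Sketch (assumed inputs): estimate P(topic | doc) from the topic
# assignments of a single document's tokens, with Dirichlet smoothing:
#   P(t | d) = (n_dt + alpha) / (N_d + K * alpha)
from collections import Counter

def doc_topic_prob(topic_assignments, num_topics, alpha=0.1):
    """Smoothed P(topic | doc) for one document."""
    counts = Counter(topic_assignments)
    n_d = len(topic_assignments)
    return [
        (counts.get(t, 0) + alpha) / (n_d + num_topics * alpha)
        for t in range(num_topics)
    ]

# Toy document of 5 tokens, assigned to topics 0, 0, 2, 1, 0.
theta = doc_topic_prob([0, 0, 2, 1, 0], num_topics=3)
```

Unlike the per-word rows of the word-topic table, `theta` is a proper distribution over topics, so its entries sum to 1.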

boche commented 8 years ago

Indeed, I think outputting both the doc-topic distribution and the topic-word distribution would be helpful.

feiga commented 8 years ago

@wanggaohang @boche

LightLDA can now dump the doc-topic statistics when training finishes. Thanks.

LWP-PING commented 8 years ago

@feiga For the word-topic table, if the value is bigger, does that mean the word has a bigger weight in that topic?
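Roughly yes, but only within a single topic's column: the raw counts are inflated for topics that simply have more tokens overall, so comparing a word's counts across topics is only meaningful after normalizing by the summary row. A toy illustration (made-up numbers, not LightLDA output):

```python
# Sketch: why raw word-topic counts can mislead across topics.
# Topic 0 is ten times larger than topic 1, so its raw counts run higher.
summary = [1000, 100]     # per-topic token totals (summary row)
word_counts = [50, 30]    # one word's counts in each topic

# By raw count, topic 0 looks dominant for this word...
raw_argmax = word_counts.index(max(word_counts))

# ...but after normalizing by topic size, topic 1 actually gives the
# word a much larger share (0.30 vs 0.05).
norm = [word_counts[t] / summary[t] for t in range(len(summary))]
norm_argmax = norm.index(max(norm))
```

So a bigger count means more occurrences of the word were assigned to that topic, while the word's *weight* in a topic is better read from the normalized value.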