microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io

Can I save model parameters throughout training, not just at the end? #32

Open michaelchughes opened 8 years ago

michaelchughes commented 8 years ago

Thanks for making your code available! I'm interested in benchmarking the single-machine version of LightLDA against some other topic model algorithms on moderately sized datasets (roughly 10k to a few million documents). One thing I'd like to do is save snapshots of the model parameters to disk throughout training, for example after the 1st iteration, the 100th iteration, the 200th iteration, and so on. Is this possible?

Right now, it seems that the model is only saved after the final iteration. This is undesirable: runs can take a long time, and I'd like to inspect intermediate results. It's also useful for understanding how long a run really needs to be before performance starts to plateau.

If this isn't possible with the current code, I'll try to make the necessary changes myself. My guess is that I'll need to add code to the iteration loop, around lines 67-90 of lightlda.cpp: https://github.com/Microsoft/lightlda/blob/master/src/lightlda.cpp

When the desired checkpoint iteration is reached, I should call something like the DumpModel() function from multiverso/server.cpp https://github.com/Microsoft/multiverso/blob/9ed99cd2d3080a8683d1c511de5927e2b7274438/src/multiverso/server.cpp
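To make the idea concrete, here's a standalone sketch of the pattern I have in mind (the `TrainOneIteration` and `DumpModelSnapshot` names are placeholders of my own, not LightLDA's actual API):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Placeholder for one full training sweep over the corpus
// (in LightLDA this would be the body of the iteration loop in lightlda.cpp).
void TrainOneIteration(int32_t iter) { /* ... */ }

// Placeholder for writing the current model (word-topic table) to disk;
// in LightLDA this would presumably end up calling DumpModel() / Trainer::Dump().
void DumpModelSnapshot(int32_t iter, const std::string& out_dir)
{
    std::cout << "Saving snapshot for iteration " << iter
              << " to " << out_dir << std::endl;
}

int main()
{
    const int32_t num_iterations = 300;
    const int32_t checkpoint_interval = 100;   // snapshot after iterations 1, 100, 200, ...
    const std::string out_dir = "./snapshots";

    for (int32_t iter = 1; iter <= num_iterations; ++iter)
    {
        TrainOneIteration(iter);
        if (iter == 1 || iter % checkpoint_interval == 0)
        {
            DumpModelSnapshot(iter, out_dir);
        }
    }
    return 0;
}
```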

Does that sound about right? Any other tips? Thanks in advance!

michaelchughes commented 8 years ago

Quick follow-up: it looks like Trainer::Dump might almost do what I need, except that I can't tell whether it saves the word-topic counts for only the current slice/block of data or for the entire model.

Relevant file: https://github.com/Microsoft/lightlda/blob/master/src/trainer.cpp

Any help figuring this out would be much appreciated.

feiga commented 8 years ago

Yes, you're right. You can try uncommenting line 100 of trainer.cpp: https://github.com/Microsoft/lightlda/blob/master/src/trainer.cpp#L100
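If dumping after every iteration is too much I/O for your runs, you could guard that call with an interval check. A rough sketch only; the `dump_interval` setting is hypothetical (not an existing LightLDA config option), and you should check the arguments the Dump call at line 100 actually takes:

```cpp
#include <cstdint>
#include <iostream>

// Only write a snapshot at iteration 1 and then every `dump_interval` iterations.
// In trainer.cpp you would wrap the uncommented Dump call with this condition.
bool ShouldDump(int32_t iter, int32_t dump_interval)
{
    return iter == 1 || (dump_interval > 0 && iter % dump_interval == 0);
}

int main()
{
    // With dump_interval = 100, snapshots would be written at iterations 1, 100, 200, 300.
    for (int32_t iter = 1; iter <= 300; ++iter)
    {
        if (ShouldDump(iter, 100))
        {
            std::cout << "dump at iteration " << iter << "\n";
        }
    }
    return 0;
}
```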

Since you're going to benchmark algorithm performance on a single machine, one issue is that the open-sourced code is designed as a distributed system running on a cluster. There is still communication logic even when you run on one machine. We have pipelined the training and communication, which means the model a worker uses can be slightly stale and thus performance can suffer a little. We may refactor the code later to make the single-machine version better.