microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

Questions about the DMTK LightLDA #50

Closed. NicolasWinckler closed this issue 7 years ago.

NicolasWinckler commented 7 years ago

Hello,

First of all, thank you for sharing the LightLDA code, which seems to me the best open-source C++ LDA implementation thanks to its Metropolis-Hastings sampler and its scalability. I would like to integrate LightLDA into my project, which requires changing the I/O from files to MongoDB. I have started to look at the code in more detail and have some questions.

1. Any hint where I should first look for such a task? From my understanding, the input data handling is done in multiverso::lightlda::DataBlock, and also in multiverso::lightlda::DiskDataStream and/or multiverso::lightlda::MemoryDataStream. Is that correct?
2. About the multiverso::lightlda::DataBlock implementation: in https://github.com/Microsoft/LightLDA/blob/master/src/data_block.cpp, line 98, I do not understand what the purpose of the Write() function is. I have probably missed something, but as I understand it, DataBlock::Write() writes into a temporary file, e.g. block.0.temp, the buffer that was filled by DataBlock::Read(file), where file is the original input data block file, e.g. block.0. Then block.0.temp is renamed to block.0, overwriting the original input. Is this correct? If so, why is this step needed? (A sketch of this flow follows the list.)
3. Do you intend to refactor LightLDA to be compatible with the new Multiverso API at some point?
4. In the WWW15 LightLDA paper, it is said that LightLDA is built on top of the Petuum framework in SSP mode with parameter s = 1, which no longer seems to be the case, does it? I have also read that Multiverso only supports BSP and ASP modes. Which one does the DMTK LightLDA use, BSP or ASP?
5. What are the main differences between the Petuum version described in the paper and the new one on the DMTK GitHub repository?
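
For reference, here is a minimal sketch of the write-to-temp-then-rename flow described in question 2, assuming the buffer is a flat array of int32 values; the function name CheckpointBlock and the layout are placeholders, not LightLDA's actual serialization code:

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Write the in-memory buffer (tokens plus their current topic assignments)
// to <block_file>.temp, then swap it in place of the original block file.
// File naming and buffer layout here are placeholders, not LightLDA's format.
void CheckpointBlock(const std::string& block_file,
                     const std::vector<int32_t>& buffer)
{
    const std::string temp_file = block_file + ".temp";
    {
        std::ofstream out(temp_file, std::ios::binary);
        out.write(reinterpret_cast<const char*>(buffer.data()),
                  static_cast<std::streamsize>(buffer.size() * sizeof(int32_t)));
    }   // ensure the stream is flushed and closed before renaming

    // Swapping in the finished temp file keeps the on-disk block consistent
    // with the latest topic assignments, so it can double as a restart
    // checkpoint. (std::rename overwrites an existing destination on POSIX;
    // on Windows the old file has to be removed first, as done here.)
    std::remove(block_file.c_str());
    std::rename(temp_file.c_str(), block_file.c_str());
}
```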

Thank you very much for your reply

feiga commented 7 years ago
  1. Yes, the DataBlock defines how to read/write data from/to disk. You may need to change the disk I/O to your MongoDB I/O API (see the sketch after this list). DiskDataStream is just a wrapper for out-of-core computing.
  2. The Write() function writes the data to a disk file in binary format. The data contains not only the tokens in the corpus, but also the topic assigned to each token, which is part of the model parameters and changes during training. We periodically write the data to persistent storage, so the data file also serves as the checkpoint file. This is necessary for the model slicing that our paper describes.
  3. No, we don't have plans to use the new API. Currently LightLDA works with the previous version.
  4. The latest version of Multiverso only supports BSP/ASP. LightLDA doesn't use the latest API; the open-sourced implementation is indeed SSP mode with s = 1.
  5. It's just a different implementation.
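
For point 1, a rough sketch of what a MongoDB-backed block reader might look like using the mongocxx driver. The database and collection names, the doc_id and tokens fields, and the MongoDoc struct are assumptions for illustration, not part of LightLDA:

```cpp
#include <cstdint>
#include <string>
#include <vector>

#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

// Hypothetical document layout; LightLDA's real DataBlock stores tokens and
// their topic assignments in a flat binary buffer instead.
struct MongoDoc
{
    int64_t doc_id;
    std::vector<int32_t> tokens;   // word ids; topics would be tracked per token too
};

// Reads one "block" of documents from MongoDB. Database, collection, and
// field names below are assumptions for the example.
std::vector<MongoDoc> ReadBlockFromMongo(const std::string& uri_str)
{
    static mongocxx::instance instance{};   // must be created once per process
    mongocxx::client client{mongocxx::uri{uri_str}};
    auto collection = client["lightlda"]["corpus"];

    std::vector<MongoDoc> block;
    for (auto&& view : collection.find({}))
    {
        MongoDoc doc;
        doc.doc_id = view["doc_id"].get_int64().value;
        for (auto&& token : view["tokens"].get_array().value)
        {
            doc.tokens.push_back(token.get_int32().value);
        }
        block.push_back(std::move(doc));
    }
    return block;
}
```

Documents read this way would still need to be packed into DataBlock's internal buffer (tokens plus their topic assignments) before training.
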
NicolasWinckler commented 7 years ago

Thanks a lot for your fast reply!