microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License
842 stars, 234 forks

Please provide recommendations in the documentation for how to divide a corpus #16

Closed: bskaggs closed this issue 8 years ago

bskaggs commented 8 years ago

Thanks for sharing this implementation!

How do the number of machines, the amount of memory per machine, the corpus size, and the vocabulary size affect how a user should divide up a corpus?

feiga commented 8 years ago

Hi. To partition your corpus: if you have n machines, you need to split it into at least n parts. If each part fits in memory, it is fine to use one part per machine. If a part is too big to load into memory, you can split that machine's part further into multiple parts and run LightLDA with out-of-core computing.
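As a rough illustration of the rule above, the sketch below computes how many parts each machine would need. It is not part of LightLDA: the function name `plan_partitions` and the `max_block_bytes` parameter (the largest part you are willing to load at once) are hypothetical, and the sketch assumes the corpus is spread evenly across machines.

```python
def plan_partitions(corpus_bytes, num_machines, max_block_bytes):
    """Estimate how many parts each machine needs (illustrative only).

    Assumes the corpus is divided evenly across machines; max_block_bytes
    is a hypothetical upper bound on the size of one in-memory part.
    """
    if num_machines < 1 or max_block_bytes < 1:
        raise ValueError("num_machines and max_block_bytes must be positive")

    # Integer ceiling division avoids float rounding on large byte counts.
    per_machine_bytes = -(-corpus_bytes // num_machines)

    # One part per machine if it fits; otherwise split further (out-of-core).
    blocks_per_machine = max(1, -(-per_machine_bytes // max_block_bytes))

    return {
        "per_machine_bytes": per_machine_bytes,
        "blocks_per_machine": blocks_per_machine,
        "total_blocks": blocks_per_machine * num_machines,
        "out_of_core": blocks_per_machine > 1,
    }


if __name__ == "__main__":
    GiB = 2 ** 30
    # Example: a 100 GiB corpus, 4 machines, at most 10 GiB per part.
    print(plan_partitions(100 * GiB, 4, 10 * GiB))
```

With these example numbers, each machine holds 25 GiB, so it needs 3 parts and has to run out of core.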

bskaggs commented 8 years ago

Is there any significant memory overhead beyond the size of the file on disk? I.e., if I have approximately 15GB of memory free per machine, should I plan on making 15GB files?

feiga commented 8 years ago

Your file size should be less than your available memory. The program needs not only to load the data but also to allocate memory for execution. This note may help you: https://github.com/Microsoft/lightlda/tree/master/example#note-on-the-arguments-about-capacity
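One way to turn this into a number is to reserve part of the free memory for the program's own allocations and size files to what is left. The sketch below does that arithmetic. The thread does not state how much to reserve, so the `headroom_percent` value is purely an assumption for illustration, and `max_file_bytes` is a hypothetical helper, not a LightLDA function. The linked note on the capacity arguments is the place to look for real figures.

```python
def max_file_bytes(free_mem_bytes, headroom_percent):
    """Largest data file size after reserving headroom (illustrative only).

    headroom_percent is an assumed share of free memory kept back for the
    program's working allocations; it is not a documented LightLDA value.
    """
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    # Integer arithmetic keeps the result exact for large byte counts.
    return free_mem_bytes * (100 - headroom_percent) // 100


if __name__ == "__main__":
    GiB = 2 ** 30
    # Example: 15 GiB free, with an assumed 40% kept back for execution.
    print(max_file_bytes(15 * GiB, 40) / GiB, "GiB per file")
```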

bskaggs commented 8 years ago

Great, thank you!