Closed bskaggs closed 8 years ago
Hi, to partition your corpus: first, if you have n machines, you need at least n partitions. If each part fits in memory, it's fine to use one part per machine. If a part is too big to load into memory, you can further split that machine's part into multiple pieces and run LightLDA with out-of-core computing.
Is there any significant memory overhead beyond the size of the file on disk? I.e., if I have approximately 15GB of memory free per machine, should I plan on making 15GB files?
Your file size should be less than your available memory, because the program needs to allocate memory not only for loading the data but also for execution. This note may help: https://github.com/Microsoft/lightlda/tree/master/example#note-on-the-arguments-about-capacity
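To make the rule of thumb above concrete, here is a small sketch (not part of LightLDA) that picks a partition count so each data block fits in memory with headroom left for the sampler's own allocations. The function name and the 0.5 headroom factor are illustrative assumptions, not LightLDA defaults:

```python
import math

def num_partitions(corpus_gb, n_machines, mem_per_machine_gb, headroom=0.5):
    """Suggest how many data blocks to split the corpus into.

    headroom: fraction of machine memory a data block may occupy;
    the rest is reserved for model and execution state.
    (Illustrative heuristic, not a LightLDA API.)
    """
    max_block_gb = mem_per_machine_gb * headroom
    blocks_needed = math.ceil(corpus_gb / max_block_gb)
    # Need at least one block per machine.
    return max(n_machines, blocks_needed)

# 60 GB corpus, 2 machines with 15 GB RAM each, 50% headroom:
# blocks of at most 7.5 GB -> 8 blocks total.
print(num_partitions(60, 2, 15))  # 8
```

If a block still cannot fit, further splitting only increases out-of-core swapping, not correctness, so smaller blocks are a safe default.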
Great, thank you!
Thanks for sharing this implementation!
How does the number of machines, amount of memory per machine, corpus size, and vocabulary size affect how a user should divide up a corpus?