Iterate over input data, don't load into memory

mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

https://mimno.github.io/Mallet/

Other

984 stars, 344 forks

Iterate over input data, don't load into memory #170

Open jfelectron opened 5 years ago

jfelectron commented 5 years ago

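The current implementation loads the entire input file into memory, so memory use grows with input size and is eventually exhausted on large data sets. This PR is a proof of concept (POC) for out-of-core data sets: it iterates over the input instead of loading it all at once.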
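To illustrate the general idea (this is not the code from the PR), here is a minimal sketch in plain Java that reads one line at a time instead of materializing the whole file. The class `LineStreamSketch` and its `countTokens` method are hypothetical names invented for this example, and whitespace token counting stands in for whatever per-instance processing is actually needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper (not part of MALLET): yields one input line at a time. */
final class LineStreamSketch implements Iterator<String>, AutoCloseable {
    private final BufferedReader reader;
    private String next; // the only line held in memory between calls

    LineStreamSketch(Path path) throws IOException {
        this.reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
        this.next = reader.readLine();
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        try {
            next = reader.readLine(); // read ahead by exactly one line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    /** Stand-in workload: counts whitespace-separated tokens line by line. */
    static long countTokens(Path path) throws IOException {
        long total = 0;
        try (LineStreamSketch lines = new LineStreamSketch(path)) {
            while (lines.hasNext()) {
                String line = lines.next().trim();
                if (!line.isEmpty()) {
                    total += line.split("\\s+").length;
                }
            }
        }
        return total;
    }
}
```

With this shape, peak memory depends on the longest line and the reader's buffer rather than on the file size, as long as the consumer does not itself accumulate every line.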

Notes:

- I'm not a Java dev. This works but can likely be improved. The PR is mainly for visibility of the issue, which I spent a week or more on, on and off.
- This does not solve the issue of LDA model training trying to load all the data into memory. If possible, we should find a way to make that iterable as well.

Further improvements:

- Make use of threads to speed this up. Using one thread for 100M-plus instances takes quite a while.
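One possible shape for the threaded version, sketched in plain Java under stated assumptions: a single thread reads lines into fixed-size batches, a worker pool processes the batches, and a semaphore caps how many batches are in flight so memory stays bounded. `ParallelBatchSketch`, `run`, and `countTokens` are hypothetical names, and token counting again stands in for the real per-instance work. The sketch assumes each line can be processed independently; whether MALLET's own pipes are safe to call from several threads is not established here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch (not MALLET code): one reader thread, many workers. */
final class ParallelBatchSketch {

    /** Stand-in for real per-instance work. */
    static long countTokens(String line) {
        String trimmed = line.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    static long run(Path path, int threads, int batchSize)
            throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        // At most 2 * threads batches exist at once, so memory stays bounded.
        Semaphore inFlight = new Semaphore(threads * 2);
        LongAdder total = new LongAdder();
        try (BufferedReader reader =
                 Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            List<String> batch = new ArrayList<>(batchSize);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    submit(pool, inFlight, batch, total);
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                submit(pool, inFlight, batch, total); // final partial batch
            }
        } finally {
            pool.shutdown();
            // Wait for every submitted batch to finish before reading the sum.
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        }
        return total.sum();
    }

    private static void submit(ExecutorService pool, Semaphore inFlight,
                               List<String> batch, LongAdder total)
            throws InterruptedException {
        inFlight.acquire(); // blocks the reader while the workers are saturated
        pool.execute(() -> {
            try {
                long n = 0;
                for (String l : batch) {
                    n += countTokens(l);
                }
                total.add(n);
            } finally {
                inFlight.release();
            }
        });
    }
}
```

Batching amortizes the per-task overhead, which matters at the 100M-instance scale mentioned above. The sketch does not propagate exceptions thrown inside worker tasks; a real implementation would need to.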