MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
The current implementation loads the entire input file into memory, which causes memory growth and eventual exhaustion on large data sets. This PR is a proof of concept for out-of-core data sets.
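For readers unfamiliar with the out-of-core idea, here is a minimal sketch of reading input lazily, one line at a time, instead of loading the whole file. It is not the code in this PR: `LineStream` is a hypothetical name, and it uses only the JDK, not MALLET classes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical helper: yields one line at a time so only one line is buffered. */
final class LineStream implements Iterator<String>, AutoCloseable {
    private final BufferedReader reader;
    private String next; // the line that the next call to next() will return

    LineStream(Path path) throws IOException {
        this.reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
        this.next = reader.readLine(); // null if the file is empty
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String current = next;
        try {
            next = reader.readLine(); // read ahead by exactly one line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return current;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

Memory use then depends on the longest line rather than on the file size, provided the consumer does not itself collect every line.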
Notes:
I am not a Java developer; this works but can likely be improved. The PR is mainly for visibility of the issue, which I spent a week or more on, on and off.
This does not solve the issue of LDA model training trying to load all the data into memory. If possible, we should find a way to make that iterable as well.
Further improvements:
Make use of threads to speed this up; using one thread for 100M-plus instances takes quite a while.
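One possible shape for the threaded version, as a rough sketch only: a single reader thread cuts the input into batches and hands them to a fixed pool of workers, while a semaphore caps the number of batches in flight so memory stays bounded. `BatchedWorkers`, `process`, and the `work` callback are hypothetical names, and summing an int per item stands in for whatever real per-instance processing would be done.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.ToIntFunction;

final class BatchedWorkers {
    /**
     * Applies work to every item using a pool of `threads` workers,
     * `batchSize` items per task, and returns the sum of the results.
     * At most 2 * threads batches exist at once (assumed cap, tune as needed).
     */
    static long process(Iterator<String> items, int batchSize, int threads,
                        ToIntFunction<String> work) throws InterruptedException {
        int maxInFlight = 2 * threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Semaphore inFlight = new Semaphore(maxInFlight);
        AtomicLong total = new AtomicLong();
        try {
            while (items.hasNext()) {
                List<String> batch = new ArrayList<>(batchSize);
                while (items.hasNext() && batch.size() < batchSize) {
                    batch.add(items.next());
                }
                inFlight.acquire(); // blocks the reader while all slots are busy
                pool.execute(() -> {
                    try {
                        long sum = 0;
                        for (String item : batch) {
                            sum += work.applyAsInt(item);
                        }
                        total.addAndGet(sum);
                    } finally {
                        inFlight.release(); // free the slot even if work throws
                    }
                });
            }
            // Taking every permit back means every submitted batch has finished.
            inFlight.acquire(maxInFlight);
        } finally {
            pool.shutdown();
        }
        return total.get();
    }
}
```

Batches complete in no particular order, so this only fits processing where the order of instances does not matter, or where the output is reordered afterwards.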