mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded #198

Open jialu-stellar-xia opened 3 years ago

jialu-stellar-xia commented 3 years ago

I get this error when importing my data into the format for LDA. I tried increasing MALLET_MEMORY to 128G (my server also has 128G of memory), but it still does not work.
My data contains 6,712,484 documents in one .txt file, 3.07G in size. I sampled 100 documents to test the import script, and it works well, but the error keeps appearing when I import the entire data set. Could you please help me figure out the problem? I really appreciate your help!

[Screenshot, 2021-04-11, 8:14:08 PM]
mimno commented 3 years ago

The "bulk-load" function may be more efficient, but a collection of that size should definitely fit in 128G. I suspect the variable isn't being set in a way that the shell script can find.
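
One way to test that suspicion is to ask a JVM how much heap it was actually given. The class below is a minimal sketch, not part of MALLET: `HeapCheck` and `toGiB` are hypothetical names, and it assumes the launcher script turns the memory variable into the JVM's `-Xmx` option, which this thread does not confirm. It only uses the standard `Runtime.maxMemory()` call.

```java
// HeapCheck.java: hypothetical diagnostic helper, not part of MALLET.
public class HeapCheck {

    // Convert a byte count to gibibytes (1 GiB = 1024^3 bytes).
    public static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // maxMemory() reports the most heap this JVM will attempt to use.
        // It reflects -Xmx when that option was passed on the command line;
        // otherwise it reflects the JVM's own default.
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.printf("JVM max heap: %.2f GiB%n", toGiB(maxHeap));
    }
}
```

On Java 11 or later this can be run directly with `java -Xmx128g HeapCheck.java`. The reported figure may come out slightly below the requested value depending on the garbage collector. If the `-Xmx` value on the running import's `java` command line is far smaller than what was requested, the memory setting is probably not reaching the JVM.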