smilli / berkeleylm

Automatically exported from code.google.com/p/berkeleylm

-mx1000m not appropriate for Google n-grams #6

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
make-binary-from-google.sh currently uses -mx1000m

java -ea -mx1000m -server -cp ../src edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle ../test/edu/berkeley/nlp/lm/io/googledir google.binary

However, with that setting I quickly run out of heap space.

I tried -mx4000m, but that also ran out of heap space after about 2.5 hours.
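
Before committing to a run that may fail hours in, it can help to confirm how much heap the JVM actually received from the -mx flag. The following is a minimal sketch, not part of berkeleylm: the class name HeapCheck and the method maxHeapMiB are hypothetical, and it relies only on the standard Runtime.maxMemory() call.

```java
// Hypothetical helper, not part of berkeleylm: reports the JVM's heap ceiling.
public class HeapCheck {

    // Maximum amount of memory the JVM will attempt to use, in MiB.
    // Runtime.maxMemory() returns Long.MAX_VALUE when there is no inherent limit.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with the same flag as the build (for example, java -mx4000m HeapCheck) should print a figure close to the requested size. The reported value can be somewhat lower than the flag, depending on the garbage collector in use.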

What is an appropriate -mx setting for training on all 5-grams?
What size EC2 instance should I spin up?
How long will it take to train on all 5-grams?

Original issue reported on code.google.com by tur...@gmail.com on 24 Nov 2011 at 2:59

GoogleCodeExporter commented 9 years ago
Sorry, I don't know how I missed this bug report for so long! Not sure what 
happened. 

Are you actually talking about running on the full Google n-grams corpus? If 
so, then you need a substantial amount of memory, much more than the 10GB needed 
to store the n-grams once the binary is built. I haven't actually figured out 
what the minimum is, but I would think you need at least 50GB of 
memory, which is available on large EC2 instances. 

However, I have already built binaries for these, so you can just download 
those (instructions are on the web page). 

Original comment by adpa...@gmail.com on 19 Feb 2012 at 5:48

GoogleCodeExporter commented 9 years ago

Original comment by adpa...@gmail.com on 9 Aug 2012 at 5:30

GoogleCodeExporter commented 9 years ago
How long does it take to build the LM on the full n-grams corpus?

Original comment by tur...@gmail.com on 19 Aug 2012 at 9:33

GoogleCodeExporter commented 9 years ago
I think it takes something on the order of 24 hours, maybe a little less. It's 
not something I've optimized heavily, so sorry about that. Let me know if you 
have any trouble building it yourself (other than time and memory issues...).

Original comment by adpa...@gmail.com on 20 Aug 2012 at 6:51