mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Java heap space Error : while importing large data #129

Open karthikasathishkumar opened 6 years ago

karthikasathishkumar commented 6 years ago

I changed MEMORY=1g to MEMORY=10g in the bin/mallet shell script and ran the import command on 5 GB of input, on 64-bit Ubuntu 14 with 16 GB of RAM. I get the error below. How can I overcome it? Please suggest a better way to import data of this size (5 GB in total).

java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
    at java.lang.StringBuffer.<init>(StringBuffer.java:128)
    at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:94)
    at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:83)
    at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:47)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:295)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
    at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:322) 

mimno commented 6 years ago

You might be able to use the "bulk load" feature. It has fewer options, but may be more efficient.

    $ bin/mallet bulk-load --help
    Efficient tool for importing large amounts of text into Mallet format
    --help TRUE|FALSE
      Print this command line option usage information. Give argument of TRUE for longer documentation
      Default is false
    --prefix-code 'JAVA CODE'
      Java code you want run before any other interpreted code. Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's when creating objects.
      Default is null
    --config FILE
      Read command option values from a file
      Default is null
    --input FILE
      The file containing data, one instance per line
      Default is null
    --output FILE
      Write the instance list to this file
      Default is mallet.data
    --preserve-case [TRUE|FALSE]
      If true, do not force all strings to lowercase.
      Default is false
    --remove-stopwords [TRUE|FALSE]
      If true, remove common "stop words" from the text. This option invokes a minimal English stoplist.
      Default is false
    --stoplist FILE
      Read newline-separated words from this file, and remove them from text. This option overrides the default English stoplist triggered by --remove-stopwords.
      Default is null
    --keep-sequence [TRUE|FALSE]
      If true, final data will be a FeatureSequence rather than a FeatureVector.
      Default is false
    --line-regex REGEX
      Regular expression containing regex-groups for label, name and data.
      Default is ^([^\t]*)\t([^\t]*)\t(.*)
    --name INTEGER
      The index of the group containing the instance name. Use 0 to indicate that this field is not used.
      Default is 1
    --label INTEGER
      The index of the group containing the label string. Use 0 to indicate that this field is not used.
      Default is 2
    --data INTEGER
      The index of the group containing the data.
      Default is 3
    --prune-count N
      Reduce features to those that occur more than N times.
      Default is 0
    --prune-doc-frequency N
      Remove features that occur in more than (X*100)% of documents. 0.05 is equivalent to IDF of 3.0.
      Default is 1.0
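
The stack trace in the question goes through Text2Vectors, which is the class behind import-dir, while bulk-load wants a single file with one instance per line. If the 5 GB is spread over a directory of text files, it has to be flattened first. Below is a minimal Python sketch of that step, not something shipped with MALLET. It assumes an import-dir style layout where each immediate subdirectory name is the label; `flatten_dir` and the placeholder label `none` are hypothetical names made up for this example. Tabs and newlines inside each document are collapsed to spaces so that every line matches the default --line-regex (name, label, data).

```python
import re
from pathlib import Path

# Default --line-regex from the help text: name<TAB>label<TAB>data.
LINE_REGEX = re.compile(r"^([^\t]*)\t([^\t]*)\t(.*)")


def _one_line(s):
    """Collapse every run of whitespace (tabs, newlines, ...) to one space."""
    return " ".join(s.split())


def flatten_dir(input_dir, output_file, encoding="utf-8"):
    """Write one 'name<TAB>label<TAB>text' line per file under input_dir.

    Assumption: the first path component below input_dir is the label.
    Files sitting directly in input_dir get the placeholder label 'none'.
    output_file should live outside input_dir so it is not read back in.
    Returns the number of instances written.
    """
    root = Path(input_dir)
    count = 0
    with open(output_file, "w", encoding=encoding, newline="\n") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            rel = path.relative_to(root)
            label = rel.parts[0] if len(rel.parts) > 1 else "none"
            # Only one document is held in memory at a time.
            text = path.read_text(encoding=encoding, errors="replace")
            fields = (_one_line(rel.as_posix()), _one_line(label), _one_line(text))
            out.write("\t".join(fields) + "\n")
            count += 1
    return count
```

After calling, say, `flatten_dir("corpus", "corpus.tsv")` (both paths are placeholders), the result can be passed to the options listed above:

    $ bin/mallet bulk-load --input corpus.tsv --output corpus.mallet --keep-sequence TRUE

--keep-sequence TRUE is only needed when the output has to be a FeatureSequence, as it does for topic modeling.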