Java heap space error while importing large data

mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

java.lang.OutOfMemoryError: Java heap space
    at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:68)
    at java.lang.StringBuffer.<init>(StringBuffer.java:128)
    at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:94)
    at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:83)
    at cc.mallet.pipe.Input2CharSequence.pipe(Input2CharSequence.java:47)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:295)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:291)
    at cc.mallet.pipe.Pipe$SimplePipeInstanceIterator.next(Pipe.java:283)
    at cc.mallet.types.InstanceList.addThruPipe(InstanceList.java:267)
    at cc.mallet.classify.tui.Text2Vectors.main(Text2Vectors.java:322)

You might be able to use the "bulk load" feature. It has fewer options, but may be more efficient.

$ bin/mallet bulk-load --help
Efficient tool for importing large amounts of text into Mallet format
--help TRUE|FALSE
  Print this command line option usage information. Give argument of TRUE for longer documentation
  Default is false
--prefix-code 'JAVA CODE'
  Java code you want run before any other interpreted code. Note that the text is interpreted without modification, so unlike some other Java code options, you need to include any necessary 'new's when creating objects.
  Default is null
--config FILE
  Read command option values from a file
  Default is null
--input FILE
  The file containing data, one instance per line
  Default is null
--output FILE
  Write the instance list to this file
  Default is mallet.data
--preserve-case [TRUE|FALSE]
  If true, do not force all strings to lowercase.
  Default is false
--remove-stopwords [TRUE|FALSE]
  If true, remove common "stop words" from the text. This option invokes a minimal English stoplist.
  Default is false
--stoplist FILE
  Read newline-separated words from this file, and remove them from text. This option overrides the default English stoplist triggered by --remove-stopwords.
  Default is null
--keep-sequence [TRUE|FALSE]
  If true, final data will be a FeatureSequence rather than a FeatureVector.
  Default is false
--line-regex REGEX
  Regular expression containing regex-groups for label, name and data.
  Default is ^([^\t]*)\t([^\t]*)\t(.*)
--name INTEGER
  The index of the group containing the instance name. Use 0 to indicate that this field is not used.
  Default is 1
--label INTEGER
  The index of the group containing the label string. Use 0 to indicate that this field is not used.
  Default is 2
--data INTEGER
  The index of the group containing the data.
  Default is 3
--prune-count N
  Reduce features to those that occur more than N times.
  Default is 0
--prune-doc-frequency N
  Remove features that occur in more than (X*100)% of documents. 0.05 is equivalent to IDF of 3.0.
  Default is 1.0
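
Since bulk-load expects one instance per line, it may help to see how the default `--line-regex` splits a line into name, label and data (groups 1, 2 and 3, matching the `--name`, `--label` and `--data` defaults). The following is only a rough sketch using plain `java.util.regex`, not Mallet's own code: the class `BulkLoadLineDemo` and the helpers `toLine` and `parse` are hypothetical names made up for illustration, and how Mallet applies the pattern internally is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BulkLoadLineDemo {
    // Default --line-regex from the help text: three tab-separated fields.
    static final Pattern LINE = Pattern.compile("^([^\\t]*)\\t([^\\t]*)\\t(.*)");

    // Hypothetical helper: flatten one document into a single tab-separated line.
    // Assumes name and label contain no tabs or line breaks themselves.
    static String toLine(String name, String label, String text) {
        // Tabs and line breaks inside the text would break the
        // one-instance-per-line layout, so collapse them to spaces.
        String flat = text.replaceAll("[\\t\\r\\n]+", " ").trim();
        return name + "\t" + label + "\t" + flat;
    }

    // Returns {name, label, data} using group indices 1, 2, 3,
    // or null if the line does not match the pattern.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String line = toLine("doc1", "sports", "First line.\nSecond\tline.");
        String[] f = parse(line);
        // prints: name=doc1 label=sports data=First line. Second line.
        System.out.println("name=" + f[0] + " label=" + f[1] + " data=" + f[2]);
    }
}
```

If your data is currently a directory of files, something like `toLine` could be used to write each file as one line of a single input file, which is then passed to `--input`. A line with fewer than two tabs will not match the default pattern, so a custom `--line-regex` would be needed for other layouts.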

Java heap space Error : while importing large data #129