kulukimak / dkpro-core-asl

Automatically exported from code.google.com/p/dkpro-core-asl

io.text TextReader crashes when reading a large file #346

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The TextReader crashes when reading a single large file (3.5 GB).

This is the pipeline I used:

    SimplePipeline.runPipeline(
            createReaderDescription(TextReader.class,
                    TextReader.PARAM_SOURCE_LOCATION, corpusDirectory + sourceDirectory,
                    TextReader.PARAM_PATTERNS, "*.txt",
                    TextReader.PARAM_LANGUAGE, "de"),
            createEngineDescription(LanguageToolSegmenter.class),
            createEngineDescription(TreeTaggerPosLemmaTT4J.class),
            createEngineDescription(GermanSeparatedParticleAnnotator.class));

DKPro Core version 1.5.0.

Output (run with the following VM args: -Xms2048m -Xmx8000m):

Feb 26, 2014 10:37:03 AM de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase initialize(233)
Information: Found [1] resources to be read
Feb 26, 2014 10:37:05 AM de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerTT4JBase destroy(157)
Information: Cleaning up TreeTagger process
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:2367)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
    at java.lang.StringBuilder.append(StringBuilder.java:204)
    at org.apache.commons.io.output.StringBuilderWriter.write(StringBuilderWriter.java:138)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1365)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:1340)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:1315)
    at org.apache.commons.io.IOUtils.toString(IOUtils.java:525)
    at de.tudarmstadt.ukp.dkpro.core.io.text.TextReader.getNext(TextReader.java:78)
    at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:82)
    at org.apache.uima.fit.pipeline.SimplePipeline.runPipeline(SimplePipeline.java:115)
    at de.tudarmstadt.ukp.dkpro.argumentation.analysis.verbs.VerbClassPipeline.main(VerbClassPipeline.java:34)

Original issue reported on code.google.com by eckle.kohler on 26 Feb 2014 at 9:44
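The stack trace pins down the failure: TextReader.getNext() hands the whole stream to IOUtils.toString(), which accumulates the text in a StringBuilder backed by a char[]. Since a Java array cannot exceed roughly Integer.MAX_VALUE elements, growing the buffer for a 3.5 GB file triggers the "Requested array size exceeds VM limit" error no matter how much heap is available. A minimal sketch of the same failing pattern (simplified; not the actual TextReader source):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.commons.io.IOUtils;

    public class WholeFileRead {
        public static void main(String[] args) throws IOException {
            try (InputStream is = new FileInputStream(args[0])) {
                // For a multi-GB file this throws java.lang.OutOfMemoryError
                // ("Requested array size exceeds VM limit") while the backing
                // char[] of the StringBuilder is grown -- no -Xmx setting helps,
                // because a char[] is capped at about Integer.MAX_VALUE elements.
                String text = IOUtils.toString(is, "UTF-8");
                System.out.println("Read " + text.length() + " characters");
            }
        }
    }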

GoogleCodeExporter commented 9 years ago
The problem is that TextReader puts the entire content of the file into a single String; no matter how large your heap is, Java cannot create a String that big (a String is limited to Integer.MAX_VALUE characters). I suggest throwing a proper exception with a nicer message when dealing with such big files.
Perhaps we could have a reader like "BigTextReader" to deal with these cases...

Original comment by pedrobss...@gmail.com on 26 Feb 2014 at 10:48

GoogleCodeExporter commented 9 years ago
Well, I see nothing wrong with this message. When you read such a large file, this is absolutely to be expected.

Mind that "-Xms" sets the initial heap size, not the maximum; it is "-Xss" that controls the stack size. A larger stack might be required for deeply recursive algorithms, but it does not help with large strings.

To avoid this problem, either the file would need to be split beforehand, or a reader would need to be used that knows how to split the file into sensible portions. I do not think that "BigTextReader" captures this sufficiently.

How do you imagine the reader should know how to split the file(s)?

So the way to "fix" this would be to use a 64-bit VM and add more heap memory (maybe 16g?) ;) I do not think this is a DKPro Core defect.

Original comment by richard.eckart on 26 Feb 2014 at 11:02
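For the "split the file beforehand" route, a stand-alone pre-processing step could cut the corpus into chunks that TextReader can handle. A sketch under stated assumptions (the chunk size is arbitrary, and splitting on line boundaries is a naive choice; sentence or document boundaries would be more sensible for linguistic processing):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class SplitLargeFile {
        public static void main(String[] args) throws IOException {
            File input = new File(args[0]);          // e.g. the 3.5 GB corpus file
            long maxChunkChars = 100L * 1024 * 1024; // ~100M chars per chunk (assumption)
            int part = 0;
            long written = 0;
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new FileInputStream(input), StandardCharsets.UTF_8))) {
                Writer out = newChunk(input, part++);
                for (String line = in.readLine(); line != null; line = in.readLine()) {
                    out.write(line);
                    out.write('\n');
                    written += line.length() + 1;
                    if (written >= maxChunkChars) { // roll over on a line boundary
                        out.close();
                        out = newChunk(input, part++);
                        written = 0;
                    }
                }
                out.close();
            }
        }

        // Writes chunks next to the input as <name>.0.txt, <name>.1.txt, ...
        private static Writer newChunk(File input, int part) throws IOException {
            File f = new File(input.getParent(), input.getName() + "." + part + ".txt");
            return new BufferedWriter(new OutputStreamWriter(
                    new FileOutputStream(f), StandardCharsets.UTF_8));
        }
    }

The chunk files match the "*.txt" pattern from the pipeline above, so the reader configuration would not need to change, provided the original file is moved out of the source directory first.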

GoogleCodeExporter commented 9 years ago
Ah, yep. That's another one, but I think this problem is not even hit here. Also, this would again not be a DKPro Core problem, since the UIMA CAS uses a single Java String to represent the document text, which does not work for such large documents.

See also this thread: http://markmail.org/thread/55vmyfiecdciealx

Original comment by richard.eckart on 26 Feb 2014 at 11:10
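For scale: the CAS document text is one Java String, backed by a char[] of at most Integer.MAX_VALUE = 2,147,483,647 elements. Even under the generous assumption that every byte of the 3.5 GB file decodes to a single char, that is about 3.5 billion characters against a ceiling of roughly 2.1 billion, so the text cannot be represented in one CAS regardless of heap settings.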

GoogleCodeExporter commented 9 years ago
That is why I used 'Perhaps' ;-)
A configuration parameter could be used to set how to split the file, but as you mentioned, in the end the UIMA CAS uses a single Java String to represent the document text, so it does not work.

Regarding the message: it might be obvious to you and to me, but not to everyone; otherwise, this thread would not be happening here ;-)

Original comment by pedrobss...@gmail.com on 26 Feb 2014 at 11:24

GoogleCodeExporter commented 9 years ago
If the reader knew how to split a file, then the problem of hitting the CAS limit might not even occur.

Regarding the message: sure, we can try to add a sanity check, such as failing with a different message if a file is larger than x bytes. But consider that we would then probably have to do this everywhere, not only in the TextReader. In some cases, when a reader operates on a stream, it might not even be possible to determine the size. There are many, many chances of getting out-of-memory errors; we cannot handle them all. I see your point, but I think a thread like this one is sometimes a better way to deal with such errors than trying to handle certain kinds of errors in code.

Original comment by richard.eckart on 26 Feb 2014 at 11:33
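A sanity check of the kind discussed might look like the sketch below (hypothetical, not actual DKPro Core code; as noted, it only helps where the size is knowable, i.e. for plain files rather than streams):

    import java.io.File;
    import java.io.IOException;

    public class SizeCheck {
        // Rough upper bound: even at one char per byte, the text cannot
        // exceed the maximum length of a single Java String.
        private static final long MAX_DOCUMENT_BYTES = Integer.MAX_VALUE;

        static void checkReadable(File file) throws IOException {
            long size = file.length(); // 0 if the size cannot be determined
            if (size > MAX_DOCUMENT_BYTES) {
                throw new IOException("File [" + file + "] has " + size
                        + " bytes and cannot be loaded into a single document "
                        + "text. Split the file before processing.");
            }
        }
    }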

GoogleCodeExporter commented 9 years ago
Thanks a lot for the feedback - I was simply not aware of how this is handled internally.

Of course I can split the file myself before processing. (BTW, this corpus happened to be provided in our internal data repository.)

So from my side, the issue can be closed.

Original comment by eckle.kohler on 26 Feb 2014 at 11:37

GoogleCodeExporter commented 9 years ago

Original comment by richard.eckart on 26 Feb 2014 at 1:14