Automatically exported from code.google.com/p/dkpro-tc

OutOfMemoryError in MetaTask #65

GoogleCodeExporter commented 9 years ago
I'm running a BatchTaskCrossValidation with roughly 12k input files. After
about 3.5k files, I get:
java.lang.OutOfMemoryError: Java heap space
    at org.apache.uima.cas.impl.BinaryCasSerDes6.setupReadStream(BinaryCasSerDes6.java:3475)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.setupReadStreams(BinaryCasSerDes6.java:3421)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.deserializeAfterVersion(BinaryCasSerDes6.java:1628)
    at org.apache.uima.cas.impl.BinaryCasSerDes6.deserialize(BinaryCasSerDes6.java:1595)
    at org.apache.uima.cas.impl.Serialization.deserializeCAS(Serialization.java:266)
    at de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasReader.getNext(BinaryCasReader.java:100)
...
This happens even when none of my feature extractors runs a meta collector. 
Anybody with similar experiences?

Original issue reported on code.google.com by daxenber...@gmail.com on 7 Dec 2013 at 6:46

GoogleCodeExporter commented 9 years ago
Should we try it with a non-binary CAS format? I still suspect the problem is
related to that.

Original comment by oliver.ferschke on 7 Dec 2013 at 6:49

GoogleCodeExporter commented 9 years ago
I'm currently trying that.

Original comment by daxenber...@gmail.com on 7 Dec 2013 at 6:59

GoogleCodeExporter commented 9 years ago
SerializedCasWriter/Reader runs fine on the same data. No memory issues.

Original comment by daxenber...@gmail.com on 9 Dec 2013 at 9:18

GoogleCodeExporter commented 9 years ago
Ah, that's "good".
Of course it would be much better if there were no bincas problems, but at 
least the problem can be worked around rather easily now.

At least I remembered correctly that this issue first appeared when we
switched to bincas.
I propose we switch to serialized CASes directly and file an issue with DKPro 
Core.
Once the issue is solved, we can switch back.

Original comment by oliver.ferschke on 9 Dec 2013 at 9:34

GoogleCodeExporter commented 9 years ago
Sounds like a reasonable way to go. I'll run a couple of further tests before I 
switch back to serialized CASes.

Original comment by daxenber...@gmail.com on 9 Dec 2013 at 9:41

GoogleCodeExporter commented 9 years ago
If you have really found a memory leak in BinaryCasSerDes6, that would be
something to report to the UIMA issue tracker.

Original comment by richard.eckart on 9 Dec 2013 at 11:30

GoogleCodeExporter commented 9 years ago
Btw. it looks like you are using the BinaryCasReader/Writer, not the
SerializedCasReader/Writer. That's ok, because the BinaryCasReader/Writer
should be faster and produce smaller data.

Original comment by richard.eckart on 9 Dec 2013 at 11:31

GoogleCodeExporter commented 9 years ago
Yes, we are currently using BinaryCasReader/Writer, and it's causing the memory
issues. SerializedCasReader/Writer runs fine on the same data (without
compression).

Original comment by daxenber...@gmail.com on 9 Dec 2013 at 11:36

GoogleCodeExporter commented 9 years ago
This issue was closed by revision r473.

Original comment by daxenber...@gmail.com on 13 Dec 2013 at 1:10

GoogleCodeExporter commented 9 years ago
The problem was solved after switching to format "0" in BinaryCasWriter. It
indeed looks like a memory leak in BinaryCasSerDes6.
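
For reference, a minimal uimaFIT sketch of this workaround, assuming the DKPro
Core 1.6.x parameter names (PARAM_FORMAT, and PARAM_TARGET_LOCATION from the
shared writer base class); the output path is hypothetical:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.resource.ResourceInitializationException;

    import de.tudarmstadt.ukp.dkpro.core.io.bincas.BinaryCasWriter;

    public class Format0WriterSketch {
        public static AnalysisEngineDescription writer() throws ResourceInitializationException {
            // Format "0" is the plain UIMA binary format without an embedded
            // type system; it bypasses the BinaryCasSerDes6 code path that
            // appears to leak memory.
            return createEngineDescription(
                    BinaryCasWriter.class,
                    BinaryCasWriter.PARAM_FORMAT, "0",
                    BinaryCasWriter.PARAM_TARGET_LOCATION, "target/bincas");
        }
    }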

Original comment by daxenber...@gmail.com on 13 Dec 2013 at 1:13

GoogleCodeExporter commented 9 years ago
Re-opening this task.
Starting with DKPro Core 1.7.0 and the new UIMA version, format "0" will not work.
Here is Richard's analysis:

Ok, here is an (incomplete) explanation:

The data is written in the PreprocessTask using BinaryCasWriter in format 0
(which does not include type system information). This requires that, when the
data is read again, the CAS has been initialized with exactly the same type
system as at the time of writing.
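
To make the constraint concrete, here is a minimal round trip through UIMA's
plain binary serialization. Serialization.serializeCAS/deserializeCAS are the
UIMA core methods which, to my understanding, back format 0; the JCasFactory
setup is only there to keep the illustration self-contained:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.impl.Serialization;
    import org.apache.uima.fit.factory.JCasFactory;

    public class Format0RoundTrip {
        public static void main(String[] args) throws Exception {
            // Write side: only the feature structures are stored, not the
            // type system.
            CAS source = JCasFactory.createJCas().getCas();
            source.setDocumentText("example");
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            Serialization.serializeCAS(source, bos);

            // Read side: the target CAS must have been created from exactly
            // the same type system as the source, otherwise deserialization
            // fails or corrupts the data.
            CAS target = JCasFactory.createJCas().getCas();
            Serialization.deserializeCAS(target,
                    new ByteArrayInputStream(bos.toByteArray()));
            System.out.println(target.getDocumentText()); // prints "example"
        }
    }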

The data is read again in the MetaInfoTask / ExtractFeaturesTask. These tasks 
use different components in their pipelines than the PreprocessTask. Yet, for 
some reason, with DKPro Core 1.6.0, the type system induced by these components 
is the same as in the PreprocessTask, but not when using DKPro Core 
1.7.0-SNAPSHOT. Thus, with 1.7.0-SNAPSHOT, data written by the PreprocessTask 
cannot be read by the other tasks and causes this exception.

A workaround is to use format 6+ (includes type system) instead of format 0 in 
the PreprocessTask. I tried it and it worked. I remember that memory issues had 
been reported with format 6+, but it may be worth trying to track these down 
instead of sticking to the fragile setup that uses format 0.
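
Under the same assumptions as the format "0" sketch above, this workaround is
a single parameter change:

    // Same imports and setup as the format "0" sketch. Format "6+" embeds
    // the type system in each file, so the reading pipeline no longer has
    // to reproduce the writer's type system exactly.
    AnalysisEngineDescription writer = createEngineDescription(
            BinaryCasWriter.class,
            BinaryCasWriter.PARAM_FORMAT, "6+",
            BinaryCasWriter.PARAM_TARGET_LOCATION, "target/bincas");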

Another workaround could be to write the data using the SerializedCasWriter in
the PreprocessTask - it also preserves the type system but produces larger
files. I tried this, but I ended up with 0 values in the folds - probably
because SerializedCasWriter uses different file naming conventions than
BinaryCasWriter.
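
A sketch of this alternative, again with assumed parameter names and a
hypothetical path. SerializedCasWriter lives in the same bincas module, and
its default filename extension differs from BinaryCasWriter's, which would
explain a downstream reader pattern matching no files:

    import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;

    import org.apache.uima.analysis_engine.AnalysisEngineDescription;
    import org.apache.uima.resource.ResourceInitializationException;

    import de.tudarmstadt.ukp.dkpro.core.io.bincas.SerializedCasWriter;

    public class SerializedWriterSketch {
        public static AnalysisEngineDescription writer() throws ResourceInitializationException {
            // Java serialization preserves the full type system per file at
            // the cost of larger output; the downstream reader's include
            // pattern must match this writer's filename extension.
            return createEngineDescription(
                    SerializedCasWriter.class,
                    SerializedCasWriter.PARAM_TARGET_LOCATION, "target/sercas");
        }
    }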

Original comment by torsten....@gmail.com on 15 Apr 2014 at 7:21

GoogleCodeExporter commented 9 years ago
Torsten and I did a heap dump analysis. I believe I have located the problem
and have opened an issue for it in the Apache Jira:

https://issues.apache.org/jira/browse/UIMA-3747

Original comment by richard.eckart on 15 Apr 2014 at 8:49

GoogleCodeExporter commented 9 years ago
This issue has been fixed in a recent snapshot of UIMA 2.6.0 and will be
incorporated in the next UIMA release.

Original comment by torsten....@gmail.com on 25 Apr 2014 at 9:29

GoogleCodeExporter commented 9 years ago
For those interested: the UIMA 2.6.0 release process has already started. There
is an issue with the first release candidate which will hopefully be resolved
before the release (it breaks at least the uimaFIT CpeBuilder in some cases and
may break much more for us).

Original comment by richard.eckart on 25 Apr 2014 at 9:32