The problem is that TextReader puts the entire content of the file into a single String. No matter how large your heap is, you cannot create a String that big: a Java String is limited to Integer.MAX_VALUE characters. I suggest throwing a proper exception with a nicer message when dealing with such big files.
Perhaps we could have a reader like "BigTextReader" to deal with these cases...
Original comment by pedrobss...@gmail.com on 26 Feb 2014 at 10:48
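A minimal sketch of the limit in question (plain Java, not DKPro Core code; the file path is a placeholder):

```java
// A String is backed by an array indexed by int, so it can hold at most
// Integer.MAX_VALUE characters no matter how much heap (-Xmx) is available.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StringLimitDemo {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("huge-corpus.txt"); // placeholder path
        long size = Files.size(file);
        if (size > Integer.MAX_VALUE) {
            // The kind of early, explicit failure suggested above, instead of
            // an opaque OutOfMemoryError deep inside the reader.
            throw new IOException("File is " + size
                    + " bytes and cannot fit into a single Java String");
        }
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        System.out.println("Read " + text.length() + " characters");
    }
}
```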
Well, I see nothing wrong with this message. When you read such a large file, this is absolutely to be expected.
Mind that "-Xss" affects the stack size, not the heap. That might be required for deeply recursive algorithms, but not for large strings.
To avoid this problem, either the file would need to be split beforehand, or a reader would need to be used that knows how to split the file into sensible portions. I do not think that "BigTextReader" captures this sufficiently. How do you imagine the reader should know how to split the file(s)?
So the way to "fix" this would be to use a 64-bit VM and add more heap memory (maybe 16g?) ;) I do not think this is a DKPro Core defect.
Original comment by richard.eckart on 26 Feb 2014 at 11:02
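A quick illustration of the two flags being distinguished here, as a standalone sketch: -Xss sizes each thread's stack (deep recursion), while -Xmx sizes the heap (large allocations such as long strings).

```java
// Try e.g.: java -Xss512k -Xmx1g StackVsHeap
public class StackVsHeap {
    static long depth = 0;

    static void recurse() {
        depth++;
        recurse(); // exhausts the stack; bounded by -Xss
    }

    public static void main(String[] args) {
        try {
            recurse();
        } catch (StackOverflowError e) {
            System.out.println("Stack overflow at depth " + depth + " (-Xss)");
        }
        // Heap allocation is bounded by -Xmx (and by Integer.MAX_VALUE
        // elements per array); too small a heap yields OutOfMemoryError.
        char[] big = new char[100_000_000]; // ~200 MB of char data
        System.out.println("Allocated " + big.length + " chars on the heap (-Xmx)");
    }
}
```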
Ah, yep. That's another one, but I do not think that problem is even hit here.
Also, this would again not be a DKPro problem, since the UIMA CAS uses a single Java String to represent the document text, which does not work for such large documents.
See also this thread: http://markmail.org/thread/55vmyfiecdciealx
Original comment by richard.eckart on 26 Feb 2014 at 11:10
That is why I used 'Perhaps' ;-)
A configuration parameter could be used to set how to split the file, but as you mentioned, in the end the UIMA CAS uses a single Java String to represent the document text, so it does not work.
Regarding the message: it might be obvious to you and to me, but not to everyone; otherwise, this thread would not be happening here ;-)
Original comment by pedrobss...@gmail.com on 26 Feb 2014 at 11:24
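For illustration only, roughly what the splitting idea could look like if each portion were handed to downstream processing on its own instead of as one giant String. The class name, the chunk size, and the path are all made up; this is not part of DKPro Core:

```java
// Read a large file in line-aligned chunks so each portion stays a
// sensible unit (e.g. one CAS per chunk) instead of one giant String.
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ChunkedTextReader {
    private static final int MAX_CHUNK_CHARS = 10_000_000; // placeholder limit

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("huge-corpus.txt"), StandardCharsets.UTF_8)) {
            StringBuilder chunk = new StringBuilder();
            String line;
            int index = 0;
            while ((line = in.readLine()) != null) {
                chunk.append(line).append('\n');
                if (chunk.length() >= MAX_CHUNK_CHARS) {
                    emit(chunk.toString(), index++);
                    chunk.setLength(0);
                }
            }
            if (chunk.length() > 0) {
                emit(chunk.toString(), index);
            }
        }
    }

    // Stand-in for handing one portion to downstream processing.
    private static void emit(String text, int index) {
        System.out.printf("portion %d: %d chars%n", index, text.length());
    }
}
```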
If the reader knew how to split a file, then the problem of hitting the CAS limit might not even occur.
Regarding the message: sure, we can try to add a sanity check, such as failing with a different message if a file is larger than X bytes. But consider that we should then probably do this everywhere, not only in the TextReader. In some cases, when a reader operates on a stream, it might not even be possible to determine the size. There are many, many chances of getting out-of-memory errors; we cannot handle them all. I see your point, but I think a thread like this is sometimes a better way to deal with such errors than trying to handle certain kinds of errors in code.
Original comment by richard.eckart on 26 Feb 2014 at 11:33
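As a rough sketch of the sanity check mentioned above (hypothetical, not actual DKPro Core code): when the size is knowable, a reader could fail early with a clearer message, while for a pure stream the size may simply be unknown.

```java
// Fail early with a clear message when the file size is known to be too big.
import java.io.File;
import java.io.IOException;

public class SizeSanityCheck {
    // Hard upper bound: even at one char per byte, anything above the
    // Integer.MAX_VALUE character limit cannot fit into a single String.
    private static final long MAX_SANE_BYTES = Integer.MAX_VALUE;

    static void checkReadable(File file) throws IOException {
        long size = file.length(); // returns 0 when the size is unknown
        if (size > MAX_SANE_BYTES) {
            throw new IOException("File [" + file + "] is " + size
                    + " bytes and will not fit into a single document text;"
                    + " consider splitting it before processing.");
        }
    }
}
```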
Thanks a lot for the feedback - I was simply not aware of how this is handled internally.
Of course I can split the file myself before processing. (BTW, this corpus happened to be provided in our internal data repository.)
So from my side, the issue can be closed.
Original comment by eckle.kohler on 26 Feb 2014 at 11:37
Original comment by richard.eckart on 26 Feb 2014 at 1:14
Original issue reported on code.google.com by eckle.kohler on 26 Feb 2014 at 9:44