Aborting OutOfMemoryError caused by TwelveMonkeys TIF reader plugin

lfcnassif commented 9 months ago

Reported by an user, enableExternalParsing = true didn't work around the issue. After taking a look at the heap dump sent by her, 1 MemoryCache object used ~23GB of heap, and it was created by TwelveMonkeys TIFFImageReader:

Thread stack:

pool-3-thread-189
  at java.nio.channels.Channels$ReadableByteChannelImpl.read(Ljava/nio/ByteBuffer;)I (Unknown Source)
  at com.twelvemonkeys.imageio.stream.MemoryCache.readBlock([B)I (MemoryCache.java:77)
  at com.twelvemonkeys.imageio.stream.MemoryCache.fetchBlock()[B (MemoryCache.java:60)
  at com.twelvemonkeys.imageio.stream.MemoryCache.read(Ljava/nio/ByteBuffer;)I (MemoryCache.java:99)
  at com.twelvemonkeys.imageio.stream.BufferedChannelImageInputStream.fillBuffer()Z (BufferedChannelImageInputStream.java:156)
  at com.twelvemonkeys.imageio.stream.BufferedChannelImageInputStream.read()I (BufferedChannelImageInputStream.java:177)
  at com.twelvemonkeys.imageio.metadata.tiff.TIFFReader.isValidOffset(Ljavax/imageio/stream/ImageInputStream;J)Z (TIFFReader.java:368)
  at com.twelvemonkeys.imageio.metadata.tiff.TIFFReader.readLinkedIFDs(Ljavax/imageio/stream/ImageInputStream;)Lcom/twelvemonkeys/imageio/metadata/tiff/TIFFDirectory; (TIFFReader.java:143)
  at com.twelvemonkeys.imageio.metadata.tiff.TIFFReader.read(Ljavax/imageio/stream/ImageInputStream;)Lcom/twelvemonkeys/imageio/metadata/Directory; (TIFFReader.java:132)
  at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.readMetadata()V (TIFFImageReader.java:197)
  at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.readIFD(I)V (TIFFImageReader.java:390)
  at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageReader.getWidth(I)I (TIFFImageReader.java:439)
  at iped.utils.ImageUtil.doGetSubSampledImage(Ljava/lang/Object;IILiped/utils/ImageUtil$BooleanWrapper;Ljava/lang/String;)Ljava/awt/image/BufferedImage; (ImageUtil.java:185)
  at iped.utils.ImageUtil.getSubSampledImage(Ljava/io/InputStream;IILiped/utils/ImageUtil$BooleanWrapper;Ljava/lang/String;)Ljava/awt/image/BufferedImage; (ImageUtil.java:148)
  at iped.engine.task.ImageThumbTask.createImageThumb(Liped/data/IItem;Ljava/io/File;)V (ImageThumbTask.java:295)
  at iped.engine.task.ImageThumbTask$ThumbCreator.run()V (ImageThumbTask.java:270)
  at java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Unknown Source)
  at java.util.concurrent.FutureTask.run()V (Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (Unknown Source)
  at java.lang.Thread.run()V (Unknown Source)

wladimirleite commented 9 months ago

The user was using IPED 4.1.x, right? If that was the case, processing with the master branch would be a valid attempt, as the library was upgraded in #1966. Not sure if it is possible, but it would be great to have the triggering TIF.

lfcnassif commented 9 months ago

Yes, she used 4.1.5. I'll try to get the triggering TIF, reproduce and test with master. Thanks @wladimirleite!

PS: another user got OOME with a 80GB heap, I suspect the issue may be the same, still investigating with him...

lfcnassif commented 9 months ago

Yes, she used 4.1.5. I'll try to get the triggering TIF, reproduce and test with master. Thanks @wladimirleite!

By the log, a 37GB!!! TIF file was being processed when the OOME was thrown. Just asked the file to her.

PS: another user got OOME with a 80GB heap, I suspect the issue may be the same, still investigating with him...

No reference to TIF files found in the log sent by the second user that reported OOME.

lfcnassif commented 9 months ago

By the log, a 37GB!!! TIF file was being processed when the OOME was thrown. Just asked the file to her.

I got the file, reproduced with both 4.1.4 and master using 10GB heap. Excluding 12 monkeys TIF imageio plugin fixed the OOME.

wladimirleite commented 9 months ago

I got the file, reproduced with both 4.1.4 and master using 10GB heap. Excluding 12 monkeys TIF imageio pluging fixed the OOME.

@lfcnassif, you can send me the TIF if you like. I have the TwelveMonkeys code/environment set up for testing, so I can try to narrow down where exactly the problem is, and maybe find a workaround or report them an issue.

lfcnassif commented 9 months ago

@lfcnassif, you can send me the TIF if you like. I have the TwelveMonkeys code/environment set up for testing, so I can try to narrow down where exactly the problem is, and maybe find a workaround or report them an issue.

Thanks @wladimirleite! I can send you the file when you return back to work after new year. Obviously it's a corrupted TIF. Just one question: we added 12 monkeys TIF plug-in because it gives better results on well formed TIFs than Java 11 default decoder, right?

wladimirleite commented 9 months ago

Thanks @wladimirleite! I can send you the file when you return back to work after new year.

Sure! I will be back to work on Wednesday.

Obviously it's a corrupted TIF. Just one question: we added 12 monkeys TIF plug-in because it gives better results on well formed TIFs than Java 11 default decoder, right?

In #1530 there is a long discussion and test results about the usage of this plug-in.

lfcnassif commented 9 months ago

In #1530 there is a long discussion and test results about the usage of this plug-in.

Thanks! My memory always tricks me :)

wladimirleite commented 9 months ago

Thanks! My memory always tricks me :)

Mine too! I had to look up :-)

wladimirleite commented 9 months ago

I think I found a workaround (in IPED). I will finish the investigation and update here in the next days.

lfcnassif commented 9 months ago

I remember we enabled ImageIO cache on memory static setting by default, maybe it's related to this issue.

wladimirleite commented 9 months ago

I remember we enabled ImageIO cache on memory static setting by default, maybe it's related to this issue.

It is related! I will post an overview of what is causing the issue, once I fully understand. I think am almost there :-)

wladimirleite commented 7 months ago

Well, let me try to explain what I found out after a lot of tests and code analysis...

I believe the issue is not directly related to TwelveMonkeys TIFFImageReader. The default plugin (com.sun.imageio.plugins.tiff.TIFFImageReader) "rejects" (does not support) this large file. So it will use TwelveMonkeys plugin even in TiffPageParser (where the default plugin is preferred). When we use ImageIO.setUseCache(false), no disk-based cache is used (only memory). That provided a major speed up on image related tasks. On the other hand, when reading from an InputStream, the TIFF image reader needs to keep the already ready content somewhere (in this case memory, as disk cache is disabled), because it may need to read bytes already consumed (like going backwards in the stream). In the case of this large file, the offset of required metadata to decode it is close to its end, so it will try to read it all and store it in the memory. When it reached ~24GB, an OOME was thrown.

ImageInputStream iis = ImageIO.createImageInputStream(source);

In the line above, if source is a File, it will go directly (seek) to the metadata offset, read it, without having to cache anything in memory. If source is a FileInputStream, it will have the same behavior, as the ImageInputStream created is seekable. However, if source is a TikaInputStream, the ImageIO classes do not know how to deal with it, and it treats it as a generic InputStream. In such a case, a cache will be used. With ImageIO.setUseCache(false) it would use memory, while with ImageIO.setUseCache(true) a temporary file would be created.

I think we can improve the current behavior (which always passes a generic InputStream as source). So whenever we already have a file, use the file. If the Item data is loaded in a byte[] (which I guess is the case for smaller files), we can use a ByteArrayImageInputStream. If we don't have a file, for image files with more than N bytes (e.g. 64M), it is better to create a temporary file, to avoid excessive memory consumption.

My idea is to implement a method like item.createImageInputStream(), so this logic does not need to be replicated in different places that may need to create an ImageInputStream of an item.

lfcnassif commented 7 months ago

Great explanation @wladimirleite, thank you!

I think we can improve the current behavior (which always passes a generic InputStream as source). So whenever we already have a file, use the file. If the Item data is loaded in a byte[] (which I guess is the case for smaller files), we can use a ByteArrayImageInputStream. If we don't have a file, for image files with more than N bytes (e.g. 64M), it is better to create a temporary file, to avoid excessive memory consumption.

I agree with this general approach! But I'm not sure about the best API change to put it.

wladimirleite commented 7 months ago

I agree with this general approach! But I'm not sure about the best API change to put it.

Yes, this is very tricky because there are different usages across the application. I am also unsure about that, but I already made some code changes. I still need to test it.... Let me finish the tests, then I will create a PR and you can decide where is the best way to make the changes. Moving the code shouldn't be too hard.

wladimirleite commented 6 months ago

@lfcnassif, after starting and stopping to work on this issue several times, I finally finished the implementation I had in mind. I am running a set of tests and will analyze the results carefully to avoid any regression before submitting a PR, but so far tests seem fine and the code changes were limited to specific points.

The implemented idea was more or less described before, but the basic idea is to get a ImageInputStream from a IItemReader with following priorities:

ByteArrayImageInputStream, if the item is cached (data is available);
FileImageInputStream, if item has a temporary file or has a file;
MemoryCacheImageInputStream, if item's length is less than 128 MB;
Otherwise (very large files) a temporary file is created to avoid memory issues, and then FileImageInputStream is used.

Currently, in most situations we use ImageIO.createImageInputStream(InputStream), which will create a memory cache for the InputStream even if it is already in memory. The new approach should save some memory during evidence processing, and avoid OOME for very large files (like the TIF that motivated this issue). It is also more deterministic as ImageIO.createImageInputStream() sometimes creates TwelveMonkeys classes, while other times uses Java classes, depending on the input.

lfcnassif commented 6 months ago

Great, thank you very much @wladimirleite!

sepinf-inc / IPED

Aborting OutOfMemoryError caused by TwelveMonkeys TIF reader plugin #2033