Add option to ignore tags having greater than some threshold of bytes to avoid OutOfMemoryErrors on large files

GoogleCodeExporter commented 9 years ago

I'm getting an out of memory error while processing a large TIFF file (a 
little less than 500MB). Is metadata-extractor designed to handle large files 
like this?

Here is the stack trace:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
  at com.drew.lang.RandomAccessFileReader.getBytes(Unknown Source)
  at com.drew.metadata.exif.ExifReader.processTag(Unknown Source)
  at com.drew.metadata.exif.ExifReader.processDirectory(Unknown Source)
  at com.drew.metadata.exif.ExifReader.extractIFD(Unknown Source)
  at com.drew.metadata.exif.ExifReader.extractTiff(Unknown Source)
  at com.drew.imaging.tiff.TiffMetadataReader.readMetadata(Unknown Source)
  at com.drew.imaging.ImageMetadataReader.readMetadata(Unknown Source)
  at com.drew.imaging.ImageMetadataReader.readMetadata(Unknown Source)
  at digitalfusion.util.TestMetadataReader.main(TestMetadataReader.java:19)

The method calling the above just looks like this:

public static void main(String[] args) throws Exception {
  Metadata metadata = ImageMetadataReader.readMetadata(new File("/path/to/myfile.tif"));
}

I'm using version 2.6.2 with Java 1.6 on OS X. The problem also happens on 
our Ubuntu servers (also Java 1.6).

Please contact me if you would like a copy of the file for testing.

Thanks,
Michael

Original issue reported on code.google.com by michaelr...@gmail.com on 25 Oct 2012 at 6:04

GoogleCodeExporter commented 9 years ago

Hi Michael,

Thanks for your bug report. Yes, metadata-extractor should be fine with these 
large TIFF files.

Do you have an example image you could send me using 
https://www.wetransfer.com/ or a similar service? Please make sure you have 
permission to release this image.

My guess at this point is that there's a tag that says it contains a very large 
number of bytes, and that the library is faithfully trying to allocate a buffer 
for that data. But without a sample image, it's hard to say for sure.
It would also be useful to know what your JVM settings are.

Also, if you have a few stack traces, can you let me know if they're always 
identical?
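
One quick way to report the relevant JVM settings is to print the maximum heap 
size and the JVM's startup arguments from inside the failing process. This is a 
minimal sketch using only standard JDK calls; the class name HeapReport is just 
for illustration:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class HeapReport {
    /** Maximum heap the JVM will attempt to use, in bytes (reflects -Xmx if set). */
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    /** Arguments passed to the JVM itself, such as -Xmx or -Xms. */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapBytes() / (1024 * 1024));
        System.out.println("JVM arguments: " + jvmArguments());
    }
}
```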

Thanks.

Original comment by drewnoakes on 25 Oct 2012 at 6:36

Changed state: Accepted
Added labels: Component-TIFF, Milestone-2.6.2, Motive-Correctness, Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Original comment by drewnoakes on 25 Oct 2012 at 6:37

Added labels: Milestone-2.6.3
Removed labels: Milestone-2.6.2

GoogleCodeExporter commented 9 years ago

Great - glad to hear it's expected to work with large files. I'm sending the 
file to you now, through wetransfer.

The stack traces I've seen have always been the same, occurring in 
RandomAccessFileReader.getBytes()

Original comment by michaelr...@gmail.com on 25 Oct 2012 at 6:47

GoogleCodeExporter commented 9 years ago

Hi Michael,

Thanks for the sample image. It seems my suspicion was correct -- there's a 
single tag of large size that's causing your OOM exception:

[Exif IFD0] Unknown tag (0x935c) = [443064764 bytes]

I don't know what kind of tag would hold more than 400MB of data. 
Perhaps it's actually the raw image data.

My default JVM settings were able to process the file successfully, although 
the library allocated that much RAM for the file's contents, which is wasteful 
when you don't want this data anyway.

I propose adding a setting somewhere to optionally ignore tag values 
above a certain size. This will allow the library to run more 
efficiently on large image files in memory-constrained environments.

Let me have a think about how best to integrate this change to the API. I'll 
report back here once it's done.

In the meantime, you can modify the source code yourself. A hacky workaround 
would be to insert a check at line 434 of ExifReader.java, something like 
this:

    private void processTag(@NotNull Directory directory, int tagType, int tagValueOffset, int componentCount, int formatCode, @NotNull final BufferReader reader) throws BufferBoundsException
    {
        if (componentCount > 1000*1000)
            return;
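
Note that componentCount counts components rather than bytes, so a stricter 
check would multiply it by the size of each component. Below is a rough sketch 
of that idea, assuming the standard TIFF format codes (1 to 12). The class name 
TagSizeGuard, its methods, and the 1MB threshold are hypothetical and not part 
of the library, and the format code used for the 0x935c tag is a guess:

```java
final class TagSizeGuard {
    // Bytes per component for TIFF format codes 1..12 (index 0 is unused).
    private static final int[] BYTES_PER_FORMAT = {0, 1, 1, 2, 4, 8, 1, 1, 2, 4, 8, 4, 8};

    /** Total size of a tag's value in bytes, or -1 for an unknown format code or negative count. */
    static long byteCount(int formatCode, int componentCount) {
        if (formatCode < 1 || formatCode >= BYTES_PER_FORMAT.length || componentCount < 0)
            return -1;
        // Widen to long before multiplying so large counts cannot overflow an int.
        return (long) componentCount * BYTES_PER_FORMAT[formatCode];
    }

    /** True when the tag's value is larger than maxBytes; unknown formats are never skipped here. */
    static boolean shouldSkip(int formatCode, int componentCount, long maxBytes) {
        return byteCount(formatCode, componentCount) > maxBytes;
    }

    public static void main(String[] args) {
        long maxBytes = 1000 * 1000; // hypothetical threshold
        // The 0x935c tag from the sample file: 443064764 bytes, assumed format 7 (undefined).
        System.out.println(shouldSkip(7, 443064764, maxBytes));
        // A small tag of two 16-bit values.
        System.out.println(shouldSkip(3, 2, maxBytes));
    }
}
```

With a helper like this, the early return in processTag could become 
`if (TagSizeGuard.shouldSkip(formatCode, componentCount, 1000 * 1000)) return;`.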

Original comment by drewnoakes on 27 Oct 2012 at 11:51

Changed state: Started

GoogleCodeExporter commented 9 years ago

Original comment by drewnoakes on 28 Oct 2012 at 12:26

Changed title: Add option to ignore tags having greater than some threshold of bytes to avoid OutOfMemoryErrors on large files
Added labels: Component-Exif

GoogleCodeExporter commented 9 years ago

Given that this would require an API change, I have opted to push this to 2.7.0.

Original comment by drewnoakes on 28 Oct 2012 at 4:40

Added labels: Milestone-2.7.0
Removed labels: Milestone-2.6.3

GoogleCodeExporter commented 9 years ago

Thanks so much - especially for the quick turnaround.

Original comment by michaelr...@gmail.com on 28 Oct 2012 at 10:22

GoogleCodeExporter commented 9 years ago

This issue has been migrated along with the project to GitHub:

https://github.com/drewnoakes/metadata-extractor/issues/5

Original comment by drewnoakes on 19 Nov 2014 at 12:34
