glencoesoftware / bioformats2raw

Bio-Formats image file format to raw format converter
GNU General Public License v2.0

java.lang.NegativeArraySizeException with large tiff input with zip compression #217

Closed blowekamp closed 4 months ago

blowekamp commented 1 year ago

We are encountering the following error when running bioformats2raw:

2023-09-20 06:33:17,038 [pool-1-thread-1] ERROR c.g.bioformats2raw.Converter - Failure processing chunk; resolution=0 plane=1 xx=4096 yy=0 zz=0 width=1024 height=1024 depth=1
java.lang.NegativeArraySizeException: null
    at ome.codecs.ByteVector.doubleCapacity(ByteVector.java:86)
    at ome.codecs.ByteVector.add(ByteVector.java:75)
    at ome.codecs.ZlibCodec.decompress(ZlibCodec.java:81)
    at ome.codecs.BaseCodec.decompress(BaseCodec.java:194)
    at loci.formats.codec.WrappedCodec.decompress(WrappedCodec.java:86)
    at loci.formats.codec.ZlibCodec.decompress(ZlibCodec.java:48)
    at loci.formats.tiff.TiffCompression.decompress(TiffCompression.java:283)
    at loci.formats.tiff.TiffParser.getTile(TiffParser.java:831)
    at loci.formats.tiff.TiffParser.getSamples(TiffParser.java:1116)
    at loci.formats.tiff.TiffParser.getSamples(TiffParser.java:871)
    at loci.formats.in.MinimalTiffReader.openBytes(MinimalTiffReader.java:312)
    at loci.formats.in.TiffDelegateReader.openBytes(TiffDelegateReader.java:71)
    at loci.formats.FormatReader.openBytes(FormatReader.java:922)
    at loci.formats.ReaderWrapper.openBytes(ReaderWrapper.java:334)
    at loci.formats.ChannelSeparator.openBytes(ChannelSeparator.java:200)
    at loci.formats.ReaderWrapper.openBytes(ReaderWrapper.java:348)
    at loci.formats.MinMaxCalculator.openBytes(MinMaxCalculator.java:269)
    at loci.formats.MinMaxCalculator.openBytes(MinMaxCalculator.java:260)
    at com.glencoesoftware.bioformats2raw.Converter.getTile(Converter.java:1690)
    at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:1802)
    at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$4(Converter.java:2004)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

I am suspicious of an integer overflow issue.
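
A minimal sketch of the kind of overflow suspected here, assuming the buffer grows by doubling an `int` capacity. This is not the ome-codecs source; the class and method names are hypothetical. It only shows that doubling a capacity of 2^30 wraps to a negative `int`, and that allocating an array of that size throws the exception seen in the trace.

```java
public class CapacityOverflowSketch {
    /** Hypothetical stand-in for an int-based "double the buffer" step. */
    static int doubledCapacity(int currentCapacity) {
        // int multiplication wraps silently once currentCapacity >= 2^30
        return currentCapacity * 2;
    }

    public static void main(String[] args) {
        int capacity = 1 << 30;               // buffer already holds 1 GiB
        int next = doubledCapacity(capacity); // wraps to a negative value
        System.out.println("next capacity = " + next);
        try {
            byte[] grown = new byte[next];    // negative length is rejected
            System.out.println("allocated " + grown.length + " bytes");
        } catch (NegativeArraySizeException e) {
            System.out.println("caught " + e.getClass().getName());
        }
    }
}
```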

A sample input that reproduces this error can be created with the following ImageMagick "convert" command line:

convert magick:logo -resize 19477x30872 -depth 16 -compress zip logo_zip.tiff

The compression and bit depth options appear to be required to reproduce this error.

Our pipeline converts a png to tiff before running bioformats2raw. We have found that adding -define tiff:tile-geometry=128x128 to the convert command not only bypasses the above bug but also improves the performance ~60x. Even the slower, untiled tiff was still faster than directly processing the original large png.

melissalinkert commented 1 year ago

Thanks for reporting this, @blowekamp. The stack trace indicates that the problem is in the ome-codecs library, which bioformats2raw uses via Bio-Formats. A corresponding issue in ome-codecs is now open: https://github.com/ome/ome-codecs/issues/32. We can't really fix it here, but will need to update the Bio-Formats version in bioformats2raw once a fix is available and released.

> A sample input that reproduces this error can be created with the following ImageMagick "convert" command line:
>
> convert magick:logo -resize 19477x30872 -depth 16 -compress zip logo_zip.tiff

That means that ~1.2 GB of pixels are being compressed as a single tile. We really don't recommend doing that in general; this is also effectively what large PNGs have already, so it is not expected to help very much as an intermediate conversion step.
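
The ~1.2 GB figure can be checked with a little arithmetic. The sketch below assumes the output really has the 19477x30872 dimensions given in the command (ImageMagick may preserve the aspect ratio and produce something different) and 16-bit samples. The class and method names are hypothetical. With one sample per pixel the uncompressed size is already above 2^30 bytes; with three interleaved samples it would no longer fit in a Java `int` at all.

```java
public class TileSizeSketch {
    /** Uncompressed size in bytes, computed in long arithmetic so it cannot wrap. */
    static long uncompressedBytes(long width, long height,
                                  int bytesPerSample, int samplesPerPixel) {
        return width * height * bytesPerSample * samplesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed: 16-bit data (2 bytes per sample), one sample per pixel.
        long oneSample = uncompressedBytes(19477, 30872, 2, 1);
        System.out.println(oneSample);                     // 1202587888
        System.out.println(oneSample > (1L << 30));        // true
        System.out.println(oneSample > Integer.MAX_VALUE); // false

        // Assumed: three interleaved samples (RGB) in the same tile.
        long rgb = uncompressedBytes(19477, 30872, 2, 3);
        System.out.println(rgb > Integer.MAX_VALUE);       // true
    }
}
```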

> We have found that adding -define tiff:tile-geometry=128x128 to the convert command not only bypasses the above bug but also improves the performance ~60x.

That's definitely expected. In the case where the whole input image is compressed as a single tile, that entire tile must be read and decompressed each time bioformats2raw reads a tile for conversion. Input images that use multiple smaller tiles are expected to perform better overall as individual tiles can be read as needed.
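
As a rough illustration of why tiling helps, the sketch below counts how many chunk requests cover the image and how many stored tiles each request touches. It assumes the 1024x1024 chunk size shown in the error log, the 19477x30872 dimensions from the convert command, and tiles laid out on a regular grid. The class and method names are hypothetical, and real readers may cache or decode differently.

```java
public class TileReadSketch {
    static long ceilDiv(long a, long b) {
        return (a + b - 1) / b;
    }

    /** Number of chunk requests needed to cover the whole image. */
    static long chunkCount(long imageW, long imageH, long chunkW, long chunkH) {
        return ceilDiv(imageW, chunkW) * ceilDiv(imageH, chunkH);
    }

    /** Number of grid tiles that a requested region overlaps. */
    static long tilesTouched(long x, long y, long w, long h, long tileW, long tileH) {
        long cols = (x + w - 1) / tileW - x / tileW + 1;
        long rows = (y + h - 1) / tileH - y / tileH + 1;
        return cols * rows;
    }

    public static void main(String[] args) {
        // Requests needed at full resolution with 1024x1024 chunks.
        System.out.println(chunkCount(19477, 30872, 1024, 1024));

        // Single stored tile: every request touches (and decompresses) the whole image.
        System.out.println(tilesTouched(4096, 0, 1024, 1024, 19477, 30872));

        // 128x128 stored tiles: the same request touches only small tiles.
        System.out.println(tilesTouched(4096, 0, 1024, 1024, 128, 128));
    }
}
```

Under these assumptions, the single-tile file decompresses the full image once per request, while the tiled file decompresses only the small tiles each request overlaps.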

blowekamp commented 1 year ago

@melissalinkert Thank you for forwarding the issue to the appropriate project, and for responding with tips.

Sorry to add on an additional issue here.

We are also converting some CZI files and the processing seems relatively slow. I presume this is for the same reason: large compressed chunk(s). Is there anything we can do to either preprocess the input, or just load and decompress it once, to improve performance?

melissalinkert commented 1 year ago

@blowekamp : one thing you might try is checking the optimal tile size reported by Bio-Formats for the .czi files. With showinf -nopix -noflat (included in Bio-Formats command line tools), look for a Tile size = line in the output. You might try setting the tile size in bioformats2raw to that reported tile size; it's not guaranteed, but that may reduce any repeated tile decompressions. Tiles as stored in .czi files often overlap, so requesting a fixed-size tile from the image can require multiple tiles to be read from the file and decompressed.
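
To illustrate the point about overlapping tiles, here is a small sketch that counts how many stored tiles intersect a requested region. The layout is entirely hypothetical (1024x1024 tiles placed every 922 pixels, roughly 10% overlap) and is not read from any real .czi file; the class and helper names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapSketch {
    /** Axis-aligned rectangle: top-left corner plus size, in pixels. */
    static final class Rect {
        final int x, y, w, h;

        Rect(int x, int y, int w, int h) {
            this.x = x; this.y = y; this.w = w; this.h = h;
        }

        boolean intersects(Rect o) {
            return x < o.x + o.w && o.x < x + w
                && y < o.y + o.h && o.y < y + h;
        }
    }

    /** Hypothetical layout: cols x rows square tiles of the given size, placed every `stride` pixels. */
    static List<Rect> grid(int cols, int rows, int stride, int size) {
        List<Rect> tiles = new ArrayList<>();
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                tiles.add(new Rect(col * stride, row * stride, size, size));
            }
        }
        return tiles;
    }

    /** Number of stored tiles that would have to be read for one request. */
    static int countIntersecting(List<Rect> stored, Rect request) {
        int count = 0;
        for (Rect tile : stored) {
            if (tile.intersects(request)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Rect> stored = grid(4, 4, 922, 1024);
        // A request at the image origin.
        System.out.println(countIntersecting(stored, new Rect(0, 0, 1024, 1024)));
        // A request that coincides exactly with one interior stored tile.
        System.out.println(countIntersecting(stored, new Rect(922, 922, 1024, 1024)));
    }
}
```

In this made-up layout, even a request that lines up exactly with a stored tile still intersects its neighbours, which is one reason matching the reported tile size is not guaranteed to remove repeated decompression.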

I'm not aware of a workflow to pre-process .czi files. If adjusting the tile size doesn't make a positive difference, we'd need more details before suggesting other options - the specific command being run, what kind of data is in the .czi file, the specifications of the system on which conversion is being run, and exact conversion times.

melissalinkert commented 4 months ago

Closing, as there aren't any clear next steps in this repository. Please feel free to re-open with additional details if conversions are still troublesome though.