haraldk / TwelveMonkeys

TwelveMonkeys ImageIO: Additional plug-ins and extensions for Java's ImageIO
https://haraldk.github.io/TwelveMonkeys/
BSD 3-Clause "New" or "Revised" License

TIFF with JPEG Compression not readable by Tesseract #540

Open keinhaar opened 4 years ago

keinhaar commented 4 years ago

If I create a multipage TIFF with JPEG compression, Tesseract cannot read it. It gives this error: "Error in pixReadFromTiffStream: bad tiff file: tiffbpl is too small". Other compressions like LZW or Deflate work just fine.

GIMP also gives an error, but still opens the TIFF. I'll try to translate the message, because my GIMP is set to German. Something like "Incompatible TIFF: Additional Channels without Field ExtraSamples".

My code looks like this

            // Look up the first registered TIFF writer.
            Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
            ImageWriter writer = writers.next();
            ImageOutputStream out = ImageIO.createImageOutputStream(target);
            writer.setOutput(out);
            ImageWriteParam param = writer.getDefaultWriteParam();
            param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
            param.setCompressionType("JPEG");
            writer.prepareWriteSequence(null);
            // Append each source file as one page of the multipage TIFF.
            for (int i = 0; i < tiffs.length; i++)
            {
                BufferedImage raster = ImageIO.read(tiffs[i]);
                param.setCompressionQuality(0.9f);
                IIOImage image = new IIOImage(raster, null, null);
                writer.writeToSequence(image, param);
            }
            writer.endWriteSequence();
            writer.dispose();
            out.close();

Is there something wrong with my code?

Schmidor commented 4 years ago

I might be wrong, but could that be something like your source image has 4 channels, RGB and alpha, and the writer has some issues with the alpha channel when writing JPEG compression?

keinhaar commented 4 years ago

No, I checked that. The BufferedImage type is TYPE_3BYTE_BGR.
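
For anyone wanting to repeat that check, here is a minimal sketch of how the layout of a BufferedImage could be inspected before writing. The class and method names (`AlphaCheck`, `describe`) are made up for illustration and are not part of the code in this issue.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not taken from the issue's code.
public class AlphaCheck {
    /** Summarizes the image type constant, band count and alpha flag. */
    static String describe(BufferedImage image) {
        return "type=" + image.getType()
                + ", bands=" + image.getRaster().getNumBands()
                + ", hasAlpha=" + image.getColorModel().hasAlpha();
    }

    public static void main(String[] args) {
        // Stand-in for the result of ImageIO.read(...) on a source page.
        BufferedImage page = new BufferedImage(4, 4, BufferedImage.TYPE_3BYTE_BGR);
        System.out.println(describe(page));
    }
}
```

A TYPE_3BYTE_BGR image reports three bands and no alpha, so an alpha channel in the source can be ruled out this way.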

Schmidor commented 4 years ago

Could you please attach some samples of the sources and a generated image? I would like to check them. If it's something on the level of the metadata / structure I might be able to help. If it's deeper in the JPEG compression you definitely have to wait for Harald :)

keinhaar commented 4 years ago

Here it is. Hopefully this will help to find the issue.

"img001.tif" and "img002.tif" are combined to "target.tiff" by the Java Class.

I try to use tesseract 4.1.1 on the target.tiff like this:

tesseract target.tiff out.pdf pdf

keinhaar commented 4 years ago

Sorry, the file was too large. Here it is again: sample.zip

Schmidor commented 4 years ago

The output file looks pretty normal. There's nothing unexpected there.

But I can say that GIMP throws this error for every TIFF with JPEG compression I could find. The images are read normally, so I have no idea what it could interpret as an extra sample. Is it possible Tesseract just doesn't support JPEG-compressed TIFFs?

keinhaar commented 4 years ago

I saved img001.tif with GIMP as a new TIFF file with JPEG compression, then reopened it. There is no warning, so this is not always the case. After that I tried Tesseract on that file, and it works without problems. So it seems that JPEG compression is supported by Tesseract.

haraldk commented 4 years ago

Thanks guys for looking into this!

@keinhaar Can you attach the same image (target.tiff, with both pages), but after re-saving with GIMP, so I can have a look at the differences?

I don't understand the error message from GIMP either, as the TIFF structure has BitsPerSample: [8,8,8], with PhotometricInterpretation: 6/YCbCr, and the JPEG stream has 8 bit precision, 3 components, standard naming for YCbCr (ids 1, 2 and 3).

Opens fine in all the tools I have available. But... There's always the chance that we have missed something.

-- Harald K

Schmidor commented 4 years ago

target-gimp.zip GIMP seems to save some ... interesting ... other things and 4 samples

keinhaar commented 4 years ago

I think Schmidor did it already, but to be safe... target_gimp.tiff.zip That file works without problems with Tesseract.

haraldk commented 4 years ago

Okay...

So GIMP is a bit more sophisticated than our writer, in that it writes JPEGTables and Strips (and stores a lot of "unnecessary" extra information, like document name, thumbnail, Exif and sRGB ICC profile).

But the main differences are that it uses photometric RGB and stores 4 components, where the extra sample is (associated aka premultiplied) alpha, even though the image is fully opaque. I don't know why GIMP does this, or why Tesseract likes this better though... Most software I have displays these images the same...

We could probably add some options to force RGB mode for JPEGs... And I think you should get 4 components with associated alpha with the reader as-is, if you use TYPE_INT_ARGB_PRE or TYPE_4BYTE_ABGR_PRE for your images.

(Side note: Despite all the extra information, the GIMP file is about half the size of ours... Probably due to higher JPEG compression, but might be worth looking into...)

-- Harald K

haraldk commented 4 years ago

Okay,

I think I found the bug in the GIMP code: file-tiff-load.c:262. It wrongly assumes (from the comment):

All other color space [than RGB] expect 1 channel (grayscale, palette, mask).

That is, it ignores YCbCr (like in our case), Separated (CMYK) and CIELab that have multiple channels...

It seems the only problem is the warning though; the files (as you mentioned) otherwise load just fine.

Update: Filed GIMP issue 5081.

-- Harald K

keinhaar commented 4 years ago

Thanks for these deep insights.

I tried to use another color model as mentioned, but it gives an error when writing the final TIFF. It seems the JPEGImageWriter uses some native library that does not support other color models. (I'm on Xubuntu Linux.)

Exception in thread "main" javax.imageio.IIOException: Invalid argument to native writeImage
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1067)
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:363)
    at com.twelvemonkeys.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:162)
    at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writePage(TIFFImageWriter.java:245)
    at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writeToSequence(TIFFImageWriter.java:954)
    at de.exware.scan.TiffTool.concatTiffs(TiffTool.java:52)
    at de.exware.scan.TiffTool.main(TiffTool.java:61)

haraldk commented 4 years ago

@keinhaar Thanks for trying that out. Maybe you could post your code as a failing test case, and I'll see if this is something that can be fixed?

And yes, ultimately, JPEG read/write is handled by native code, which for any Oracle JVM is a modified libJPEG AFAIK.

Usually, we can get around those issues by writing a raster instead of the full image, and just populating the metadata correctly ourselves (like I did for CMYK JPEG read/write).
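
To illustrate the general idea only (this is not the TwelveMonkeys implementation), a rough sketch of wrapping just the Raster in an IIOImage could look like the following. The class and method names (`RasterImageSketch`, `asRasterImage`, `supportsRasters`) are hypothetical, and building the matching metadata is left out entirely.

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;
import javax.imageio.IIOImage;
import javax.imageio.ImageWriter;
import javax.imageio.metadata.IIOMetadata;

// Hypothetical sketch; names are not from the library.
public class RasterImageSketch {
    /**
     * Wraps only the pixel data, without the ColorModel. The caller is
     * assumed to supply metadata that describes the color space, since
     * the writer can no longer derive it from the image.
     */
    static IIOImage asRasterImage(BufferedImage image, IIOMetadata metadata) {
        Raster raster = image.getRaster();
        return new IIOImage(raster, null, metadata);
    }

    /** Not every writer accepts raster-only input, so check first. */
    static boolean supportsRasters(ImageWriter writer) {
        return writer.canWriteRasters();
    }
}
```

An IIOImage built this way reports `hasRaster()` as true and has no RenderedImage, so it should only be passed to a writer for which `canWriteRasters()` returns true.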

-- Harald K

keinhaar commented 4 years ago

The code is still the same as in sample.zip. I just created a new BufferedImage of the type you requested and drew the original image into it with the g2d context.
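
Roughly, the conversion step described above could be sketched like this. It is a reconstruction from the description, not the exact code from sample.zip, and the names `PremultipliedCopy` and `toPremultipliedAbgr` are made up for illustration.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical reconstruction of the conversion step.
public class PremultipliedCopy {
    /** Copies the source into a new 4-band image with premultiplied alpha. */
    static BufferedImage toPremultipliedAbgr(BufferedImage source) {
        BufferedImage copy = new BufferedImage(
                source.getWidth(), source.getHeight(),
                BufferedImage.TYPE_4BYTE_ABGR_PRE);
        Graphics2D g2d = copy.createGraphics();
        try {
            // An opaque source yields fully opaque pixels in the copy.
            g2d.drawImage(source, 0, 0, null);
        } finally {
            g2d.dispose();
        }
        return copy;
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(4, 4, BufferedImage.TYPE_3BYTE_BGR);
        BufferedImage converted = toPremultipliedAbgr(page);
        System.out.println(converted.getRaster().getNumBands() + " bands");
    }
}
```

Passing the converted image to writeToSequence in place of the original is what produces the exception shown earlier.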