Open keinhaar opened 4 years ago
I might be wrong, but could it be that your source image has 4 channels (RGB and alpha), and the writer has issues with the alpha channel when writing with JPEG compression?
No, I checked that. The BufferedImage type is TYPE_3BYTE_BGR.
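For anyone wanting to repeat that check, here is a minimal sketch using only the standard java.awt.image API. The class and method names (ImageTypeCheck, hasAlphaChannel, bandCount) are made up for illustration and are not taken from sample.zip.

```java
import java.awt.image.BufferedImage;

public class ImageTypeCheck {

    // True if the image's color model carries an alpha channel.
    static boolean hasAlphaChannel(BufferedImage image) {
        return image.getColorModel().hasAlpha();
    }

    // Number of bands (samples per pixel) in the underlying raster.
    static int bandCount(BufferedImage image) {
        return image.getRaster().getNumBands();
    }

    public static void main(String[] args) {
        // Stand-in for the scanned page; a real check would load the image with ImageIO.read.
        BufferedImage page = new BufferedImage(4, 4, BufferedImage.TYPE_3BYTE_BGR);
        System.out.println("type is TYPE_3BYTE_BGR: " + (page.getType() == BufferedImage.TYPE_3BYTE_BGR));
        System.out.println("has alpha: " + hasAlphaChannel(page));
        System.out.println("bands: " + bandCount(page));
    }
}
```

An image of TYPE_3BYTE_BGR has three bands and no alpha, so an alpha channel in the source can be ruled out as the cause.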
Could you please attach some samples of the sources and a generated image? I would like to check them. If it's something on the level of the metadata / structure I might be able to help. If it's deeper in the JPEG compression you definitely have to wait for Harald :)
Here it is. Hopefully this will help to find the issue.
"img001.tif" and "img002.tif" are combined to "target.tiff" by the Java Class.
I try to use tesseract 4.1.1 on the target.tiff like this:
tesseract target.tiff out.pdf pdf
Sorry, the file was too large. Here again. sample.zip
The output file looks pretty normal. There's nothing unexpected there.
But I can say that Gimp throws this error for every TIFF with JPEG compression I could find. The images are read normally, so I have no idea what it could interpret as an extra sample. Is it possible tesseract just doesn't support JPEG compressed TIFFs?
I've saved img001.tif with GIMP as a new TIFF file with JPEG compression, then reopened it. There is no warning, so this is not always the case. After that I tried tesseract on that file, and it works without problems. So it seems that JPEG compression is supported by tesseract.
Thanks guys for looking into this!
@keinhaar Can you attach the same image (target.tiff, with both pages), but after re-saving with GIMP, so I can have a look at the differences?
I don't understand the error message from GIMP either, as the TIFF structure has BitsPerSample: [8,8,8], with PhotometricInterpretation: 6/YCbCr, and the JPEG stream has 8 bit precision, 3 components, standard naming for YCbCr (ids 1, 2 and 3).
Opens fine in all the tools I have available. But... There's always the chance that we have missed something.
-- Harald K
target-gimp.zip GIMP seems to save some ... interesting ... other things and 4 samples
I think Schmidor did it already, but to be safe... target_gimp.tiff.zip That file works without problems with Tesseract.
Okay...
So GIMP is a bit more sophisticated than our writer, in that it writes JPEGTables and Strips (and stores a lot of "unnecessary" extra information, like document name, thumbnail, Exif and sRGB ICC profile).
But the main differences are that it uses photometric RGB, and stores 4 components, where the extra sample is (associated aka premultiplied) alpha, even though the image is fully opaque. I don't know why GIMP does this, or why Tesseract likes this better though... Most software I have displays these images the same...
We could probably add some options to force RGB mode for JPEGs... And I think you should get 4 components with associated alpha with the reader as-is, if you use TYPE_INT_ARGB_PRE or TYPE_4BYTE_ABGR_PRE for your images.
(Side note: Despite all the extra information, the GIMP file is about half the size of ours... Probably due to higher JPEG compression, but might be worth looking into...)
-- Harald K
Okay,
I think I found the bug in the Gimp code: file-tiff-load.c:262. It wrongly assumes (from the comment):
All other color space [than RGB] expect 1 channel (grayscale, palette, mask).
That is, it ignores YCbCr (like in our case), Separated (CMYK) and CIELab that have multiple channels...
It seems the only problem is the warning though; the files (as you mentioned) otherwise load just fine.
Update: Filed GIMP issue 5081.
-- Harald K
Thanks for these deep insights.
I tried to use another color model as mentioned, but it gives an error when writing the final TIFF. It seems the JPEGImageWriter uses some native library that does not support other color models. (I'm on XUbuntu Linux.)
Exception in thread "main" javax.imageio.IIOException: Invalid argument to native writeImage
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeImage(Native Method)
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.writeOnThread(JPEGImageWriter.java:1067)
    at com.sun.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:363)
    at com.twelvemonkeys.imageio.plugins.jpeg.JPEGImageWriter.write(JPEGImageWriter.java:162)
    at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writePage(TIFFImageWriter.java:245)
    at com.twelvemonkeys.imageio.plugins.tiff.TIFFImageWriter.writeToSequence(TIFFImageWriter.java:954)
    at de.exware.scan.TiffTool.concatTiffs(TiffTool.java:52)
    at de.exware.scan.TiffTool.main(TiffTool.java:61)
@keinhaar Thanks for trying that out. Maybe you could post your code as a failing test case, and I'll see if this is something that can be fixed?
And yes, ultimately, JPEG read/write is handled by native code, which for any Oracle JVM is a modified libJPEG AFAIK.
Usually, we can get around those issues by writing a raster instead of the full image, and just populating the metadata correctly ourselves (like I did for CMYK JPEG read/write).
-- Harald K
The code is still the same as in sample.zip. I just created a new BufferedImage of the type you requested, and drew the original image into it with the Graphics2D context.
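A minimal sketch of that conversion step, assuming the standard java.awt API only. The class and method names (PremultipliedCopy, toPremultipliedAbgr) are hypothetical and not from sample.zip. As the stack trace earlier in the thread shows, writing the resulting image with JPEG compression failed in the native JPEG writer, so this only illustrates the conversion, not a fix.

```java
import java.awt.AlphaComposite;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class PremultipliedCopy {

    // Copies src into a new TYPE_4BYTE_ABGR_PRE image of the same size.
    static BufferedImage toPremultipliedAbgr(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_4BYTE_ABGR_PRE);
        Graphics2D g = dst.createGraphics();
        try {
            // Src replaces destination pixels instead of blending with them.
            g.setComposite(AlphaComposite.Src);
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // Stand-in for a scanned page of TYPE_3BYTE_BGR.
        BufferedImage page = new BufferedImage(16, 16, BufferedImage.TYPE_3BYTE_BGR);
        BufferedImage converted = toPremultipliedAbgr(page);
        System.out.println("bands: " + converted.getRaster().getNumBands());
        System.out.println("premultiplied: " + converted.isAlphaPremultiplied());
    }
}
```

Because the source has no alpha, every pixel of the copy ends up fully opaque, which matches what GIMP stores in its re-saved file.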
If I create a multipage TIFF with JPEG compression, it is not readable by Tesseract. It gives this error: "Error in pixReadFromTiffStream: bad tiff file: tiffbpl is too small". Other compressions like LZW or Deflate work just fine.
GIMP also gives an error, but still opens the TIFF. I'll try to translate the message, because my GIMP is set to German. Something like "Incompatible TIFF: Additional Channels without Field ExtraSamples"
My code looks like this
Is there something wrong with my code?
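For readers without sample.zip, here is a minimal sketch of how pages could be combined into a multipage TIFF with the standard javax.imageio sequence API. It is not the code from the report: the class and method names (MultiPageTiffSketch, writePages, countPages) are hypothetical, and it assumes a TIFF ImageWriter is installed (the TwelveMonkeys plugin used in this thread, or the one bundled with Java 9 and later). The compression name is passed through as-is, so "LZW", "Deflate" or "JPEG" can be compared.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.ImageOutputStream;

public class MultiPageTiffSketch {

    // Writes all pages into one TIFF file using the given compression type.
    static void writePages(List<BufferedImage> pages, File target, String compression)
            throws IOException {
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName("tiff");
        if (!writers.hasNext()) {
            throw new IOException("No TIFF ImageWriter installed");
        }
        ImageWriter writer = writers.next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionType(compression);

        // Start from an empty file so no stale bytes remain after the new data.
        Files.deleteIfExists(target.toPath());
        try (ImageOutputStream out = ImageIO.createImageOutputStream(target)) {
            writer.setOutput(out);
            writer.prepareWriteSequence(null);
            for (BufferedImage page : pages) {
                writer.writeToSequence(new IIOImage(page, null, null), param);
            }
            writer.endWriteSequence();
        } finally {
            writer.dispose();
        }
    }

    // Reads the file back and returns the number of pages it contains.
    static int countPages(File source) throws IOException {
        try (ImageInputStream in = ImageIO.createImageInputStream(source)) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            if (!readers.hasNext()) {
                throw new IOException("No ImageReader found for " + source);
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(in);
                return reader.getNumImages(true);
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Blank stand-ins for img001.tif and img002.tif.
        BufferedImage first = new BufferedImage(32, 32, BufferedImage.TYPE_3BYTE_BGR);
        BufferedImage second = new BufferedImage(32, 32, BufferedImage.TYPE_3BYTE_BGR);
        File target = File.createTempFile("target", ".tiff");
        writePages(Arrays.asList(first, second), target, "LZW");
        System.out.println("pages written: " + countPages(target));
    }
}
```

Passing the same param to every writeToSequence call keeps the compression identical across pages, which makes it easier to compare a JPEG-compressed file against an LZW one in Tesseract.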