sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Quality degradation due to PIL.Image.save #207

Open bertsky opened 4 years ago

bertsky commented 4 years ago

@sirfz you have mentioned before that using SetImageFile can be better than SetImage when doing layout analysis. I can fully confirm that. There's a big difference for JPEG files between CLI and API segmentation results. That difference vanishes when using file I/O. I have been trying to arbitrarily set the format attribute of the PIL.Image object I am passing to SetImage to different formats, but it does not help.

Here is an example: [image: filemax00005]

If I process that with the CLI (with ALTO renderer to see the segmentation) – or with API and SetImageFile() of the JPEG – I get the following (suboptimal but acceptable) result: [image: tesseract-alto-raw]

But with SetImage() it always degrades to: [image: ocrd-segment-region-with-bbox-raw]

Could that be due to recompression artifacts? Is there anything we could do within _image_buffer as a remedy?

sirfz commented 4 years ago

I'm not familiar with PIL's internals but it does seem to apply some kind of post processing which is altering the image quality, maybe you'd be able to find something about this in PIL's documentation.

bertsky commented 4 years ago

Note: I have been testing whether the problem lies in PIL.Image.save() in _image_buffer: I tried setting compress_level, optimize and dpi for the in-memory PNG serialization. I have even set up SetImageBytes (with bytes_per_pixel according to the image mode). All those efforts had no effect.
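For readers who want to repeat the raw-buffer experiment, here is a minimal sketch of how the SetImageBytes arguments could be derived from a PIL image mode. The helper name raw_buffer_params and the mode table are hypothetical and not part of tesserocr; the sketch also assumes the argument order (imagedata, width, height, bytes_per_pixel, bytes_per_line) and covers only modes whose raw layout is plain interleaved 8-bit channels.

```python
# Hypothetical helper, not part of tesserocr or Pillow.
# Bytes per pixel for PIL modes stored as interleaved 8-bit channels.
BYTES_PER_PIXEL = {"L": 1, "RGB": 3, "RGBA": 4}


def raw_buffer_params(mode, width, height):
    """Return the size parameters describing a raw pixel buffer."""
    if mode not in BYTES_PER_PIXEL:
        raise ValueError("unsupported mode %r; convert the image first" % mode)
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    bytes_per_pixel = BYTES_PER_PIXEL[mode]
    bytes_per_line = width * bytes_per_pixel  # no row padding assumed
    return {
        "width": width,
        "height": height,
        "bytes_per_pixel": bytes_per_pixel,
        "bytes_per_line": bytes_per_line,
        "expected_size": bytes_per_line * height,
    }


# Assumed usage (requires Pillow and tesserocr, so it is left commented out):
#
#   img = Image.open("input.jpg").convert("RGB")
#   p = raw_buffer_params(img.mode, *img.size)
#   data = img.tobytes()
#   assert len(data) == p["expected_size"]
#   api.SetImageBytes(data, p["width"], p["height"],
#                     p["bytes_per_pixel"], p["bytes_per_line"])
```

Checking the buffer length against expected_size before the call is a cheap way to catch a mode mismatch, which would otherwise show up only as a skewed or garbled image inside Tesseract.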

> I'm not familiar with PIL's internals but it does seem to apply some kind of post processing which is altering the image quality, maybe you'd be able to find something about this in PIL's documentation.

Yes, I am still searching for that. Meanwhile, I found the very recent https://github.com/python-pillow/Pillow/issues/3952, which looks very promising/suspicious indeed, but I have not been able to verify it for our case. Maybe my method is still incorrect, though: I converted the JPEG image (which I assume is truly sRGB, although JPEG itself does not distinguish linear from non-linear encoding) via ImageMagick to a TIFF with linear RGB colorspace. (That is: convert input.jpg -colorspace RGB output.tif. If the input was actually already linear, this double conversion would effectively reduce the contrast.) The resulting segmentation is this: [image: ocrd-segment-region-with-bbox-raw-linearrgb]

This is different from the segmentation I get on the CLI and with SetImageFile (see above). (And yes, you could say it is worse.)
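To make the concern about double conversion concrete, here is a minimal sketch of the standard sRGB-to-linear transfer function applied to a single channel value. It is only an illustration of the formula, under the assumption that the ImageMagick conversion above applies this curve; it does not reproduce ImageMagick's actual implementation.

```python
def srgb_to_linear(c):
    """Map one sRGB-encoded channel value in [0, 1] to linear light."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("channel value must be in [0, 1]")
    if c <= 0.04045:
        # Linear segment near black.
        return c / 12.92
    # Power-law segment for the rest of the range.
    return ((c + 0.055) / 1.055) ** 2.4


mid_gray = 0.5
once = srgb_to_linear(mid_gray)   # correct if the input really was sRGB
twice = srgb_to_linear(once)      # what happens if it was already linear
print("once: %.4f, twice: %.4f" % (once, twice))
```

Applying the curve a second time pushes mid-tones much further toward black, so an image that was already linear would come out visibly darker and flatter in the shadows after the conversion.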

bertsky commented 4 years ago

I believe I can also rule out https://github.com/python-pillow/Pillow/issues/3651, because the problem remains even with Pillow 7.0 built against libjpeg9-dev (PIL.Image.core.jpeglib_version == '9.0').

bertsky commented 4 years ago

Also, interestingly, the thresholded image generated internally (from global Otsu binarization) does not look all that different between CLI/SetImageFile and API/SetImage – the only directly perceptible differences are at the vertical separator lines:

[image: phys00005_from_SetImageFile] [image: phys00005_from_SetImage]

The difference image (from ImageMagick compare) also shows quite a lot of noise spread over the blob boxes, which I cannot explain (but it must come from the JPEG compression): [image: difference2]
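A comparison like the one above can also be quantified without ImageMagick. The following is a minimal sketch, assuming both thresholded images are available as raw 8-bit single-channel buffers of identical size (for example from PIL's Image.tobytes() on mode "L" images); the function name diff_stats is hypothetical.

```python
def diff_stats(a, b, width, height):
    """Count differing pixels and return their bounding box.

    a and b are raw 8-bit single-channel buffers in row-major order.
    The bounding box is (left, top, right, bottom) with exclusive
    right/bottom edges, or None if the buffers are identical.
    """
    size = width * height
    if len(a) != size or len(b) != size:
        raise ValueError("buffers must hold exactly width * height bytes")
    count = 0
    left, top, right, bottom = width, height, -1, -1
    for i, (pa, pb) in enumerate(zip(a, b)):
        if pa != pb:
            count += 1
            y, x = divmod(i, width)
            left, right = min(left, x), max(right, x)
            top, bottom = min(top, y), max(bottom, y)
    bbox = (left, top, right + 1, bottom + 1) if count else None
    return count, bbox


# Tiny 3x2 example: the two buffers differ in the last two pixels.
img_a = bytes([0, 0, 0, 0, 255, 0])
img_b = bytes([0, 0, 0, 0, 0, 255])
print(diff_stats(img_a, img_b, 3, 2))
```

A differing-pixel count spread over the whole page would point to a global effect such as decoding or colorspace handling, whereas a small bounding box would point to a local one.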

amitdo commented 4 years ago

https://pillow.readthedocs.io/en/5.1.x/handbook/image-file-formats.html#jpeg

The save() method supports the following options:

quality: The image quality, on a scale from 1 (worst) to 95 (best). The default is 75. Values above 95 should be avoided; 100 disables portions of the JPEG compression algorithm, and results in large files with hardly any gain in image quality.

bertsky commented 4 years ago

@amitdo Thanks for your proposal, but the context here is already saving to PNG to prevent any further quality degradation on the way from PIL.Image to leptonica.pix. I did not even consider saving as JPEG because that would inevitably make matters worse (as there is no lossless setting as in J2K or PNG).

Just did the tests again and can still confirm everything I said above.

bertsky commented 3 years ago

BTW, another lead that I had a year ago was that in Leptonica's pngio.c there were some subtle but significant differences between its pixReadMemPng and pixReadStreamPng (which IIRC appeared to have been caused by changes to the one not being synchronized to the other over the last 4 years). This would also explain why the same difference arises between SetImage (from BytesIO with Image.save(format='PNG')) and SetImageFile (also saving to PNG).

If I find the time, I'll investigate further (and probably propose a fix in Leptonica).