madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

Check whether tesseract supports jpeg2000 or not #419

Open polyzen opened 2 years ago

polyzen commented 2 years ago

pytesseract.pytesseract.TesseractError: (1, 'Error in pixReadStreamJp2k: function not present Error in pixReadStream: jp2: no pix returned Error in pixRead: pix not read Error during processing.')

pytesseract 0.3.10 tesseract 5.1.0 pillow 9.0.1 openjpeg2 2.4.0 pytest 7.1.0 python 3.10.2

Old title: test_image_to_string_with_image_type[jpeg2000] failure with tesseract >4.1.x

bozhodimitrov commented 2 years ago

This error is related to tesseract itself. Which version is that? Also, is there a sample image that causes the error?

polyzen commented 2 years ago

Oh right: tesseract 5.1.0

The image used by the test: https://github.com/madmaze/pytesseract/blob/v0.3.10/tests/data/test.jpeg2000

bozhodimitrov commented 2 years ago

Well, hmmm. CI on master passes, so I am not sure what is going on there. PS: Yep, your tesseract version is new enough and CI still uses 4.1.x

At this point, I would check what changed in 5.1.0 that could break jpeg2000 support, because 4.x clearly works with jpeg2000. It might be the imaging library support in Tesseract or something like that.

Have you tried using tesseract directly with the jpeg2000 image?

polyzen commented 2 years ago

> Have you tried using tesseract directly with the jpeg2000 image?

I haven't used tesseract yet; I only build pytesseract to provide it as an optional dependency for urlwatch in the Arch repos.

bozhodimitrov commented 2 years ago

At the moment, I don't have tesseract 5.1.0 or an Arch instance around to test whether this is a pytesseract-related or a tesseract-specific issue. When I have time, I will try to boot up a container with that setup in order to check.

mandree commented 2 years ago

Same issue here. I debugged it, and in my case the root cause was determined as follows:

The remedy for me was to recompile leptonica with OpenJPEG 2.4.0 support.

However for py-pytesseract, it should skip the test if there are indications that tesseract does not support JPEG2000.

bozhodimitrov commented 2 years ago

Thank you for investigating that, @mandree. I am not sure if there is a nice way to ask tesseract whether that is the case or not. Sadly, pytesseract is designed as a thin wrapper around the tesseract executable and doesn't provide any real integration.

mandree commented 2 years ago

You can query tesseract with -v or --version, apparently. See the line right below leptonica: it mentions libopenjp2 (or not).

First two examples from FreeBSD 13.0 amd64, third and last example on Fedora 35 x86_64.

With JPEG2000 support:

$ tesseract -v 
tesseract 5.1.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37+apng : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found OpenMP 201811
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
 Found libcurl/7.82.0 OpenSSL/1.1.1k zlib/1.2.11 libssh2/1.10.0 nghttp2/1.46.0

And without:

$ tesseract -v 
tesseract 5.1.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.3) : libpng 1.6.37+apng : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found OpenMP 201811
 Found libarchive 3.6.0 zlib/1.2.11 liblzma/5.2.5 bz2lib/1.0.8 liblz4/1.9.3 libzstd/1.5.2
 Found libcurl/7.82.0 OpenSSL/1.1.1k zlib/1.2.11 libssh2/1.10.0 nghttp2/1.46.0

Fedora Linux:

$ tesseract -v
tesseract 4.1.3
 leptonica-1.81.1
  libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 2.1.0) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
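
The check described in this comment can be sketched in Python. This is only a minimal sketch, assuming the `--version` output keeps the layout shown in the examples above and that leptonica lists `libopenjp2` only when it was built with OpenJPEG. The names `tesseract_version_text` and `has_jpeg2000_support` are hypothetical and not part of pytesseract.

```python
import subprocess


def tesseract_version_text(cmd="tesseract"):
    """Return whatever `tesseract --version` prints (hypothetical helper)."""
    result = subprocess.run(
        [cmd, "--version"],
        stdout=subprocess.PIPE,
        # Merge stderr into stdout in case a build prints the version there.
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return result.stdout


def has_jpeg2000_support(version_text):
    """Assumption: 'libopenjp2' appears only in JPEG2000-capable builds."""
    return "libopenjp2" in version_text
```

In a test suite, the result of `has_jpeg2000_support(tesseract_version_text())` could feed a `pytest.mark.skipif` condition on the jpeg2000 test case, so the test is skipped rather than failed on builds without OpenJPEG.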

mandree commented 2 days ago

Hi @bozhodimitrov you marked this as completed but I do not see relevant commits nor a comment. In what way was this fixed?

bozhodimitrov commented 22 hours ago

> Hi @bozhodimitrov you marked this as completed but I do not see relevant commits nor a comment. In what way was this fixed?

Hi, this is an old issue, and it is just closed, not completed. pytesseract does the right thing by notifying users of the underlying tesseract error. From there on, it is the responsibility of the user to update their stack with supported third-party components.

Unless you want to make a PR that parses all supported formats from the --version output, does the actual checking, deactivates the affected functions, handles the response to the user, and adds tests for all of this, all I can do is convert it to a Feature Request.

The current error report is enough for all users who search for this specific error to find the workaround that you all shared, which means there is no point in this issue staying open anymore.

Let me know what you think.
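
The format parsing mentioned in this comment could start from something like the following. It is a rough sketch, assuming the library list is the ` : `-separated line directly below the `leptonica` line, as in the outputs posted earlier in the thread. The function name `parse_leptonica_libs` is hypothetical; nothing like it exists in pytesseract.

```python
def parse_leptonica_libs(version_text):
    """Map library name -> version string from `tesseract --version` text.

    Assumption: the imaging libraries are listed on the line that
    immediately follows the 'leptonica-...' line, separated by ' : '.
    """
    lines = version_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith("leptonica") and i + 1 < len(lines):
            libs = {}
            for entry in lines[i + 1].split(" : "):
                # First token is the library name, the rest is its version.
                parts = entry.strip().split(None, 1)
                if parts:
                    libs[parts[0]] = parts[1] if len(parts) > 1 else ""
            return libs
    # No leptonica line found: report no known libraries.
    return {}
```

A caller could then test `"libopenjp2" in parse_leptonica_libs(text)` before attempting to process a JPEG2000 image. How the result would be surfaced to users is left open here, as it is in the discussion.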