sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Missing support for Tesseract5? #338

Closed dickreuter closed 9 months ago

dickreuter commented 11 months ago

Is there no support for Tesseract 5?

In this pipeline I install tesseract with chocolatey. That works fine, and it installs tesseract 5, but then tesserocr gives the following error: Supporting tesseract v3.04.00

Collecting tesserocr (from -r requirements.txt (line 31))
Downloading tesserocr-2.6.2.tar.gz (58 kB)
---------------------------------------- 58.9/58.9 kB 3.0 MB/s eta 0:00:00
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
error: subprocess-exited-with-error

Getting requirements to build wheel did not run successfully. exit code: 1

[54 lines of output]
Failed to extract tesseract version number from: tesseract v5.3.3.20231005

leptonica-1.83.1

libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.4) : libpng 1.6.40 : libtiff 4.6.0 : zlib 1.2.13 : libwebp 1.3.2 : libopenjp2 2.5.0

Found AVX2

Found AVX

Found FMA

Found SSE4.1

Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.4 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5

Found libcurl/8.3.0 Schannel zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.11.0

Supporting tesseract v3.04.00
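
The "Failed to extract tesseract version number" line suggests the build step could not parse the first line of the version output, which here carries a date suffix (v5.3.3.20231005). Below is a minimal sketch of a more tolerant parser. `parse_tesseract_version` is a hypothetical helper written for illustration and is not taken from tesserocr's build script.

```python
import re

# Hypothetical helper: pull "major.minor.patch" out of `tesseract --version` text.
# Anything after the patch number (such as a date suffix) is ignored.
_VERSION_RE = re.compile(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", re.IGNORECASE)


def parse_tesseract_version(version_output):
    """Return (major, minor, patch), or None if no version is found."""
    match = _VERSION_RE.search(version_output)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


sample = "tesseract v5.3.3.20231005\n leptonica-1.83.1"
print(parse_tesseract_version(sample))  # (5, 3, 3)
```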

winstxnhdw commented 11 months ago

The maintainer is long gone. Anyways, since you are on Windows, you shouldn't need to pre-install Tesseract. For Windows, the Tesseract model is bundled with the tesserocr wheel. See here. You may still need to install the relevant tessdata though.
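
One way to check which tessdata languages are actually installed is to list the .traineddata files in the data directory. This is a sketch only: `list_traineddata` is a hypothetical helper, and it assumes the TESSDATA_PREFIX environment variable points directly at the directory holding the .traineddata files.

```python
import os
from pathlib import Path


def list_traineddata(tessdata_dir):
    """Return sorted language codes for the *.traineddata files in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    return sorted(p.stem for p in directory.glob("*.traineddata"))


# Assumption: TESSDATA_PREFIX names the directory that contains the data files.
prefix = os.environ.get("TESSDATA_PREFIX")
if prefix:
    print(list_traineddata(prefix))
else:
    print("TESSDATA_PREFIX is not set")
```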

zdenop commented 11 months ago

tesserocr supports Tesseract 5; see the tesserocr code.

Building tesserocr from source (tesserocr-2.6.2.tar.gz) also requires the tesseract development files (or building leptonica and tesseract from source); otherwise the tesserocr build fails. Details are in the README.

winstxnhdw commented 11 months ago

He clearly isn't building tesserocr from source, so there's no need for him to install leptonica and tesseract.

dickreuter commented 11 months ago

I’m trying to simply pip install it with a GitHub pipeline. Any help is greatly appreciated.

https://github.com/dickreuter/Poker/blob/master/.github/workflows/windows-build.yml


winstxnhdw commented 11 months ago

@dickreuter I have sent you a PR regarding the pipeline.

winstxnhdw commented 11 months ago

Also, I noticed that you have libleptonica and libtesseract in your Ubuntu Docker builds. You can remove them safely for faster builds and a smaller image size as they are now bundled into the tesserocr installation.

zdenop commented 11 months ago

If this is correct:

Downloading tesserocr-2.6.2.tar.gz

then he is 100% building from source. Maybe not intentionally, but this is source code, not a wheel (binary build)...
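
The distinction can be read straight off the file name in the pip log. A small sketch follows; `distribution_kind` is a hypothetical helper for illustration, not part of pip.

```python
def distribution_kind(filename):
    """Classify a downloaded distribution file by its extension."""
    if filename.endswith(".whl"):
        return "wheel"  # prebuilt package; pip installs it without building
    if filename.endswith((".tar.gz", ".zip")):
        return "sdist"  # source archive; pip has to build it locally
    return "unknown"


print(distribution_kind("tesserocr-2.6.2.tar.gz"))  # sdist
```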

winstxnhdw commented 11 months ago

Collecting tesserocr (from -r requirements.txt (line 31))

The log here already tells you that he is doing a pip install from requirements.txt. Also, circling back to your earlier point, there's no need to install leptonica and tesseract anymore. The README is outdated.

I am using tesserocr without installing those dependencies in my Examplify app.

zdenop commented 11 months ago

And??? pip invokes a build from source if it does not find a wheel... Are you familiar with the tools you are trying to use?

zdenop commented 11 months ago

What exactly is outdated in README?

winstxnhdw commented 11 months ago

And??? pip invokes a build from source if it does not find a wheel...

Why does this matter? OP is using Windows and installing with pip, obviously expecting a binary build, which there is. Just that the maintainer's setup.py doesn't pull the wheels for Windows for whatever reason.

What exactly is outdated in README?

The entire requirements section. Instead, he should add that to a section specifically for building from source / development.

dickreuter commented 11 months ago

Much appreciated. Merged the PR.


zdenop commented 11 months ago

The entire requirements section.

Seriously?? This one?

pip
Download the wheel file corresponding to your Windows platform and Python installation from [simonflueckiger/tesserocr-windows_build/releases](https://github.com/simonflueckiger/tesserocr-windows_build/releases) and install them via:

> pip install <package_name>.whl

Do you understand that text? What is outdated there? Please state facts, not vague accusations.

Just that the maintainer's setup.py doesn't pull

tesserocr (this project where the issue was created) NEVER produced a Windows binary version. It was always created externally.

the wheels for Windows for whatever reason.

Whatever the reason, the latest Windows wheel is 2.6.0, and that is not a problem if somebody knows how to write requirements.txt correctly.

winstxnhdw commented 11 months ago

It is truly amazing how you missed this entire part

Requires libtesseract (>=3.04) and libleptonica (>=1.71).

On Debian/Ubuntu:

$ apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
You may need to manually compile tesseract for a more recent version. Note that you may need to update your LD_LIBRARY_PATH environment variable to point to the right library versions in case you have multiple tesseract/leptonica installations.

tesserocr (this project where the issue was created) NEVER produced a Windows binary version. It was always created externally.

Exactly, and that's the problem. If you are going to commit to supporting a platform, the maintainer should do it well.

zdenop commented 11 months ago

It is truly amazing how you missed this entire part

I did not miss it. It is correct and relevant. Or do you claim you can run tesserocr on Debian without these libraries???

Exactly, and that's the problem. If you are going to commit to supporting a platform, the maintainer should do it well.

It is not a problem. E.g. tesseract and leptonica support many platforms but they never provide binary packages, just source code.

winstxnhdw commented 11 months ago

Or do you claim you can run tesserocr on Debian without these libraries???

I am just saying that there is no longer a need to explicitly install these dependencies. You were even a participant on the PR for this change.

It is not a problem. E.g. tesseract and leptonica support many platforms but they never provide binary packages, just source code.

We can agree to disagree then. I believe it's the maintainer's responsibility to ensure that the DX for installing their libraries should always be seamless. In one of my projects, I made sure to bundle the nvidia cublas and cudnn libraries along with the wheel. I know some people may argue that it could be a redundant install if the user already has the dependencies installed in the machine, but relying on the user's PATH to properly resolve these dependencies, in my experience and many others, usually just leads to pain.

To reiterate, the only reason why I and many others are using this library instead of pytesseract is because the OCR engine is bundled within the installation. That brings several advantages. For one, I don't have to add a layer to my Docker image to install these dependencies, and I don't have to worry about whether my OS has the dependencies installed in the PATH that tesserocr is expecting.

zdenop commented 11 months ago

am just saying that there is no longer a need to explicitly install

... until you start to face the problems - see e.g. https://github.com/sirfz/tesserocr/issues/337. Other problems were reported for Mac. Distributing your own binary libraries on Linux is not a good idea. The Linux philosophy is to use system shared libraries => tesserocr should be linked against the system leptonica and tesseract, not against a custom build.
pip install --no-binary tesserocr tesserocr is the right way to install tesserocr on Linux and similar systems (macOS, FreeBSD). Windows is a different problem because ... it is Windows.
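
For scripted installs, the same choice can be made explicit when the pip command is built. This is a sketch under assumptions: `pip_install_command` is a hypothetical helper, while `--no-binary` and `--only-binary` are existing pip options that take the package name they apply to.

```python
import sys


def pip_install_command(package, from_source):
    """Build a pip command that forces a source build or a prebuilt wheel."""
    flag = "--no-binary" if from_source else "--only-binary"
    # The package name appears twice: once as the flag's argument,
    # once as the requirement to install.
    return [sys.executable, "-m", "pip", "install", flag, package, package]


# Show the arguments after the interpreter path.
print(" ".join(pip_install_command("tesserocr", from_source=True)[1:]))
# -m pip install --no-binary tesserocr tesserocr
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. With `--only-binary`, pip reports an error when no matching wheel exists instead of quietly falling back to a source build.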

...pytesseract is because the OCR engine is bundled within the installation

pytesseract does not bundle OCR - it wraps the tesseract executable (i.e. you need to install tesseract separately) while tesserocr wraps (and links) the tesseract library. As far as I understand, pytesseract decided to go this way to avoid problems with distributing binary libraries, dependencies, security etc. (i.e. it leaves all those problems to the tesseract packagers)...
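
The difference shows up in what each wrapper needs on the machine. A rough sketch using only the standard library follows; `ocr_backend_status` is a hypothetical helper, and it only checks for presence, not whether the shared libraries actually load.

```python
import importlib.util
import shutil


def ocr_backend_status():
    """Report whether the prerequisites of each wrapper are present."""
    return {
        # pytesseract shells out to the tesseract program, so that program
        # must be on PATH (or be configured explicitly in pytesseract).
        "tesseract_executable_on_path": shutil.which("tesseract") is not None,
        # tesserocr is a compiled extension module linked against libtesseract.
        "tesserocr_installed": importlib.util.find_spec("tesserocr") is not None,
    }


print(ocr_backend_status())
```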

I believe it's the maintainer's responsibility to ensure that the DX for installing their libraries should always be seamless

No. It is the packager's responsibility. Packager != maintainer. There is a split of tasks and responsibilities, and that is right. GTK, pango, gnome, KDE maintainers do not care if you are able to install their products/libraries on Windows etc... The same problem exists with Windows or macOS apps and libs.

winstxnhdw commented 11 months ago

pytesseract does not bundle OCR - it wraps the tesseract executable (i.e. you need to install tesseract separately) while tesserocr wraps (and links) the tesseract library.

You misread me. I am saying that I prefer tesserocr over pytesseract because it links the tesseract library.

... until you start to face the problems - see e.g. https://github.com/sirfz/tesserocr/issues/337.

Is this issue not because the maintainer failed to properly pre-compile tesseract in the proper environment?

GTK, pango, gnome, KDE maintainers do not care if you are able to install their products/libraries on Windows etc..

And you're right, they don't have to because they do not explicitly support these platforms. This is unlike tesserocr which explicitly mentions support for these platforms in the README. In this case, this library is playing the role of the Packager.

All I am saying is that tesserocr's DX is almost there. Just update the README and fix the automated CI jobs that pre-compile the tesseract library so that everyone gets the full feature set.