Closed dickreuter closed 9 months ago
The maintainer is long gone. Anyways, since you are on Windows, you shouldn't need to pre-install Tesseract. For Windows, the Tesseract model is bundled with the tesserocr wheel. See here. You may still need to install the relevant tessdata though.
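One way to check whether the language data is actually present is to look for `.traineddata` files. A minimal sketch, assuming the data lives in the directory named by the `TESSDATA_PREFIX` environment variable (a common Tesseract convention; where a given wheel really looks may differ). The helper name `list_traineddata` is made up for illustration.

```python
import os
from pathlib import Path


def list_traineddata(tessdata_dir=None):
    """Return the language codes found as *.traineddata files in a directory.

    Falls back to the TESSDATA_PREFIX environment variable when no directory
    is given. Returns an empty list if the directory is unset or missing.
    """
    directory = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not directory:
        return []
    path = Path(directory)
    if not path.is_dir():
        return []
    # "eng.traineddata" -> "eng"
    return sorted(p.stem for p in path.glob("*.traineddata"))


if __name__ == "__main__":
    print(list_traineddata())
```

An empty list here would suggest the tessdata still needs to be downloaded, or that the variable points somewhere else.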
tesserocr supports tesseract 5; see the tesserocr code.
Building tesserocr from source (tesserocr-2.6.2.tar.gz) also requires the tesseract development files (or building leptonica and tesseract from source); otherwise the tesserocr build fails. Details are in the README.
He clearly isn't building tesserocr from source, so there's no need for him to install leptonica and tesseract.
I’m trying to simply pip install it with a GitHub pipeline. Any help is greatly appreciated.
https://github.com/dickreuter/Poker/blob/master/.github/workflows/windows-build.yml
@dickreuter I have sent you a PR regarding the pipeline.
Also, I noticed that you have libleptonica and libtesseract in your Ubuntu Docker builds. You can remove them safely for faster builds and a smaller image size as they are now bundled into the tesserocr installation.
If this is correct:
Downloading tesserocr-2.6.2.tar.gz
then he is 100% building from source. Maybe not intentionally, but this is a source distribution, not a wheel (binary build)...
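The file name in the pip log is enough to tell the two apart. A minimal sketch; the helper name `distribution_kind` is made up for illustration, and it only looks at the extension:

```python
def distribution_kind(filename):
    """Classify a package file as a wheel or a source distribution by extension."""
    if filename.endswith(".whl"):
        return "wheel"  # prebuilt binary, installed without compiling
    if filename.endswith((".tar.gz", ".zip")):
        return "sdist"  # source archive, pip has to build it locally
    return "unknown"


print(distribution_kind("tesserocr-2.6.2.tar.gz"))  # sdist
```

When pip reports downloading a `.tar.gz`, it found no compatible wheel for that platform and Python version and fell back to building from source.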
Collecting tesserocr (from -r requirements.txt (line 31))
The log here already tells you that he is doing a pip install from requirements.txt. Also, circling back to your earlier point, there's no need to install leptonica and tesseract anymore. The README is outdated.
I am using tesserocr without installing those dependencies in my Examplify app.
And??? pip invokes a build from source if it does not find a wheel... Are you familiar with the tools you are trying to use?
What exactly is outdated in README?
> And??? pip invokes a build from source if it does not find a wheel...

Why does this matter? OP is using Windows and installing with pip, obviously expecting a binary build, which there is. Just that the maintainer's setup.py doesn't pull the wheels for Windows for whatever reason.

> What exactly is outdated in README?
The entire requirements section. Instead, he should add that to a section specifically for building from source / development.
Much appreciated. Merged the PR.
> The entire requirements section.
Seriously?? This one?
pip
Download the wheel file corresponding to your Windows platform and Python installation from [simonflueckiger/tesserocr-windows_build/releases](https://github.com/simonflueckiger/tesserocr-windows_build/releases) and install them via:
> pip install <package_name>.whl
Do you understand that text? What is outdated there? Please state facts, not vague accusations.
> Just that the maintainer's setup.py doesn't pull

tesserocr (this project, where the issue was created) NEVER produced a Windows binary version. It was always created externally.

> the wheels for Windows for whatever reason.

Whatever the reason => the latest Windows wheel is 2.6.0. And it is not a problem if somebody knows how to write requirements.txt correctly.
It is truly amazing how you missed this entire part
Requires libtesseract (>=3.04) and libleptonica (>=1.71).
On Debian/Ubuntu:
$ apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
You may need to manually compile tesseract for a more recent version. Note that you may need to update your LD_LIBRARY_PATH environment variable to point to the right library versions in case you have multiple tesseract/leptonica installations.
> tesserocr (this project, where the issue was created) NEVER produced a Windows binary version. It was always created externally.
Exactly, and that's the problem. If you are going to commit to supporting a platform, the maintainer should do it well.
> It is truly amazing how you missed this entire part

I did not miss it. It is correct and relevant. Or do you claim you can run tesserocr on Debian without these libraries???

> Exactly, and that's the problem. If you are going to commit to supporting a platform, the maintainer should do it well.

It is not a problem. E.g. tesseract and leptonica support many platforms but they never provide binary packages, just source code.
> Or do you claim you can run tesserocr on Debian without these libraries???
I am just saying that there is no longer a need to explicitly install these dependencies. You were even a participant on the PR for this change.
> It is not a problem. E.g. tesseract and leptonica support many platforms but they never provide binary packages, just source code.

We can agree to disagree then. I believe it's the maintainer's responsibility to ensure that the DX for installing their libraries should always be seamless. In one of my projects, I made sure to bundle the NVIDIA cuBLAS and cuDNN libraries along with the wheel. I know some people may argue that it could be a redundant install if the user already has the dependencies installed on the machine, but relying on the user's PATH to properly resolve these dependencies, in my experience and that of many others, usually just leads to pain.
To reiterate, the only reason why I, and many others, are using this library instead of pytesseract is because the OCR engine is bundled within the installation. That can lead to many advantages. For one, I don't have to add a layer to my Docker image for installing these dependencies and I don't have to worry about whether my OS has or has not installed the dependencies in the PATH that tesserocr is expecting.
> I am just saying that there is no longer a need to explicitly install

... until you start to face the problems - see e.g. https://github.com/sirfz/tesserocr/issues/337. Other problems were reported for Mac. Distributing your own binary libraries on Linux is not a good idea. The Linux philosophy is to use system shared libraries => tesserocr should be linked against the system leptonica and tesseract and not against their custom build.
pip install --no-binary tesserocr tesserocr
is the right way to install tesserocr on Linux and similar systems (macOS, FreeBSD). Windows is the other problem because ... it is Windows.
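Whichever way it was installed, tesserocr can report the tesseract and leptonica versions it is linked against, which helps tell a bundled build from a system-linked one. A minimal sketch, assuming `tesserocr.tesseract_version()` behaves as shown in the project README; the wrapper name `linked_tesseract_version` is made up for illustration.

```python
def linked_tesseract_version():
    """Return tesserocr's version banner, or None if tesserocr cannot be imported."""
    try:
        import tesserocr
    except ImportError:
        # Not installed, or the extension failed to load its shared libraries.
        return None
    return tesserocr.tesseract_version()


if __name__ == "__main__":
    print(linked_tesseract_version())
```

Comparing this banner with the output of the system `tesseract --version` shows whether the two builds match.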
> ...pytesseract is because the OCR engine is bundled within the installation

pytesseract does not bundle OCR - it wraps the tesseract executable (i.e. you need to install tesseract separately), while tesserocr wraps (and links) the tesseract library. As far as I understand, pytesseract decided to go this way to avoid problems with distributing binary libraries, dependencies, security etc. (i.e. it leaves all problems to the tesseract packagers)...
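The difference shows up in what each wrapper needs at runtime. A minimal sketch using only the standard library; the function name `ocr_backend_status` and the dictionary keys are made up for illustration:

```python
import importlib.util
import shutil


def ocr_backend_status():
    """Report which prerequisites for each wrapper are present on this machine."""
    return {
        # pytesseract shells out to the tesseract executable, so it must be on PATH.
        "tesseract_executable": shutil.which("tesseract"),
        # tesserocr is a compiled extension linked against libtesseract;
        # this only checks that the module can be found, not that it loads.
        "tesserocr_module_found": importlib.util.find_spec("tesserocr") is not None,
    }


if __name__ == "__main__":
    print(ocr_backend_status())
```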
> I believe it's the maintainer's responsibility to ensure that the DX for installing their libraries should always be seamless

No. It is a packager's responsibility. Packager != maintainer. There is a split of tasks and responsibilities, and that is right. GTK, pango, gnome, KDE maintainers do not care if you are able to install their products/libraries on Windows etc... The same problem exists with Windows or macOS apps and libs.
> pytesseract does not bundle OCR - it wraps the tesseract executable (i.e. you need to install tesseract separately), while tesserocr wraps (and links) the tesseract library.

You misread me. I am saying that I prefer tesserocr over pytesseract because it links the tesseract library.
> ... until you start to face the problems - see e.g. https://github.com/sirfz/tesserocr/issues/337.

Is this issue not because the maintainer failed to properly pre-compile tesseract in the proper environment?
> GTK, pango, gnome, KDE maintainers do not care if you are able to install their products/libraries on Windows etc..

And you're right, they don't have to because they do not explicitly support these platforms. This is unlike tesserocr, which explicitly mentions support for these platforms in the README. In this case, this library is playing the role of the Packager.
All I am saying is that tesserocr's DX is almost there. Just update the README and fix the automated CIs that pre-compile the tesseract library so that everyone gets the full feature set.
Is there no support for tesseract 5?
In this pipeline I install tesseract with chocolatey. That works fine, and it installs tesseract 5, but then tesserocr gives the following error: Supporting tesseract v3.04.00
Collecting tesserocr (from -r requirements.txt (line 31))
Downloading tesserocr-2.6.2.tar.gz (58 kB)
---------------------------------------- 58.9/58.9 kB 3.0 MB/s eta 0:00:00
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
error: subprocess-exited-with-error
Getting requirements to build wheel did not run successfully.
exit code: 1
[54 lines of output]
Failed to extract tesseract version number from: tesseract v5.3.3.20231005
leptonica-1.83.1
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.4 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.3.0 Schannel zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.3) libssh2/1.11.0
Supporting tesseract v3.04.00
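The log suggests the build script could not parse the banner `tesseract v5.3.3.20231005` (a leading `v` and a fourth, date-like component) and fell back to v3.04.00. The sketch below is not tesserocr's actual setup.py logic, only an illustration of a parser that tolerates this banner; the name `parse_tesseract_version` and the regular expression are assumptions.

```python
import re

# Accept an optional "v" prefix and ignore anything after major.minor[.patch].
_VERSION_RE = re.compile(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?")


def parse_tesseract_version(banner):
    """Extract (major, minor, patch) from `tesseract --version` output, or None."""
    match = _VERSION_RE.search(banner)
    if match is None:
        return None
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


print(parse_tesseract_version("tesseract v5.3.3.20231005"))  # (5, 3, 3)
```

A parser along these lines would read the Chocolatey build as 5.3.3 instead of failing and assuming 3.04.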