madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0
5.76k stars 715 forks

Added support for image URLs #526

Open marosstruk opened 9 months ago

marosstruk commented 9 months ago

Fixes #415. In the spirit of the repo being a wrapper, I left URL validation to the tesseract-ocr process. If you think it should be handled here, let me know.

stefan6419846 commented 9 months ago

We should probably ensure that the current Tesseract version is indeed capable of using URLs: This requires a minimum version of Tesseract and the binary being linked to libcurl (which can be disabled at build time).

marosstruk commented 9 months ago

I added checks on Tesseract version and presence of libcurl. Not sure why the PR checks are still failing though.

Sources:
https://tesseract-ocr.github.io/tessdoc/ReleaseNotes.html#v411
https://github.com/tesseract-ocr/tesseract/blob/75e6c3ea4c8eae740fb65a84e77dbf0c8d092240/src/api/baseapi.cpp#L1148-L1182
https://groups.google.com/g/tesseract-ocr/c/21bU5swaSnQ/m/bQ1UR7ngIgAJ?pli=1
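For readers following along, the kind of check discussed here could look roughly like the sketch below. It is not the code from this PR. It assumes that URL support arrived in Tesseract 4.1.1 (inferred from the release-notes anchor linked above) and that a libcurl-enabled build mentions `libcurl` in its `tesseract --version` output. The names `MIN_URL_VERSION`, `parse_version_output`, `supports_urls` and `read_version_output` are made up for illustration.

```python
import re
import subprocess

# Assumption: URL input first appeared in Tesseract 4.1.1 (see release notes).
MIN_URL_VERSION = (4, 1, 1)


def parse_version_output(output):
    """Return ((major, minor, patch), has_libcurl) from `tesseract --version` text."""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)(?:\.(\d+))?", output)
    if match is None:
        raise ValueError("could not find a Tesseract version string")
    # A missing patch component is treated as 0.
    version = tuple(int(part or 0) for part in match.groups())
    # Assumption: builds linked against libcurl name it in the version banner.
    has_libcurl = "libcurl" in output.lower()
    return version, has_libcurl


def supports_urls(output):
    """True only if the version is new enough AND libcurl is reported."""
    version, has_libcurl = parse_version_output(output)
    return version >= MIN_URL_VERSION and has_libcurl


def read_version_output(binary="tesseract"):
    """Run the binary and return its version banner (stdout and stderr merged)."""
    result = subprocess.run(
        [binary, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return result.stdout
```

With a local install, `supports_urls(read_version_output())` would give the answer for that binary. Keeping the parsing separate from the subprocess call makes the logic testable without Tesseract installed.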

stefan6419846 commented 9 months ago

URL support is a compile-time feature, as previously mentioned: https://github.com/tesseract-ocr/tesseract/blob/ea0b245c43ee850f1e571d469b313b90d58d8b13/CMakeLists.txt#L101 The Tesseract packages for Ubuntu releases before 23.04 are simply not linked against libcurl at build time: https://packages.ubuntu.com/jammy/tesseract-ocr https://packages.ubuntu.com/lunar/tesseract-ocr (See the control files of the source packages as well.)

marosstruk commented 9 months ago

Ah, I see what you mean. The logic I added should correctly check whether Tesseract was built with libcurl, so I can add it to the testcase as well. Granted, that would mean the test never runs until GitHub adds Ubuntu 23.04 as a host for Actions and the action config is updated, but at least it will allow people who compiled Tesseract themselves to use the functionality.

stefan6419846 commented 9 months ago

We should probably use the image from GitHub (for example by uploading it inside a comment here) to not rely on external services.

For testing: I am not sure whether we already want to test this here or add a conditional skip. Compiling Tesseract on GitHub Actions would work, but would probably add considerable overhead to each build.
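A conditional skip of the kind mentioned above could be sketched as follows. This uses the standard library's `unittest` so that it runs anywhere; under pytest, `pytest.mark.skipif` plays the same role. The probe `tesseract_supports_urls` is a hypothetical placeholder, hard-coded to `False` here, and the test body is only a stub.

```python
import unittest


def tesseract_supports_urls():
    """Hypothetical capability probe; hard-coded so the sketch runs anywhere."""
    return False


class UrlInputTest(unittest.TestCase):
    # The condition is evaluated once, when the class body is executed.
    @unittest.skipUnless(tesseract_supports_urls(), "Tesseract lacks URL support")
    def test_image_to_string_with_url(self):
        # Stub: a real test would pass a URL to the OCR call and check the text.
        self.fail("not reached while the probe returns False")


def run_suite():
    """Run the test case and return the collected unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(UrlInputTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

A skipped test is still reported in the run summary, so the gap in coverage stays visible until a runner with a libcurl-enabled Tesseract is available.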

marosstruk commented 9 months ago

I found out that the URL of files attached in comments is actually dynamic and changes over time, so I used a URL pointing to an image stored in the repo itself (I had to split the URL to pass the max line length check). I also added a test to verify the URLSupportException.
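As a rough illustration of the two points above (splitting a long URL to satisfy a line-length check, and testing for the exception), here is a minimal sketch. It is not the PR's code: the `URLSupportException` class below is a stand-in that only borrows the name mentioned above, and `TEST_IMAGE_URL`, `looks_like_url` and `check_input` are hypothetical, as is the example.com address.

```python
from urllib.parse import urlparse


class URLSupportException(RuntimeError):
    """Stand-in for the exception named in this thread; the real one may differ."""


# Adjacent string literals are joined by the compiler, so a long URL can be
# wrapped across lines without any runtime concatenation.
TEST_IMAGE_URL = (
    "https://example.com/"
    "path/to/"
    "test.png"
)


def looks_like_url(value):
    """Cheap heuristic: an http(s) scheme plus a network location."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def check_input(value, url_support):
    """Raise if a URL is given but the Tesseract build cannot fetch URLs."""
    if looks_like_url(value) and not url_support:
        raise URLSupportException("this Tesseract build cannot read image URLs")
    return value
```

The heuristic only decides whether the capability check applies; validating the URL itself is still left to Tesseract, in line with the approach described at the top of the thread.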

I agree that compiling Tesseract for the testcases might be a bit too much. Tbh, there are already other test cases that are skipped due to version atm, so I don't think it's a big issue (it's good futureproofing to have the testcase there), but ofc the final decision lies with the reviewer.

Just for completeness, I am attaching a screenshot of the testcase passing on my machine: [screenshot: test_image_to_string_with_url]

marosstruk commented 9 months ago

@stefan6419846 Please let me know your thoughts on my previous comment.