Allow tesseract 4.0.0alpha to be used with pyocr

openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab

https://gitlab.gnome.org/World/OpenPaperwork/pyocr


Allow tesseract 4.0.0alpha to be used with pyocr #66

Closed ddddavidmartin closed 7 years ago

ddddavidmartin commented 7 years ago

The current tesseract 4.0 version is still in alpha and returns the version string tesseract 4.00.00alpha. This breaks the existing get_version function as it expects integer values only.

To work around this, the pull request takes only the leading digits of the version and returns those.
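A minimal sketch of that idea, not the actual diff: it assumes the version string is split on dots and that each component is reduced to its leading digits before conversion. The names leading_int and parse_version are hypothetical and are not part of pyocr.

```python
import re


def leading_int(component):
    """Return the leading digits of one version component as an int.

    Hypothetical helper for illustration only; not part of pyocr.
    """
    match = re.match(r"\d+", component)
    if match is None:
        raise ValueError("no leading digits in %r" % component)
    return int(match.group(0))


def parse_version(version_string):
    """Parse a string such as '4.00.00alpha' into a tuple of ints."""
    return tuple(leading_int(part) for part in version_string.split("."))


print(parse_version("4.00.00alpha"))  # (4, 0, 0)
print(parse_version("3.04.01"))       # (3, 4, 1)
```

With this approach a suffix such as alpha is simply dropped, so an alpha build and the final release of the same version would compare as equal.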

Note: I haven't really tested how pyocr fares with tesseract 4, but I am using it with paperless and it has worked fine for me so far.

How to test this:

Build and install the current tesseract 4.0.0alpha.
Start consumption with paperless, for example.
Without this change, the current pyocr version fails with pyocr.error.TesseractError: (0, 'Unable to parse Tesseract version (not a number): [4.00.00alpha]')
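The parsing failure itself can be illustrated without tesseract installed. This is only a sketch under the assumption that the old parser converted each dot-separated component with int(); the thread says no more than that it expects integer values. strict_parse is a hypothetical stand-in, not pyocr's code.

```python
def strict_parse(version_string):
    # Hypothetical stand-in for an integer-only version parser.
    return tuple(int(part) for part in version_string.split("."))


try:
    strict_parse("4.00.00alpha")
except ValueError as exc:
    # int("00alpha") is rejected because of the non-digit suffix.
    print("parse failed:", exc)
```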

ddddavidmartin commented 7 years ago

For what it is worth, attached is the test output from a run with tesseract 4: test_output.txt

jflesch commented 7 years ago

First of all, thank you for this contribution.

However, pretty much all the tests are failing because they can't find the data language files:

pyocr.error.TesseractError: (1, b'Error opening data file /usr/local/share/fra.traineddata\n
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata"
directory.\n
Failed loading language \'fra\'\nTesseract couldn\'t load any languages!\nCould not
initialize tesseract.\n')

You need French and Japanese data files (fra and jpn). I would feel more confident if you could pass these tests too before I merge your changes.
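One quick way to check this before running the tests is to look for the two files directly. This is a sketch that assumes the data files are named <lang>.traineddata, as in the error above, and that TESSDATA_PREFIX points at the directory containing them (this may differ between tesseract versions). missing_languages is a hypothetical helper.

```python
import os


def missing_languages(tessdata_dir, languages=("fra", "jpn")):
    """Return the language codes whose .traineddata file is absent."""
    return [
        lang
        for lang in languages
        if not os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))
    ]


if __name__ == "__main__":
    # Falls back to the current directory when TESSDATA_PREFIX is unset.
    tessdata_dir = os.environ.get("TESSDATA_PREFIX", ".")
    print("missing:", missing_languages(tessdata_dir))
```

An empty list means both data files were found in that directory.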

jflesch commented 7 years ago

Actually, after reading your changes, they pretty much can't make things worse .. so, let's merge :-)

jflesch commented 7 years ago

Thank you again :)

ddddavidmartin commented 7 years ago

Hah, thanks! That was quick :)