ddddavidmartin closed 7 years ago
For what it is worth, attached is the test output from a run with tesseract 4: test_output.txt
First of all, thank you for this contribution.
However, pretty much all the tests are failing because they can't find the data language files:
pyocr.error.TesseractError: (1, b'Error opening data file /usr/local/share/fra.traineddata\nPlease make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.\nFailed loading language \'fra\'\nTesseract couldn\'t load any languages!\nCould not initialize tesseract.\n')
You need French and Japanese data files (fra and jpn). I would feel more confident if you could pass these tests too before I merge your changes.
Actually, after reading your changes, they pretty much can't make things worse, so let's merge :-)
Thank you again :)
Hah, thanks! That was quick :)
The current tesseract 4.0 version is still in alpha and returns the version string "tesseract 4.00.00alpha". This breaks the existing get_version function, as it expects integer values only. To work around this, the pull request takes only the leading digits of each version component and returns those.
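For illustration, the "leading digits" approach could look roughly like the sketch below. This is not the code from the pull request or pyocr's actual get_version; the name parse_tesseract_version and its tuple return value are assumptions made only for this example.

```python
import re


def parse_tesseract_version(raw):
    """Hypothetical helper (not pyocr's get_version): turn a version
    string such as "tesseract 4.00.00alpha" into a tuple of integers."""
    parts = raw.strip().split()
    if not parts:
        raise ValueError("Unable to parse Tesseract version: [%s]" % raw)
    # Accept both "tesseract 4.00.00alpha" and a bare "4.00.00alpha".
    token = parts[-1]
    numbers = []
    for component in token.split("."):
        # Keep only the digits at the start of the component, so that
        # "00alpha" is read as 0 instead of raising on the suffix.
        match = re.match(r"\d+", component)
        if match is None:
            raise ValueError("Unable to parse Tesseract version: [%s]" % raw)
        numbers.append(int(match.group(0)))
    return tuple(numbers)


print(parse_tesseract_version("tesseract 4.00.00alpha"))
```

A component with no leading digits at all still raises an error, so a completely unexpected version string is reported rather than silently accepted.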
Note: I haven't really tried out how pyocr fares with tesseract 4. But, I am using it with paperless and it seems to be working fine for me so far.
How to test this:
pyocr.error.TesseractError: (0, 'Unable to parse Tesseract version (not a number): [4.00.00alpha]')