openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Add support for Tesseract version 3.05.00 #62

Closed aszlig closed 7 years ago

aszlig commented 7 years ago

This is a bit more involved, because Tesseract 3.05.00 comes not only with improvements but also with a few quirks we need to deal with.

The first quirk is that the order of the arguments to the tesseract command now matters: the list of configurations has to come at the end of the command line. So we add a new attribute tesseract_flags to the BaseBuilder class that contains a list of all the flags to pass to tesseract. The tesseract_configs attribute remains pretty much the same, but now only contains a list of configs instead of being mixed with flag arguments.
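To illustrate the ordering, here is a minimal sketch of how a command line could be assembled from the two attributes. The class `SketchBuilder` and the function `build_tesseract_command` are hypothetical names made up for this example and are not taken from the actual patch; only the attribute names tesseract_flags and tesseract_configs come from the description above.

```python
class SketchBuilder:
    """Hypothetical stand-in for a builder; not pyocr's real BaseBuilder."""

    def __init__(self, flags, configs):
        # Flags such as ["-l", "eng"] are kept apart from config names
        # such as ["hocr"], so that they can be ordered independently.
        self.tesseract_flags = list(flags)
        self.tesseract_configs = list(configs)


def build_tesseract_command(builder, input_file, output_base):
    """Return an argv list with every flag placed before every config."""
    command = ["tesseract", input_file, output_base]
    command += builder.tesseract_flags    # flags first
    command += builder.tesseract_configs  # configs always last
    return command


builder = SketchBuilder(flags=["-l", "eng", "-psm", "1"], configs=["hocr"])
command = build_tesseract_command(builder, "input.bmp", "output")
```

Keeping the two lists separate means the ordering is enforced in exactly one place, instead of relying on every builder to append its arguments in the right sequence.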

Another quirk has to do with Leptonica >= 1.74, which Tesseract 3.05.00 now requires. Leptonica has special handling of files that reside in /tmp and assumes that they are internal temporary files of Leptonica. To deal with this, we now run Tesseract in a temporary directory that contains the input/output files, and use the relative names of these files, because Leptonica only looks for path names beginning with /tmp.
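The workaround can be sketched roughly as follows. This is an illustration under assumptions, not the code from the patch: `run_with_relative_paths`, `make_command`, and the file names "input.bmp" and "output" are all hypothetical. The point is only that the child process is started with the temporary directory as its working directory and is handed relative names, so it never sees a path beginning with /tmp.

```python
import os
import subprocess
import tempfile


def run_with_relative_paths(image_bytes, make_command, output_ext=".txt"):
    """Run a command inside a private temp dir, passing relative file names.

    make_command(input_name, output_base) is a hypothetical callback that
    returns the argv list to execute.
    """
    with tempfile.TemporaryDirectory(prefix="ocr_") as workdir:
        # The parent process may use the absolute path to write the input.
        with open(os.path.join(workdir, "input.bmp"), "wb") as fd:
            fd.write(image_bytes)
        # The child only ever sees "input.bmp" and "output".
        command = make_command("input.bmp", "output")
        subprocess.run(command, cwd=workdir, check=True)
        with open(os.path.join(workdir, "output" + output_ext), "rb") as fd:
            return fd.read()
```

With Tesseract installed, `make_command` could be something like `lambda i, o: ["tesseract", i, o]`, which writes its text result to output.txt in the working directory.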

Fortunately, the last item we need to address is not really a quirk but an API change. Tesseract 3.05.00 adds a new function called TessBaseAPIDetectOrientationScript(), which no longer fills an OSResults object but instead returns the values we're interested in directly through pointer arguments. We need to use this new function because the old function TessBaseAPIDetectOS() now always returns false.
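A rough ctypes sketch of calling the new function is shown below. The argument list (handle, then pointers for orientation in degrees, orientation confidence, script name, and script confidence) is an assumption based on my reading of the Tesseract C API header and should be checked against the capi.h of the installed version. The helper names `declare_prototype` and `detect_orientation` are hypothetical and not from the patch.

```python
import ctypes


def declare_prototype(lib):
    """Declare the assumed C signature on a loaded libtesseract (ctypes.CDLL)."""
    func = lib.TessBaseAPIDetectOrientationScript
    func.argtypes = [
        ctypes.c_void_p,                 # TessBaseAPI* handle
        ctypes.POINTER(ctypes.c_int),    # orient_deg
        ctypes.POINTER(ctypes.c_float),  # orient_conf
        ctypes.POINTER(ctypes.c_char_p), # script_name
        ctypes.POINTER(ctypes.c_float),  # script_conf
    ]
    func.restype = ctypes.c_bool


def detect_orientation(lib, handle):
    """Return {'angle', 'confidence'}, or None if unsupported or failed."""
    if not hasattr(lib, "TessBaseAPIDetectOrientationScript"):
        # Older library: the caller would fall back to TessBaseAPIDetectOS().
        return None
    orient_deg = ctypes.c_int(0)
    orient_conf = ctypes.c_float(0.0)
    script_name = ctypes.c_char_p(None)
    script_conf = ctypes.c_float(0.0)
    ok = lib.TessBaseAPIDetectOrientationScript(
        handle,
        ctypes.pointer(orient_deg),
        ctypes.pointer(orient_conf),
        ctypes.pointer(script_name),
        ctypes.pointer(script_conf),
    )
    if not ok:
        return None
    return {"angle": orient_deg.value, "confidence": orient_conf.value}
```

Checking for the symbol with `hasattr` is one way to keep a single code path working against both 3.04 and 3.05; comparing the version string reported by the library would be another.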

Ran the test suite successfully with Python 3.5 and both Tesseract 3.04.01 and 3.05.00 except the following tests, which also didn't succeed prior to this commit:

The failure of these test cases is probably related to issue #52, but from looking at the failures it doesn't seem to be related to this change anyway.

jflesch commented 7 years ago

Really good contribution. Thanks :-) And thank you also for taking care of not breaking Tesseract 3.04 support.

I'll test it later with Tesseract 3.05 and make a new release (maybe next week I hope).

<rant>

> Leptonica has special handling of files that reside in /tmp

Yeah, my first question would be "what the f*ck?" .. but I guess they must have their (weird) reasons, and there is nothing we can do about it.

> We need to use this new function because the old function TessBaseAPIDetectOS() now always returns false.

This is the second time I've seen @tesseract-ocr break the C API in such a way. This is getting frustrating. They could simply mark the old function obsolete, and call the new one from the old one. Since we are using Python, it's easy for us to remain compatible, but I seriously wonder how C developers are supposed to handle this kind of change without using dlopen() & friends.

</rant>

Anyway, thank you again for this great contribution. I'll merge it right now :-)

aszlig commented 7 years ago

@jflesch: I guess C programmers simply didn't use the TessBaseAPIDetectOS function because it requires an OSResults C++ object, which is why they changed the API in the first place. Apart from that, if I wanted to use the API from C code, I'd handle that within the build and use the preprocessor to handle the different cases based on the version from pkg-config.

jflesch commented 7 years ago

@aszlig Good points. I totally forgot it was actually a C++ object.

Regarding pkg-config however, there is no .pc file with Libtesseract 3.03 (in Debian at least), so it isn't a valid option if you want to support tesseract 3.04. Also, even if there were one, it would still break the build silently. Tests could (and should) catch it, but still, it's bad practice to break an API when it can be easily avoided.

jflesch commented 7 years ago

Included in Pyocr 0.4.7