openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

[Fix]: Force use of legacy model for OSD #85

Closed by a-pagano 6 years ago

a-pagano commented 6 years ago

Orientation and script detection (OSD) doesn't work yet with the new LSTM models shipped with Tesseract 4.

Calling detect_orientation() thus fails with Tesseract 4.0 when both legacy and LSTM models are present in the TESSDATA_PREFIX directory.

The fix is to force the use of the legacy model until the new models support OSD.

jflesch commented 6 years ago

First of all, thanks for taking the time to contribute.

However, done like that, it will break support for Tesseract 3 (and I need that support for Paperwork ;). You can use get_version(), similar to what can_detect_orientation() does:

command = [TESSERACT_CMD, "input.bmp", 'stdout', "-psm", "0"]
version = get_version()
if version[0] >= 4:
    # XXX: temporary fix to remove once Tesseract 4 is stable
    command += ["--oem", "0"]
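
For reference, the version check above could be wrapped in a small helper so it can be exercised without running Tesseract. This is only a minimal sketch, not pyocr's actual code: build_osd_command and its tesseract_cmd default are hypothetical names, and version is assumed to be a tuple whose first element is the major version, as the snippet's version[0] suggests.

```python
def build_osd_command(version, tesseract_cmd="tesseract"):
    """Build the argument list for an orientation-detection (OSD) run.

    `version` is assumed to be a tuple whose first element is the major
    version, mirroring how the snippet above indexes get_version().
    """
    # Same base command as in the snippet above; page segmentation
    # mode 0 asks Tesseract for orientation and script detection only.
    command = [tesseract_cmd, "input.bmp", "stdout", "-psm", "0"]
    if version[0] >= 4:
        # Force the legacy engine (OEM 0), since OSD does not work
        # with the LSTM models shipped with Tesseract 4.
        command += ["--oem", "0"]
    return command
```

With a Tesseract 3 version tuple the command is left untouched, so only Tesseract 4 and later get the extra "--oem", "0" arguments.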

a-pagano commented 6 years ago

Good catch! Didn't think of that...

I've implemented the changes you suggested :)

jflesch commented 6 years ago

Perfect, thanks :-)