openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Libtesseract: need stress-testing #51

Open jflesch opened 7 years ago

jflesch commented 7 years ago

Someone has reported crashes in Paperwork when running the OCR. They are using Tesseract 3.04.01, so there may be something wrong with the libtesseract binding.

(Note: for now, the preference order has been changed so that pyocr uses tesseract-sh if possible.)

ghost commented 7 years ago

Getting occasional segfaults when using the pyocr.libtesseract tool. Can't pinpoint an exact repeatable cause. Will update if a pattern that triggers the segfault is found.

The other segfault occurs when there is no language data. This one is consistent.

(Screenshot from 2017-03-22 02-05-36)
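As a caller-side workaround for the no-language case, one option is to check which languages the tool reports before running OCR. The following is only a sketch: `ocr_with_language_check` and `MissingLanguageError` are hypothetical helpers (not part of pyocr), and `tool` is assumed to behave like a pyocr tool module such as `pyocr.libtesseract`, exposing `get_available_languages()` and `image_to_string()`.

```python
class MissingLanguageError(RuntimeError):
    """Hypothetical exception: the requested language data is not installed."""


def ocr_with_language_check(tool, image, lang="eng"):
    """Run OCR only if `lang` is reported as available by `tool`.

    `tool` is assumed to look like a pyocr tool module, i.e. to provide
    get_available_languages() and image_to_string(image, lang=...).
    """
    available = tool.get_available_languages()
    if lang not in available:
        # Fail in Python instead of handing a missing language to the binding.
        raise MissingLanguageError(
            "language %r not installed (available: %s)"
            % (lang, ", ".join(sorted(available)) or "none")
        )
    return tool.image_to_string(image, lang=lang)
```

With pyocr and Pillow installed, a call might look like `ocr_with_language_check(pyocr.libtesseract, Image.open("page.png"), lang="eng")`, where `page.png` is a placeholder file name. This does not address the underlying crash; it only avoids reaching it from application code.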

jflesch commented 7 years ago

If you find a pattern, that would be awesome :-)

Noted for the no-language crash. I'll have a look ASAP (probably this weekend, I hope).

jflesch commented 7 years ago

BTW, can you tell me which version of Tesseract you are using, please?

jflesch commented 7 years ago

no-language crash:

ghost commented 7 years ago

Tesseract version is 3.04.01, from Ubuntu's package 3.04.01-4build1.

Thanks for the fix.

We lowered Mayan EDMS (http://www.mayan-edms.com) memory footprint by switching to pyocr's libtesseract, thanks for that too :)

jflesch commented 7 years ago

You're welcome :)

jflesch commented 7 years ago

Hm, maybe the crashes were due to a hack: TessBaseAPIDetectOS() was actually a C++ function. I was using ctypes to access it, and let's just say ctypes is not designed for C++, so it is/was a bit hacky. It may have been the cause of crashes on some systems. Tesseract 3.05.00 included a new replacement function, TessBaseAPIDetectOrientationScript(), that is pure C. @aszlig added support for this new function.
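To illustrate the idea of preferring the pure-C function when the installed library exports it, here is a minimal sketch using only the standard `ctypes` module. It is not pyocr's actual code: `pick_detect_symbol` and `load_libtesseract` are hypothetical helper names, and the sketch only looks the symbols up without calling them.

```python
import ctypes
import ctypes.util

# Preferred first: the pure-C function added in Tesseract 3.05.00,
# then the older function that was reached through the ctypes hack.
CANDIDATES = ("TessBaseAPIDetectOrientationScript", "TessBaseAPIDetectOS")


def pick_detect_symbol(lib, candidates=CANDIDATES):
    """Return the first name in `candidates` that `lib` exports, or None.

    On a ctypes.CDLL object, attribute access raises AttributeError when
    the symbol is missing, so hasattr() works as a presence check.
    """
    for name in candidates:
        if hasattr(lib, name):
            return name
    return None


def load_libtesseract():
    """Try to locate and load libtesseract; return None if it is not found."""
    path = ctypes.util.find_library("tesseract")
    return ctypes.CDLL(path) if path else None
```

Calling `pick_detect_symbol(load_libtesseract())` on a machine with libtesseract installed should indicate which of the two functions that build exposes. Actually calling either function would additionally require declaring its argument and return types, which is left out here.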

I think I will try to switch libtesseract back as default once Tesseract 4 is out.