openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Tests/Tesseract: Differences between Pyocr and reference output #52

Open jflesch opened 7 years ago

jflesch commented 7 years ago

For some reason there are differences between the references and the actual results. And it seems the actual results are good, so it's probably a bug in update_test_data.sh

QuLogic commented 6 years ago

Which tests are known failures? Can they be marked as such at least?

On Fedora with tesseract 3.05.01-1 and cuneiform 1.1.0-25, I get the following failures:

tests.tests_cuneiform.TestTxt:test_french
tests.tests_cuneiform.TestWordBox:test_basic
tests.tests_cuneiform.TestWordBox:test_european
tests.tests_cuneiform.TestWordBox:test_french

tests.tests_libtesseract.TestBasicDoc:test_basic
tests.tests_libtesseract.TestContext:test_version
tests.tests_libtesseract.TestDigitLineBox:test_digits
tests.tests_libtesseract.TestLineBox:test_japanese
tests.tests_libtesseract.TestTxt:test_basic
tests.tests_libtesseract.TestTxt:test_european
tests.tests_libtesseract.TestTxt:test_japanese
tests.tests_libtesseract.TestTxt:test_multi
tests.tests_libtesseract.TestWordBox:test_japanese

tests.tests_tesseract.TestContext:test_version
tests.tests_tesseract.TestDigitLineBox:test_digits
tests.tests_tesseract.TestTxt:test_basic
tests.tests_tesseract.TestTxt:test_european
tests.tests_tesseract.TestTxt:test_japanese
tests.tests_tesseract.TestTxt:test_multi

Compared to NixOS in the linked commits, that means I get a slightly better working cuneiform but tesseract fails the basic, european and multi tests. The basic test seems to have some odd character twiddling with "ocr" vs "cor".
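
Marking the known failures, as asked above, could be sketched with the standard library's `unittest.expectedFailure` decorator. This is only an illustration, not pyocr's actual test code: the class `TestTxtSketch`, the helper `fake_ocr`, and the sample strings are hypothetical stand-ins.

```python
import unittest


def fake_ocr(image_name):
    # Hypothetical stand-in for an OCR call; it mimics the
    # "ocr" vs "cor" character swap described above.
    return {"basic": "cor sample", "digits": "12345"}[image_name]


class TestTxtSketch(unittest.TestCase):
    @unittest.expectedFailure
    def test_basic(self):
        # Known failure: counted as an expected failure, not as an error.
        self.assertEqual(fake_ocr("basic"), "ocr sample")

    def test_digits(self):
        # Unaffected test: still has to pass normally.
        self.assertEqual(fake_ocr("digits"), "12345")


def run_sketch():
    # Load and run only the sketch test case, returning the TestResult.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTxtSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With this marking the run as a whole still counts as successful, and a marked test that unexpectedly starts passing is reported as an unexpected success, which would flag when a reference becomes valid again.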

jflesch commented 6 years ago

This is going to be tricky. The results depend on the exact version of Tesseract and on the exact compilation options of Tesseract and Liblept. I haven't yet found a way to avoid having to manually check the results each time :(

(Note that this is not the topic of this ticket).
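
One possible way to make the comparison less sensitive to the Tesseract version and build options mentioned above is to compare against the reference with a similarity threshold instead of exact equality. Nobody in this thread proposes this; it is a rough sketch using the standard library's `difflib`, and the function names and the threshold value are assumptions made for illustration.

```python
import difflib


def similarity(expected, actual):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return difflib.SequenceMatcher(None, expected, actual).ratio()


def close_enough(expected, actual, threshold=0.9):
    # threshold is an arbitrary illustrative value, not a tuned one.
    return similarity(expected, actual) >= threshold
```

The trade-off is that a loose threshold would also hide small regressions such as the "ocr" vs "cor" swap, so this could complement a manual check of the references but probably not replace it.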

QuLogic commented 6 years ago

Building tesseract 3.05.00 from source (and using leptonica binaries), at least the basic tests pass again. Since the expected results are correct (at least as a human would read them), I think it might be a regression with tesseract. I will bisect and see if I can figure out what's up there.

QuLogic commented 6 years ago

So at least the new failure is tesseract-ocr/tesseract#1253; at what point did the remaining tests pass?

jflesch commented 6 years ago

Quite frankly, I don't even remember the last time I was able to successfully pass all the tests at once :/