openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to GNOME's GitLab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Failed tests #3

Status: Closed (tnorth closed 11 years ago)

tnorth commented 12 years ago

Hi again,

Some tests fail for me on Fedora 16: with tesseract-ocr 3.00, probably because the version is not up to date (the tests seem to expect 3.01), and also with Cuneiform 1.1.0, for some reason, even though 1.1.0 looks like the latest release.

Here is the output related to cuneiform:

$ python run_tests.py 
- OCR: Tesseract
  is_available(): True
  get_version(): (3, 0, 0)
  get_available_languages():  
- OCR: Cuneiform
  is_available(): True
  get_version(): (1, 1, 0)
  get_available_languages():  eng,  ger,  fra,  rus,  swe,  spa,  ita,  ruseng,  ukr,  srp,  hrv,  pol,  dan,  por,  dut,  cze,  rum,  hun,  bul,  slv,  lav,  lit,  est,  tur,  

OCR tool found:
- Tesseract
- Cuneiform

---
Tesseract:
.FF..FFFF.EEEE

[snap old tesseract]

Cuneiform:
.....F..F.
======================================================================
FAIL: test_french (tests.tests_cuneiform.TestTxt)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 73, in test_french
    self.__test_txt('test-french.jpg', 'test-french.txt', 'fra')
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 64, in __test_txt
    self.assertEqual(output, expected_output)
AssertionError: u'Phrase en *an\xe7ais. \navec des accents \n\xe9ph\xe9m\xe8re' != u'Phrase en fran\xe7ais. \navec des accents \n\xe9ph\xe9m\xe8re'
- Phrase en *an\xe7ais. 
?           ^
+ Phrase en fran\xe7ais. 
?           ^^
  avec des accents 
  \xe9ph\xe9m\xe8re

======================================================================
FAIL: test_french (tests.tests_cuneiform.TestWordBox)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 113, in test_french
    self.__test_txt('test-french.jpg', 'test-french.words', 'fra')
  File "/tmp/pyocr/tests/tests_cuneiform.py", line 104, in __test_txt
    self.assertEqual(boxes[i], expected_boxes[i])
AssertionError: <builders.Box object at 0x2558650> != <builders.Box object at 0x24d0f90>

----------------------------------------------------------------------
Ran 10 tests in 2.248s

FAILED (failures=2)

jflesch commented 12 years ago

For Tesseract, this result is to be expected whenever the installed version doesn't exactly match the one used to write the tests. As far as I know, Tesseract 3.00 should work fine even if the tests don't pass, so no need to worry here. If you plan on sending a patch or writing new tests, I think you will have to install Tesseract 3.01 from source ( http://code.google.com/p/tesseract-ocr/ ).
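
One way to make that version dependency explicit would be to skip the exact-match tests when the installed version differs from the reference one, instead of letting them fail. The following is only a rough sketch, not how the pyocr test suite works: `get_installed_version()` is a hypothetical stand-in for the tool's `get_version()` (hard-coded here to the 3.00 reported above), and `(3, 1, 0)` is an assumption about how 3.01 would be reported.

```python
import unittest

# Assumed tuple form of Tesseract 3.01, the version the expected outputs
# were written against according to the comment above.
REFERENCE_VERSION = (3, 1, 0)


def get_installed_version():
    """Hypothetical stand-in for the OCR tool's get_version().

    Hard-coded to the (3, 0, 0) shown in the reporter's output.
    """
    return (3, 0, 0)


def exact_output_expected(installed, reference=REFERENCE_VERSION):
    """Exact string comparisons only make sense on the reference version."""
    return tuple(installed) == tuple(reference)


class TestTxt(unittest.TestCase):
    @unittest.skipUnless(
        exact_output_expected(get_installed_version()),
        "expected outputs were written for Tesseract %s" % (REFERENCE_VERSION,),
    )
    def test_french(self):
        # The strict comparison against the stored expected text would go here.
        self.assertTrue(True)
```

With the hard-coded 3.00, the test is reported as skipped rather than failed, which separates "wrong version for these fixtures" from "the wrapper is broken".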

For Cuneiform, it's a bit more curious. It seems we have the same version, and the tests pass on my end. I think the problem is basically that we are not using the same Linux distribution (I'm using Debian testing), so there may be some patches included in one distribution's package and not in the other's. The training data may also be slightly different.

jflesch commented 12 years ago

By the way, for Cuneiform, I wouldn't worry too much either. The test results show that it worked. It's just that the result is slightly different (and slightly worse) than the one expected by the tests.
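
To put a number on "slightly different", the two strings from the Cuneiform failure above can be compared with the standard library's `difflib`. This is just an illustrative sketch: the project's tests use strict equality, and both the `similarity` helper and the `MIN_SIMILARITY` threshold are made up for this example.

```python
import difflib


def similarity(output, expected):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, output, expected).ratio()


# Strings copied from the AssertionError in the Cuneiform test_french failure.
output = u'Phrase en *an\xe7ais. \navec des accents \n\xe9ph\xe9m\xe8re'
expected = u'Phrase en fran\xe7ais. \navec des accents \n\xe9ph\xe9m\xe8re'

# Hypothetical tolerance, chosen only for illustration.
MIN_SIMILARITY = 0.95

score = similarity(output, expected)
print("similarity: %.3f, acceptable: %s" % (score, score >= MIN_SIMILARITY))
```

A tolerance like this would hide real regressions if set too low, so it is better suited to a quick sanity check across distributions than to replacing the exact comparison.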

tnorth commented 12 years ago

Ok, thanks for looking into it. Indeed, it works.