sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Tesseract 5.0.1 test_LSTM_choices(...) fails #295

Open simonflueckiger opened 2 years ago

simonflueckiger commented 2 years ago

I compiled tesserocr 2.5.2 with Tesseract 5.0.1 on Windows. When executing tesserocr\tests\test_api.py I get the following exception for test_LSTM_choices(...):

FAIL: test_LSTM_choices (tests.test_api.TestTessBaseApi)
Test GetBestLSTMSymbolChoices.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tesserocr\tests\test_api.py", line 201, in test_LSTM_choices
    self.assertLessEqual(alternative[1], 2.0)
AssertionError: 3.621181011199951 not less than or equal to 2.0

This looks very similar to https://github.com/sirfz/tesserocr/pull/147#discussion_r342202823. The test passes when built with Tesseract 4.1.3. Does it also pass on Travis for Tesseract 5.x? I get a 404 when trying to access the build pipeline.

srinivas1746 commented 2 years ago

Hi @simonflueckiger, can you please provide instructions on how to build tesserocr with Tesseract 5?

simonflueckiger commented 2 years ago

@srinivas1746 you can have a look at my appveyor.yml from my repository tesserocr-windows_build.

srinivas1746 commented 2 years ago

Hi @simonflueckiger, thank you for the update. In my case I want to build on an Ubuntu system. Please let me know if there is a way to do that.

simonflueckiger commented 2 years ago

CMake and vcpkg work very similarly on Linux, so if you want to build Tesseract 5.x from source you can use most of appveyor.yml with minor adaptations. To build tesserocr on Linux please refer to this README.

srinivas1746 commented 2 years ago

@simonflueckiger thank you for your response. I was able to build tesserocr with Tesseract 5 using the instructions at this link: https://github.com/sirfz/tesserocr/blob/master/Windows.build.md#tesseract-build-and-installation

simonflueckiger commented 2 years ago

@sirfz has this been changed/fixed in the meantime?

sirfz commented 2 years ago

@simonflueckiger I saw the last comment about the working Windows build; now I realize the issue is unrelated, so I'm reopening it. Can anyone with knowledge of GetBestLSTMSymbolChoices chip in with the proper "fix" for this test case?

makra89 commented 2 years ago

I also see strange behavior when using tesserocr 2.5.2 together with Tesseract 5.0.1. Besides the OCR result itself, I also collect the LSTM symbol choices via GetBestLSTMSymbolChoices() to use them in a separate trie. This works fine with tesserocr 2.5.1 and Tesseract 4.

After the switch I get strange confidence values, although the symbol choices themselves seem to be okay. An example:

sirfz commented 2 years ago

I think we should change the test case: we shouldn't be testing Tesseract's correctness, only that tesserocr's API wrapping works.

It's not wrong to get different results (especially confidence scores) from different models or versions (as in this case with Tesseract 5 vs. 4), so we shouldn't expect a static value or range.
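A structure-only check along those lines might look like the sketch below. It assumes the nesting suggested by the failing test (words, then timesteps, then `(symbol, score)` alternatives); the helper name `lstm_choices_well_formed` is hypothetical and not part of tesserocr.

```python
from numbers import Real


def lstm_choices_well_formed(choices):
    """Return True if `choices` looks like words -> timesteps -> (symbol, score).

    Only the shape and types are checked, not the score values, so the
    result does not depend on the Tesseract version or model in use.
    The nesting is an assumption taken from the failing test above.
    """
    for word in choices:
        for timestep in word:
            for alternative in timestep:
                # Each alternative is expected to be a (symbol, score) pair.
                if not isinstance(alternative, (tuple, list)) or len(alternative) != 2:
                    return False
                symbol, score = alternative
                if not isinstance(symbol, str) or not isinstance(score, Real):
                    return False
    return True


# Hypothetical sample data shaped like the assumed return value.
sample_choices = [[[("T", 3.62), ("I", 0.4)], [("h", 0.1)]]]
print(lstm_choices_well_formed(sample_choices))
```

In the test case, the result of `self._api.GetBestLSTMSymbolChoices()` would be passed in place of `sample_choices`, and the `assertLessEqual(alternative[1], 2.0)` range check could then be dropped.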

makra89 commented 2 years ago

The scores didn't just change; I think they are wrong. If you look at the values I posted above, you can see that they increase as the choices become less likely. I first thought this could be a log-probability, but that doesn't make sense either.

Are you saying that this may be a bug in Tesseract itself?

ngeraks commented 1 year ago

Pop!_OS 22.04, Tesseract 5.2, tesserocr main branch

======================================================================
ERROR: test_init (tests.test_api.TestTessBaseApi)
Test Init calls with different lang and oem.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ni/tesserocr/tests/test_api.py", line 98, in test_init
    self._api.Init(oem=tesserocr.OEM.TESSERACT_ONLY)
  File "tesserocr.pyx", line 1485, in tesserocr.PyTessBaseAPI.Init
    self._init_api(cpath, clang, oem, NULL, 0, NULL, NULL, False, PSM_AUTO)
  File "tesserocr.pyx", line 1233, in tesserocr.PyTessBaseAPI._init_api
    raise RuntimeError('Failed to init API, possibly an invalid tessdata path: {}'.format(path))
RuntimeError: Failed to init API, possibly an invalid tessdata path: /home/ni/tesseract-main/share/tessdata/

======================================================================
FAIL: test_LSTM_choices (tests.test_api.TestTessBaseApi)
Test GetBestLSTMSymbolChoices.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ni/tesserocr/tests/test_api.py", line 201, in test_LSTM_choices
    self.assertLessEqual(alternative[1], 2.0)
AssertionError: 3.578106641769409 not less than or equal to 2.0

======================================================================
FAIL: test_detect_os (tests.test_api.TestTessBaseApi)
Test DetectOS and DetectOrientationScript (tesseract v4+).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ni/tesserocr/tests/test_api.py", line 235, in test_detect_os
    self.assertEqual(orientation["orientation"], 0)
AssertionError: 3 != 0

----------------------------------------------------------------------
Ran 24 tests in 7.739s

FAILED (failures=2, errors=1)
Test failed: <unittest.runner.TextTestResult run=24 errors=1 failures=2>
error: Test failed: <unittest.runner.TextTestResult run=24 errors=1 failures=2>

I reinstalled and tried Tesseract 4.1:

======================================================================
ERROR: test_LSTM_choices (tests.test_api.TestTessBaseApi)
Test GetBestLSTMSymbolChoices.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ni/tesserocr/tests/test_api.py", line 202, in test_LSTM_choices
    chosen_symbol = timestep[0][0]
IndexError: list index out of range

======================================================================
ERROR: test_init (tests.test_api.TestTessBaseApi)
Test Init calls with different lang and oem.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ni/tesserocr/tests/test_api.py", line 98, in test_init
    self._api.Init(oem=tesserocr.OEM.TESSERACT_ONLY)
  File "tesserocr.pyx", line 1485, in tesserocr.PyTessBaseAPI.Init
    self._init_api(cpath, clang, oem, NULL, 0, NULL, NULL, False, PSM_AUTO)
  File "tesserocr.pyx", line 1245, in tesserocr.PyTessBaseAPI._init_api
    raise RuntimeError('Failed to init API, possibly an invalid tessdata path: {}'.format(path))
RuntimeError: Failed to init API, possibly an invalid tessdata path: /home/ni/tesseract-4.1/share/tessdata/

======================================================================
FAIL: test_detect_os (tests.test_api.TestTessBaseApi)
Test DetectOS and DetectOrientationScript (tesseract v4+).
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ni/tesserocr/tests/test_api.py", line 235, in test_detect_os
    self.assertEqual(orientation["orientation"], 0)
AssertionError: 3 != 0

----------------------------------------------------------------------

makra89 commented 1 year ago

This seems to be an issue with a changed API in tesseract itself. See https://github.com/tesseract-ocr/tesseract/issues/3706

To convert the scores to confidences, you first have to fetch a variable: lstm_rating = api.GetDoubleVariable('lstm_rating_coefficient')

Then you get the scores from tesserocr: character_candidates = api.GetBestLSTMSymbolChoices()

Then you can convert each score: conf = 100 - lstm_rating * score
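Put together, the conversion can be sketched as a small helper. This is only an illustration: the function name `scores_to_confidences` is hypothetical, the nesting (words, then timesteps, then `(symbol, score)` pairs) is assumed from the test tracebacks in this thread, and the coefficient used in the sample is an arbitrary value, not a confirmed Tesseract default.

```python
def scores_to_confidences(choices, lstm_rating_coefficient):
    """Apply conf = 100 - lstm_rating_coefficient * score to every alternative.

    `choices` is assumed to be nested as words -> timesteps -> (symbol, score),
    which is how the test in this thread iterates over the result of
    GetBestLSTMSymbolChoices().
    """
    return [
        [
            [
                (symbol, 100.0 - lstm_rating_coefficient * score)
                for symbol, score in timestep
            ]
            for timestep in word
        ]
        for word in choices
    ]


# Hypothetical sample data; with tesserocr the inputs would instead come from
#   lstm_rating = api.GetDoubleVariable('lstm_rating_coefficient')
#   character_candidates = api.GetBestLSTMSymbolChoices()
character_candidates = [[[("T", 0.1), ("I", 3.6)], [("h", 0.2)]]]
lstm_rating = 5.0  # arbitrary illustrative value

for word in scores_to_confidences(character_candidates, lstm_rating):
    for timestep in word:
        print(timestep)
```

Under this formula a larger raw score maps to a lower confidence, which matches the observation earlier in the thread that the scores grow as the alternatives become less likely.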