qurator-spk / dinglehopper

An OCR evaluation tool
Apache License 2.0

dinglehopper keeps hanging and test errors #65

Closed whisere closed 2 years ago

whisere commented 2 years ago

Running dinglehopper gt txt and dinglehopper-line-dirs keeps hanging without any message, and pytest reports an error:

collected 62 items / 18 deselected / 44 selected                                                   

qurator/dinglehopper/tests/extracted_text_test.py .............                              [ 29%]
qurator/dinglehopper/tests/test_align.py .......F..                                          [ 52%]
qurator/dinglehopper/tests/test_character_error_rate.py ..                                   [ 56%]
qurator/dinglehopper/tests/test_edit_distance.py .                                           [ 59%]
qurator/dinglehopper/tests/test_editops.py ..                                                [ 63%]
qurator/dinglehopper/tests/test_ocr_files.py .............                                   [ 93%]
qurator/dinglehopper/tests/test_word_error_rate.py ...                                       [100%]

============================================= FAILURES =============================================
__________________________________ test_with_some_fake_ocr_errors __________________________________

    def test_with_some_fake_ocr_errors():
>       result = list(
            align(
                "Über die vielen Sorgen wegen desselben vergaß",
                "SomeJunk MoreJunk Übey die vielen Sorgen wegen AdditionalJunk deffelben vcrgab",
            )
        )

qurator/dinglehopper/tests/test_align.py:70: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

s1 = ['Ü', 'b', 'e', 'r', ' ', 'd', ...], s2 = ['S', 'o', 'm', 'e', 'J', 'u', ...]

    def seq_align(s1, s2):
        """Align general sequences."""
        s1 = list(s1)
        s2 = list(s2)
        ops = levenshtein_editops(s1, s2)
        i = 0
        j = 0

        while i < len(s1) or j < len(s2):
            o = None
            try:
                ot = ops[0]
                if ot[1] == i and ot[2] == j:
                    ops = ops[1:]
                    o = ot
            except IndexError:
                pass

            if o:
                if o[0] == "insert":
                    yield None, s2[j]
                    j += 1
                elif o[0] == "delete":
                    yield s1[i], None
                    i += 1
                elif o[0] == "replace":
                    yield s1[i], s2[j]
                    i += 1
                    j += 1
            else:
>               yield s1[i], s2[j]
E               IndexError: list index out of range

qurator/dinglehopper/align.py:42: IndexError
===================================== short test summary info ======================================
FAILED qurator/dinglehopper/tests/test_align.py::test_with_some_fake_ocr_errors - IndexError: lis...
=========================== 1 failed, 43 passed, 18 deselected in 30.24s ===========================

pytest also gets stuck here:

qurator/dinglehopper/tests/test_integ_table_extraction.py ..... [ 83%]
qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py ..

python version 3.9.0. Thanks.
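
The IndexError above is raised in the final else branch of seq_align, which assumes that whenever no edit operation applies at position (i, j), both sequences still have an element left. That only holds if the edit operations are consistent with the two inputs. The following is a minimal sketch of that dependency, not dinglehopper's code: reference_editops is a hypothetical pure-Python stand-in for the rapidfuzz call, seq_align is changed to take the operations as an argument, and the index increments after the final yield are assumed because the traceback cuts off there. Whether inconsistent operations caused this particular failure is not shown by the traceback itself.

```python
def reference_editops(s1, s2):
    """Levenshtein edit operations as (name, pos_in_s1, pos_in_s2) tuples.

    Hypothetical pure-Python helper for illustration only.
    """
    n, m = len(s1), len(s2)
    # dist[i][j] is the edit distance between s1[:i] and s2[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete from s1
                dist[i][j - 1] + 1,         # insert from s2
                dist[i - 1][j - 1] + cost,  # match or replace
            )

    # Walk back from the bottom-right corner and record the non-matches.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s1[i - 1] == s2[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, j))
            i -= 1
        else:
            ops.append(("insert", i, j - 1))
            j -= 1
    ops.reverse()
    return ops


def seq_align(s1, s2, editops):
    """Same walk as in the traceback, but with the edit operations passed in."""
    s1, s2 = list(s1), list(s2)
    ops = list(editops)
    i = j = 0
    while i < len(s1) or j < len(s2):
        o = None
        if ops and ops[0][1] == i and ops[0][2] == j:
            o = ops.pop(0)
        if o and o[0] == "insert":
            yield None, s2[j]
            j += 1
        elif o and o[0] == "delete":
            yield s1[i], None
            i += 1
        elif o and o[0] == "replace":
            yield s1[i], s2[j]
            i, j = i + 1, j + 1
        else:
            # No operation here: both elements are assumed to exist and match.
            yield s1[i], s2[j]
            i, j = i + 1, j + 1


ops = reference_editops("abc", "axcd")
print(list(seq_align("abc", "axcd", ops)))
```

If the trailing insert is dropped from ops (for example by passing ops[:-1]), the walk reaches the end of s1 while s2 still has an element left, and the else branch raises the same "IndexError: list index out of range" as in the test output.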

whisere commented 2 years ago

I also tried Python 3.10.0, 3.8.9 and 3.6.15; the result is the same on all of them.

mikegerber commented 2 years ago

I can't reproduce and tested a fresh install on Python 3.9. Could you please provide the full output of your pytest call? This would include more useful information e.g. the platform.

mikegerber commented 2 years ago

There is another problem with rapidfuzz that causes the tests to get stuck on qurator/dinglehopper/tests/test_integ_ocrd_cli.py (with the pytest process consuming 100% CPU). Downgrading with pip install rapidfuzz==1.9.1 works around it.

@maxbachmann Any idea how to debug this properly? To reproduce: use Python 3.9, install dinglehopper with rapidfuzz 2.0.4 (including both requirements*.txt) and run

% pytest -k test_integ_ocrd_cli.py          
==================================================================== test session starts ====================================================================
platform linux -- Python 3.9.10, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/mike/devel/dinglehopper-github, configfile: pytest.ini
plugins: flake8-1.0.7, cov-3.0.0, mypy-0.9.1
collected 62 items / 61 deselected / 1 selected                                                                                                             

qurator/dinglehopper/tests/test_integ_ocrd_cli.py .                                                                                                   [100%]

============================================================= 1 passed, 61 deselected in 1.14s ==============================================================
% pip install -U rapidfuzz
Requirement already satisfied: rapidfuzz in /home/mike/.virtualenvs/dinglehopper-github/lib64/python3.9/site-packages (1.9.1)
Collecting rapidfuzz
  Using cached rapidfuzz-2.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Installing collected packages: rapidfuzz
  Attempting uninstall: rapidfuzz
    Found existing installation: rapidfuzz 1.9.1
    Uninstalling rapidfuzz-1.9.1:
      Successfully uninstalled rapidfuzz-1.9.1
Successfully installed rapidfuzz-2.0.4
% pytest -k test_integ_ocrd_cli.py 
==================================================================== test session starts ====================================================================
platform linux -- Python 3.9.10, pytest-7.0.1, pluggy-1.0.0
rootdir: /home/mike/devel/dinglehopper-github, configfile: pytest.ini
plugins: flake8-1.0.7, cov-3.0.0, mypy-0.9.1
collected 62 items / 61 deselected / 1 selected                                                                                                             

qurator/dinglehopper/tests/test_integ_ocrd_cli.py ^Z
[1]  + 521125 suspended  pytest -k test_integ_ocrd_cli.py
% kill %1
[1]  + 521125 terminated  pytest -k test_integ_ocrd_cli.py

(First call using 1.9.1 runs fine, second using 2.0.4 hangs)
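
One generic way to find out where a process like this is spinning is Python's standard faulthandler module, which can dump the traceback of every thread after a timeout. The sketch below is only an illustration of that technique: the child program and its busy_loop function are hypothetical stand-ins for the hanging test, and nothing here calls rapidfuzz or dinglehopper.

```python
import subprocess
import sys
import textwrap

# Child program: arm a watchdog, then spin forever to mimic the hang.
CHILD = textwrap.dedent(
    """
    import faulthandler

    # After 1 second, dump all thread tracebacks to stderr and exit.
    faulthandler.dump_traceback_later(1, exit=True)

    def busy_loop():
        while True:  # stand-in for the call that never returns
            pass

    busy_loop()
    """
)


def run_with_watchdog():
    """Run the child and capture what the watchdog wrote to stderr."""
    return subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        timeout=30,  # safety net in case the watchdog itself fails
    )


result = run_with_watchdog()
print(result.stderr)
```

The dump names the Python frame that was active when the timeout fired, which narrows a hang down to a single call even when the loop itself runs inside an extension module. pytest offers the same mechanism through its faulthandler_timeout ini option.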

mikegerber commented 2 years ago

rapidfuzz had a new release 19 hours ago with a bugfix for the relevant code. Make sure you have rapidfuzz 2.0.4+!

% pip list | grep rapidfuzz
rapidfuzz              2.0.4

Sorry, that was wrong: 2.0.4 is the release that hangs, so downgrade with pip install rapidfuzz==1.9.1

maxbachmann commented 2 years ago

@mikegerber I can reproduce the issue and will look into it.

maxbachmann commented 2 years ago

I tracked down a small reproducing sample:

from rapidfuzz import string_metric

a = [2425437992138244740]
b = [-4086774168534702970]

string_metric.levenshtein_editops(a, b)

maxbachmann commented 2 years ago

Apparently I replaced uint64_t with int64_t in one too many places, which led to signed integer overflows inside the hashmap implementation. This is fixed by https://github.com/maxbachmann/rapidfuzz-cpp/commit/fadfb752d5f90e35e48d20ceabdde44b52c81c9e, which is included in v2.0.5.
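
To make the signed/unsigned distinction concrete, here is a small Python sketch of how the same 64-bit pattern behaves under the two interpretations, using the negative value from the reproducer. It is purely illustrative: the thread does not say which operation in the hashmap overflowed, and the helper names (as_uint64, as_int64, shifts_until_zero) are made up for this example. The right-shift loop is just one common pattern in hash probing where a negative value behaves differently from its unsigned counterpart.

```python
MASK64 = (1 << 64) - 1


def as_uint64(value):
    """Reinterpret a Python int as an unsigned 64-bit value."""
    return value & MASK64


def as_int64(value):
    """Reinterpret a Python int as a signed 64-bit (two's complement) value."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


def shifts_until_zero(value, step=5, limit=64):
    """Count right shifts by `step` until `value` is 0, or None if it never is."""
    for count in range(limit):
        if value == 0:
            return count
        # Python shifts negative ints arithmetically, like a signed shift in C.
        value >>= step
    return None


b = -4086774168534702970           # second value from the reproducer
unsigned_b = as_uint64(b)          # same bits, read as uint64_t

print(unsigned_b, as_int64(unsigned_b))
print(shifts_until_zero(unsigned_b))  # the unsigned view runs out of bits
print(shifts_until_zero(b))           # the signed view gets stuck at -1
```

A loop that waits for a repeatedly shifted value to reach zero terminates for the unsigned view but not for the signed one, which is the kind of difference that can turn a type change into a hang.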

whisere commented 2 years ago

dinglehopper gt ocr no longer hangs after running pip install rapidfuzz==2.0.5. Thanks!

pytest reported:

E   ModuleNotFoundError: No module named 'qurator.dinglehopper.tests'
Hint: make sure your test modules/packages have valid Python names.
===================================== short test summary info ======================================
ERROR qurator/dinglehopper/tests/extracted_text_test.py
ERROR qurator/dinglehopper/tests/test_align.py
ERROR qurator/dinglehopper/tests/test_character_error_rate.py
ERROR qurator/dinglehopper/tests/test_edit_distance.py
ERROR qurator/dinglehopper/tests/test_editops.py
ERROR qurator/dinglehopper/tests/test_integ_align.py
ERROR qurator/dinglehopper/tests/test_integ_character_error_rate_ocr.py
ERROR qurator/dinglehopper/tests/test_integ_cli_valid_json.py
ERROR qurator/dinglehopper/tests/test_integ_edit_distance_ocr.py
ERROR qurator/dinglehopper/tests/test_integ_ocrd_cli.py
ERROR qurator/dinglehopper/tests/test_integ_table_extraction.py
ERROR qurator/dinglehopper/tests/test_integ_word_error_rate_ocr.py
ERROR qurator/dinglehopper/tests/test_ocr_files.py
ERROR qurator/dinglehopper/tests/test_word_error_rate.py
!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 14 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!

This is under Python 3.9.0. I guess it doesn't matter since dinglehopper is running okay? Thanks.

mikegerber commented 2 years ago

> dinglehopper gt ocr no longer hangs after running pip install rapidfuzz==2.0.5. Thanks!

Great! I'm bumping the dependency to >= 2.0.5.
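
For anyone who wants to check an existing environment against that lower bound, a minimal sketch follows. The helper names (parse_release, is_fixed, installed_rapidfuzz_is_fixed) are hypothetical, the parser only understands plain dotted numeric versions, and importlib.metadata requires Python 3.8 or newer.

```python
from importlib import metadata

MINIMUM = (2, 0, 5)  # first rapidfuzz release with the fix, per this thread


def parse_release(version):
    """Turn '2.0.5' into (2, 0, 5), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_fixed(version, minimum=MINIMUM):
    """True if `version` satisfies the >= 2.0.5 requirement."""
    return parse_release(version) >= minimum


def installed_rapidfuzz_is_fixed():
    """Check the installed rapidfuzz, or return None if it is not installed."""
    try:
        return is_fixed(metadata.version("rapidfuzz"))
    except metadata.PackageNotFoundError:
        return None


print(installed_rapidfuzz_is_fixed())
```

Note that this mirrors the dependency bound rather than the set of broken releases: 1.9.1 worked in the runs above but still counts as too old here.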

> pytest reported: E ModuleNotFoundError: No module named 'qurator.dinglehopper.tests'

That's a different problem. Did you follow the instructions in README-DEV.txt?

mikegerber commented 2 years ago

> Apparently I replaced uint64_t with int64_t in one too many places, which led to signed integer overflows inside the hashmap implementation. This is fixed by maxbachmann/rapidfuzz-cpp@fadfb75, which is included in v2.0.5.

This update also fixes my tests, great!