madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

tests/pytesseract_test.py::test_image_to_data_common_output[dict] FAILED #406

Closed mandree closed 2 years ago

mandree commented 2 years ago

Hello namesake,

the self-test suite fails on FreeBSD for pytesseract 0.3.8 and 0.3.9 with various Python 3.x versions:

>           assert 0 <= confidence_values[-1] <= 100
E           TypeError: '<=' not supported between instances of 'int' and 'str'

Full story:

GLOB sdist-make: /usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9/setup.py
py39 create: /usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9/.tox/py39
py39 installdeps: numpy, pandas, -r/usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9/requirements-dev.txt
py39 inst: /usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9/.tox/.tmp/package/1/pytesseract-0.3.9.zip
py39 installed: acme==1.22.0,affine==2.3.0,alabaster==0.7.12,appdirs==1.4.4,asn1crypto==1.4.0,astroid==2.9.0,atomicwrites==1.4.0,attrs==21.3.0,Babel==2.9.1,black==21.12b0,blinker==1.4,boto==2.49.0,Bottleneck==1.3.2,breathe==4.31.0,certbot==1.22.0,certifi==2021.10.8,cffi==1.15.0,cfgv==3.3.1,chardet==4.0.0,chrome-gnome-shell==0.0.0,click==8.0.3,click-plugins==1.1.1,cligj==0.7.2,cloudpickle==1.3.0,colorama==0.4.4,commonmark==0.9.1,ConfigArgParse==1.5.3,configobj==5.0.6,coverage==4.5.4,cryptography==3.3.2,cssselect==1.1.0,cycler==0.11.0,Cython==0.29.26,decorator==5.1.1,deprecation==2.1.0,distlib==0.3.4,distro==1.6.0,dnspython==2.1.0,docutils==0.17.1,entrypoints==0.3,evdev==1.4.0,eyed3==0.9.6,fastest-pkg==0.2.0,filelock==3.4.2,filetype==1.0.7,Fiona==1.8.20,Flask==2.0.2,Flask-WTF==0.15.1,freezegun==1.0.0,future==0.18.2,GDAL==3.3.3,geojson==2.3.0,geopandas==0.10.2,httplib2==0.20.2,icdiff==2.0.4,identify==2.4.5,idna==2.10,imageio==2.9.0,imageio-ffmpeg==0.4.5,imagesize==1.3.0,importlib-metadata==4.8.1,importlib-resources==5.4.0,incremental==21.3.0,iniconfig==0.0.0,ipython-genutils==0.2.0,iso-639==0.4.5,iso3166==1.0.1,iso8601==0.1.16,isodate==0.6.1,isort==5.10.1,itsdangerous==2.0.1,jedi==0.18.0,jeepney==0.7.1,Jinja2==3.0.1,joblib==1.1.0,josepy==1.11.0,jq==1.2.1,jsonpatch==1.21,jsonpointer==2.0,jsonschema==4.2.1,jupyter-core==4.9.1,keyring==18.0.1,keyrings.alt==3.1.1,kiwisolver==1.3.2,lazy-object-proxy==1.7.1,lensfun==0.3.95,libxml2-python==2.9.12,lxml==4.7.1,Markdown==3.3.4,MarkupSafe==2.0.1,matplotlib==3.4.3,matplotlib-scalebar==0.8.0,mccabe==0.6.1,meson==0.60.3,minidb==2.0.5,mock==3.0.5,mongoengine==0.20.0,more-itertools==8.12.0,munch==2.5.0,mutagen==1.45.1,mypy-extensions==0.4.3,nbformat==5.1.3,networkx==2.6.3,nltk==3.4.1,nodeenv==1.6.0,nose==1.3.7,numexpr==2.8.1,numpy==1.20.3,oauthlib==1.1.2,olefile==0.46,OWSLib==0.25.0,packaging==21.3,pafy==0.5.5,pandas==1.2.5,parsedatetime==2.6,parso==0.8.3,pathspec==0.9.0,pbr==5.5.0,pdftotext==2.2.2,Pillow==8.2.0,pkginfo==1.8.2,platformdirs==2.4.1,plotly==4.14.3,pluggy==0.13.1,ply==3.11,pre-commit==2.17.0,psutil==5.8.0,psycopg2==2.9.2,pwquality==1.4.4,py==1.9.0,pybind11==2.9.0,pycairo==1.18.1,pycodestyle==2.8.0,pycountry==18.5.26,pycparser==2.21,pycryptodome==3.12.0,pycurl==7.44.1,pydot==1.4.2,Pygments==2.7.2,PyGObject==3.38.0,pygraphviz==1.6,pyjq==2.4.0,PyJWT==2.3.0,pylint==2.12.2,pymongo==3.12.0,pyOpenSSL==20.0.1,pypa-docs-theme==0.0.1,pyparsing==3.0.6,pypng==0.0.17,pyproj==3.2.1,PyQRCode==1.2.1,PyQt-builder==1.9.1,PyQt5-sip==12.9.0,pyRFC3339==1.1,pyrsgis==0.4.1,pyrsistent==0.14.11,pyserial==3.5,PySocks==1.7.1,PyStemmer==2.0.1,pytesseract==0.3.9,pytest==4.6.11,python-dateutil==2.8.1,python-docs-theme==2018.2,python-magic==0.4.15,pytz==2021.3,pyudev==0.22.0,PyWavelets==1.2.0,pyxdg==0.27,PyYAML==5.4.1,QScintilla==2.13.0,rasterio==1.2.10,recommonmark==0.5.0,regex==2020.7.14,requests==2.25.1,requests-mock==1.9.3,requests-toolbelt==0.9.1,retrying==1.3.3,scikit-image==0.19.1,scikit-learn==1.0.2,scikit-sparse==0.4.6,scipy==1.7.1,SCons==4.2.0,SecretStorage==3.3.1,setuptools-scm==6.3.2,Shapely==1.8.0,simplejson==3.17.6,sip==5.5.0,six==1.16.0,snowballstemmer==2.2.0,snuggs==1.4.7,Sphinx==4.3.1,sphinx-markdown-tables==0.0.15,sphinx-rtd-theme==1.0.0,sphinxcontrib-applehelp==1.0.2,sphinxcontrib-devhelp==1.0.2,sphinxcontrib-htmlhelp==2.0.0,sphinxcontrib-jsmath==1.0.1,sphinxcontrib-qthelp==1.0.3,sphinxcontrib-serializinghtml==1.1.5,sphinxcontrib-websupport==1.2.4,sqlite3==0.0.0,streamlink==2.1.2,termcolor==1.1.0,tifffile==2021.8.30,Tkinter==0.0.0,toml==0.10.2,tomli==1.2.3,tornado==6.1,towncrier==19.2.0,tox==3.12.1,tqdm==4.62.3,traitlets==5.1.1,typed-ast==1.5.1,typing-extensions==3.10.0.2,urllib3==1.26.7,urlwatch==2.24,urwid==2.1.2,urwid-readline==0.13,vcversioner==2.16.0.0,virtualenv==20.13.0,wcwidth==0.1.8,webencodings==0.5.1,websocket-client==0.58.0,websockets==10.1,Werkzeug==2.0.2,wrapt==1.13.3,WTForms==2.1,wxPython==4.0.7,xlrd==2.0.1,xmltodict==0.12.0,ydiff==1.2,zipp==3.4.0,zope.component==4.2.2,zope.event==4.1.0,zope.interface==5.3.0
py39 run-test-pre: PYTHONHASHSEED='2942057313'
py39 run-test: commands[0] | python -bb -m pytest tests
======================================================= test session starts ========================================================
platform freebsd13 -- Python 3.8.12, pytest-4.6.11, py-1.9.0, pluggy-0.13.1 -- /usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9/.tox/py39/bin/python
cachedir: .tox/py39/.pytest_cache
rootdir: /usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9, inifile: tox.ini
plugins: requests-mock-1.9.3
collected 47 items                                                                                                                 

tests/pytesseract_test.py::test_image_to_string_with_image_type[jpg] PASSED                                                  [  2%]
tests/pytesseract_test.py::test_image_to_string_with_image_type[pgm] PASSED                                                  [  4%]
tests/pytesseract_test.py::test_image_to_string_with_image_type[png] PASSED                                                  [  6%]
tests/pytesseract_test.py::test_image_to_string_with_image_type[ppm] PASSED                                                  [  8%]
tests/pytesseract_test.py::test_image_to_string_with_image_type[tiff] PASSED                                                 [ 10%]
tests/pytesseract_test.py::test_image_to_string_with_image_type[gif] PASSED                                                  [ 12%]
tests/pytesseract_test.py::test_image_to_string_with_image_type[webp] PASSED                                                 [ 14%]
tests/pytesseract_test.py::test_image_to_string_with_args_type[path_str] PASSED                                              [ 17%]
tests/pytesseract_test.py::test_image_to_string_with_args_type[image_object] PASSED                                          [ 19%]
tests/pytesseract_test.py::test_image_to_string_with_numpy_array PASSED                                                      [ 21%]
tests/pytesseract_test.py::test_image_to_string_european PASSED                                                              [ 23%]
tests/pytesseract_test.py::test_image_to_string_batch PASSED                                                                 [ 25%]
tests/pytesseract_test.py::test_image_to_string_multiprocessing PASSED                                                       [ 27%]
tests/pytesseract_test.py::test_image_to_string_timeout PASSED                                                               [ 29%]
tests/pytesseract_test.py::test_la_image_to_string PASSED                                                                    [ 31%]
tests/pytesseract_test.py::test_image_to_boxes PASSED                                                                        [ 34%]
tests/pytesseract_test.py::test_image_to_osd PASSED                                                                          [ 36%]
tests/pytesseract_test.py::test_image_to_pdf_or_hocr[pdf] PASSED                                                             [ 38%]
tests/pytesseract_test.py::test_image_to_pdf_or_hocr[hocr] PASSED                                                            [ 40%]
tests/pytesseract_test.py::test_image_to_alto_xml PASSED                                                                     [ 42%]
tests/pytesseract_test.py::test_image_to_alto_xml_support SKIPPED                                                            [ 44%]
tests/pytesseract_test.py::test_image_to_data__pandas_support SKIPPED                                                        [ 46%]
tests/pytesseract_test.py::test_image_to_data__pandas_output PASSED                                                          [ 48%]
tests/pytesseract_test.py::test_image_to_data_common_output[bytes] PASSED                                                    [ 51%]
tests/pytesseract_test.py::test_image_to_data_common_output[dict] FAILED                                                     [ 53%]
tests/pytesseract_test.py::test_image_to_data_common_output[string] PASSED                                                   [ 55%]
tests/pytesseract_test.py::test_wrong_prepare_type[int] PASSED                                                               [ 57%]
tests/pytesseract_test.py::test_wrong_prepare_type[float] PASSED                                                             [ 59%]
tests/pytesseract_test.py::test_wrong_prepare_type[none] PASSED                                                              [ 61%]
tests/pytesseract_test.py::test_wrong_tesseract_cmd[executable_name] PASSED                                                  [ 63%]
tests/pytesseract_test.py::test_wrong_tesseract_cmd[absolute_path] PASSED                                                    [ 65%]
tests/pytesseract_test.py::test_main_not_found_cases PASSED                                                                  [ 68%]
tests/pytesseract_test.py::test_proper_oserror_exception_handling[permission_error_path] PASSED                              [ 70%]
tests/pytesseract_test.py::test_proper_oserror_exception_handling[invalid_path] PASSED                                       [ 72%]
tests/pytesseract_test.py::test_get_languages[default_empty_config] PASSED                                                   [ 74%]
tests/pytesseract_test.py::test_get_languages[custom_tessdata_dir] PASSED                                                    [ 76%]
tests/pytesseract_test.py::test_get_languages[incorrect_tessdata_dir] PASSED                                                 [ 78%]
tests/pytesseract_test.py::test_get_languages[invalid_tessdata_dir] PASSED                                                   [ 80%]
tests/pytesseract_test.py::test_get_languages[invalid_config] PASSED                                                         [ 82%]
tests/pytesseract_test.py::test_file_to_dict[input_args0-expected0] PASSED                                                   [ 85%]
tests/pytesseract_test.py::test_file_to_dict[input_args1-expected1] PASSED                                                   [ 87%]
tests/pytesseract_test.py::test_file_to_dict[input_args2-expected2] PASSED                                                   [ 89%]
tests/pytesseract_test.py::test_get_tesseract_version[3.5.0-3.5.0] PASSED                                                    [ 91%]
tests/pytesseract_test.py::test_get_tesseract_version[4.1-a8s6f8d3f-4.1] PASSED                                              [ 93%]
tests/pytesseract_test.py::test_get_tesseract_version[v4.0.0-beta1.9-4.0.0] PASSED                                           [ 95%]
tests/pytesseract_test.py::test_get_tesseract_version_invalid[-Invalid tesseract version: ""] PASSED                         [ 97%]
tests/pytesseract_test.py::test_get_tesseract_version_invalid[invalid-Invalid tesseract version: "invalid"] PASSED           [100%]

============================================================= FAILURES =============================================================
______________________________________________ test_image_to_data_common_output[dict] ______________________________________________

test_file_small = '/usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9/tests/data/test-small.jpg', output = 'dict'

    @pytest.mark.skipif(
        TESSERACT_VERSION[:2] < (3, 5),
        reason='requires tesseract >= 3.05',
    )
    @pytest.mark.parametrize(
        'output',
        [Output.BYTES, Output.DICT, Output.STRING],
        ids=['bytes', 'dict', 'string'],
    )
    def test_image_to_data_common_output(test_file_small, output):
        """Test and compare the type of the result."""
        result = image_to_data(test_file_small, output_type=output)
        expected_dict_result = {
            'level': [1, 2, 3, 4, 5],
            'page_num': [1, 1, 1, 1, 1],
            'block_num': [0, 1, 1, 1, 1],
            'par_num': [0, 0, 1, 1, 1],
            'line_num': [0, 0, 0, 1, 1],
            'word_num': [0, 0, 0, 0, 1],
            'left': [0, 11, 11, 11, 11],
            'top': [0, 11, 11, 11, 11],
            'width': [79, 60, 60, 60, 60],
            'height': [47, 24, 24, 24, 24],
            # 'conf': ['-1', '-1', '-1', '-1', 96],
            'text': ['', '', '', '', 'This'],
        }

        if output is Output.BYTES:
            assert isinstance(result, bytes)

        elif output is Output.DICT:
            confidence_values = result.pop('conf', None)
            assert confidence_values is not None
>           assert 0 <= confidence_values[-1] <= 100
E           TypeError: '<=' not supported between instances of 'int' and 'str'

tests/pytesseract_test.py:318: TypeError
========================================= 1 failed, 44 passed, 2 skipped in 18.43 seconds ==========================================
ERROR: InvocationError for command /usr/ports/graphics/py-pytesseract/work-py39/pytesseract-0.3.9/.tox/py39/bin/python -bb -m pytest tests (exited with code 1)
_____________________________________________________________ summary ______________________________________________________________
ERROR:   py39: commands failed

bozhodimitrov commented 2 years ago

This is so weird actually. How the hell do those tests pass under Linux then? Clearly the conversion breaks somewhere for the FreeBSD setup.

PS: I will have to create a dev env for myself, so this might take some time. @mandree, can you let me know what the string value of confidence_values[-1] is for the failing test? I suspect that it is some negative or bogus value.

PS2: This should be resolved now if the value was just a negative number.

mandree commented 2 years ago

The patch in 06e7f80 is insufficient.

confidence_values is [-1, -1, -1, -1, '92.865524'], and that explains it. Speaking for Python 3.8 and tesseract 5.0.1: you cannot construct an int from the string '92.865524' because it does not represent an integer. You can, however, construct a float and then round() or int() it; the latter truncates.
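
This is easy to check in an interpreter. A minimal sketch using only built-ins, with the sample string taken from the failing value above (the variable names are just for illustration):

```python
raw = '92.865524'  # last 'conf' entry as reported above

# int() rejects a string that contains a decimal point.
try:
    int(raw)
    int_parse_failed = False
except ValueError:
    int_parse_failed = True

truncated = int(float(raw))  # drops the fractional part
rounded = round(float(raw))  # rounds to the nearest integer

# Comparing an int with the raw string is what the test assertion tripped over.
try:
    0 <= raw <= 100
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(int_parse_failed, truncated, rounded, comparison_failed)
```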

So this is what I am trying: in the try: block, I have changed val = int(...) to val = int(float(...)), and then the test succeeds on Python 3.7...3.9. I cannot currently test 3.10, this needs more work on FreeBSD since it sees distutils stuff and barfs.

...
    for i, head in enumerate(header):
        result[head] = list()
        for row in rows:
            if len(row) <= i:
                continue

            if i != str_col_idx:
                try:
                    # was int(row[i]); going through float() lets '92.865524' parse
                    val = int(float(row[i]))
                except ValueError:
                    val = row[i]
            else:
                val = row[i]

            result[head].append(val)

    return result
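
For anyone who wants to try the loop without a tesseract install, here is a self-contained sketch. The function name `rows_to_dict`, its signature, and the sample rows are made up for illustration and are not pytesseract's API; only the loop body mirrors the excerpt.

```python
def rows_to_dict(header, rows, str_col_idx):
    """Hypothetical stand-in for the loop above; not pytesseract's API."""
    result = {}
    for i, head in enumerate(header):
        result[head] = []
        for row in rows:
            if len(row) <= i:
                continue  # short row: no value for this column

            if i != str_col_idx:
                try:
                    # float() first, so '92.865524' parses; int() then truncates
                    val = int(float(row[i]))
                except ValueError:
                    val = row[i]
            else:
                val = row[i]  # text column stays a string

            result[head].append(val)
    return result


# Made-up sample rows shaped like tesseract's TSV output, text column last.
header = ['level', 'conf', 'text']
rows = [['1', '-1', ''], ['5', '92.865524', 'This']]
parsed = rows_to_dict(header, rows, str_col_idx=2)
print(parsed)
```

With these inputs the last 'conf' value comes out as the int 92, so a range check like the one in the failing test no longer raises.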

bozhodimitrov commented 2 years ago

> I cannot currently test 3.10, this needs more work on FreeBSD since it sees distutils stuff and barfs.

distutils is axed already in the latest release.

For the float addition -- I will add this as well, because it doesn't harm the existing behavior, but it might need a change in the future.
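
What such a future change would look like is not stated here. One possibility, sketched purely as an illustration and under the assumption that callers would want to keep the fractional confidence, is to try int first and fall back to float. `parse_cell` is a hypothetical helper name, not part of pytesseract.

```python
def parse_cell(value):
    """Hypothetical helper: int if possible, else float, else the raw string."""
    for convert in (int, float):
        try:
            return convert(value)
        except ValueError:
            continue
    return value


cells = ['-1', '92.865524', 'This']
parsed_cells = [parse_cell(c) for c in cells]
print(parsed_cells)
```

The trade-off is that the column would then mix int and float values, which existing callers may not expect.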

mandree commented 2 years ago

> I cannot currently test 3.10, this needs more work on FreeBSD since it sees distutils stuff and barfs.

> distutils is axed already in the latest release.

Yup, I'd seen that but not investigated in more detail, since it looked like it might be infrastructure work. Also, for successful tests on Python 3.10 I think we should have Pandas/NumPy as well, and that is blocked on FreeBSD not providing NumPy 1.22 yet (the first version to formally support Python 3.10).

> For the float addition -- I will add this as well, because it doesn't harm the existing behavior, but it might need a change in the future.

Thanks.