Open marijani101 opened 4 years ago
The hocr-tools are working with Python 3 already. We support Python 2 and 3 together.
Maybe some compatibility constructs can be removed from the code as soon as Python 2 is gone, but for the moment I think there is nothing to be done. @marijani101, did you notice problems with Python 3 which would require an action now? I only found that the README could be updated to mention Python 3 as well. Maybe you want to send a pull request for that?
The Python 2 package names in the README.md should be replaced by Python 3 package names. @marijani101, can you send a pull request?
It seems that while the setup file still advertises Python 2, https://github.com/ocropus/hocr-tools/commit/269d63a816dc801b77e549b9c3b3bde708912286 has effectively dropped that support. This contradicts the following code, as f-strings are not available before Python 3.6: https://github.com/ocropus/hocr-tools/blob/0ad95b3606229c8a6895a3a6e782ff88d9db1d8d/setup.py#L24-L26
Right, thanks for reporting this. Do you want to send a pull request which removes all old entries? All Python versions before 3.7 are unsupported.
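For whoever picks up that pull request, here is a rough sketch of how the outdated entries could be filtered out. The classifier list is illustrative and not copied from the project's setup.py, and `is_supported` is a hypothetical helper. Only the 3.7 minimum comes from this thread.

```python
import re

MIN_PYTHON = (3, 7)  # from the thread: versions before 3.7 are unsupported

# Illustrative classifier list, NOT the actual content of setup.py.
classifiers = [
    "Programming Language :: Python :: 2",
    "Programming Language :: Python :: 2.7",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.5",
    "Programming Language :: Python :: 3.7",
    "Topic :: Utilities",
]

_VERSION = re.compile(r"^Programming Language :: Python :: (\d+)(?:\.(\d+))?$")


def is_supported(classifier, minimum=MIN_PYTHON):
    """Return False only for Python version classifiers below the minimum."""
    match = _VERSION.match(classifier)
    if match is None:
        return True  # not a Python version classifier, keep it
    major = int(match.group(1))
    if match.group(2) is None:
        return major >= minimum[0]  # bare major entry such as ":: 3"
    return (major, int(match.group(2))) >= minimum


supported = [c for c in classifiers if is_supported(c)]
python_requires = ">={}.{}".format(*MIN_PYTHON)
```

The resulting `python_requires` string could then be passed to setuptools' `setup()` so that pip refuses to install on older interpreters.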
I just did some more tests regarding version support and stumbled upon some more stuff which probably needs some attention (and is more or less related to Python 3 support):
CI fails due to beautifulsoup4 never being installed, but apparently being called by the tests (according to the code, the package is optional for regular installations):
# ocrx_word argument
not ok 17 - Failed: hocr-extract-images -U -p word-%03d.png -b ../testdata -e ocrx_word ../testdata/tess.hocr
---
diag: |
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.7.15/x64/bin/hocr-extract-images", line 79, in <module>
from bs4 import UnicodeDammit
ModuleNotFoundError: No module named 'bs4'
...
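Since bs4 is meant to be optional, one way the script could tolerate its absence is a guarded import. This is only a minimal sketch: `decode_hocr` is a hypothetical helper, not something in hocr-tools, and falling back to UTF-8 is an assumption about the input files.

```python
def decode_hocr(raw):
    """Decode raw hOCR bytes to text, using bs4 only if it is installed."""
    try:
        from bs4 import UnicodeDammit  # optional dependency
    except ImportError:
        # Assumption: without bs4, treat the file as UTF-8.
        return raw.decode("utf-8", errors="replace")
    # bs4 is available, so let it detect the encoding.
    return UnicodeDammit(raw).unicode_markup
```

The other option is to install beautifulsoup4 in the CI job, which keeps the script unchanged.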
hocr-extract-images also fails with the following error (fixed by using doc = html.document_fromstring(content.encode('utf-8'), parser=parser) instead):
# ocrx_word argument
not ok 17 - Failed: hocr-extract-images -U -p word-%03d.png -b ../testdata -e ocrx_word ../testdata/tess.hocr
---
diag: |
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.7.15/x64/bin/hocr-extract-images", line 83, in <module>
doc = html.document_fromstring(content, parser=parser)
File "/opt/hostedtoolcache/Python/3.7.15/x64/lib/python3.7/site-packages/lxml/html/__init__.py", line 759, in document_fromstring
value = etree.fromstring(html, parser, **kw)
File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
File "src/lxml/parser.pxi", line 1911, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
...
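The ValueError above is raised because lxml refuses a Python str that carries an encoding declaration, so handing it bytes avoids the problem. A small sketch of that fix follows. `to_parser_input` and `parse_hocr` are hypothetical names, and the `html.HTMLParser()` instance is an assumption, since the traceback does not show how the script builds its parser.

```python
def to_parser_input(content):
    """Return bytes so lxml accepts documents with an encoding declaration."""
    if isinstance(content, str):
        # Assumption: the declared encoding is UTF-8 or compatible with it.
        return content.encode("utf-8")
    return content


def parse_hocr(content):
    """Parse hOCR content (str or bytes) into an lxml HTML document."""
    from lxml import html  # third-party, imported lazily

    parser = html.HTMLParser()  # assumed parser setup
    return html.document_fromstring(to_parser_input(content), parser=parser)
```

If a file declares a different encoding, encoding to UTF-8 would not match the declaration, so reading the file in binary mode and passing the bytes through unchanged is probably the safer route.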
Any news on this issue?
As Python 2 is coming to an end, wouldn't it be better to migrate to Python 3?