tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

UnicodeEncodeError: 'ascii' codec can't encode character in Python3 #26

Closed beerjamin closed 6 years ago

beerjamin commented 6 years ago

I have seen a closed issue about this problem before. As suggested, I switched to Python 3, but the problem still persists. Here is the output log:

python generate_line_box.py -i "data/train/alexis_ruhe01_1852_0018_022.tif" -t "data/train/alexis_ruhe01_1852_0018_022.gt.txt" > "data/train/alexis_ruhe01_1852_0018_022.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 40, in <module>
    print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017f' in position 0: ordinal not in range(128)
Makefile:110: recipe for target 'data/train/alexis_ruhe01_1852_0018_022.box' failed
make: *** [data/train/alexis_ruhe01_1852_0018_022.box] Error 1

kba commented 6 years ago

The issue was https://github.com/OCR-D/ocrd-train/issues/18.

My guess is this happens because Python falls back to ascii for terminal output if it cannot determine the terminal to be UTF-8 capable.
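To illustrate that guess, here is a minimal sketch of what happens when the same box line is written through an ASCII text stream versus a UTF-8 one. The helper write_with_encoding and the sample width/height are made up for the demonstration and are not part of generate_line_box.py; the in-memory stream only stands in for a stdout whose encoding was chosen as ASCII.

```python
import io

def write_with_encoding(text, encoding):
    """Write text through a text stream with the given encoding; return the raw bytes.

    Hypothetical helper: the BytesIO object stands in for stdout.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # keep the wrapper from closing the buffer on cleanup
    return data

# Same shape as the line printed by the script; U+017F is the long s from the traceback.
line = "\u017f 0 0 10 20 0"

try:
    write_with_encoding(line, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    # Same error class as in the log: the character has no ASCII representation.
    ascii_failed = True

# With UTF-8 the same write succeeds.
utf8_bytes = write_with_encoding(line, "utf-8")
```

With an ASCII stream the write raises UnicodeEncodeError, while the UTF-8 stream accepts the same text.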

Try running export PYTHONIOENCODING=utf8 in your shell before you execute the script or Makefile. Python should pick it up and stop trying to encode Unicode characters as ASCII.
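If setting the environment variable is not an option, another possibility is to make the script independent of the stdout encoding by encoding the line itself and writing bytes. This is only a sketch of that idea under Python 3, not what generate_line_box.py actually does; emit_box_line is a hypothetical helper name.

```python
import io

def emit_box_line(binary_out, char, width, height):
    """Write one box line as UTF-8 bytes to a binary stream.

    Hypothetical helper: it encodes explicitly, so the result does not
    depend on the locale or on PYTHONIOENCODING.
    """
    line = "%s %d %d %d %d 0\n" % (char, 0, 0, width, height)
    binary_out.write(line.encode("utf-8"))

# Demonstration against an in-memory buffer instead of real stdout.
buf = io.BytesIO()
emit_box_line(buf, "\u017f", 10, 20)
box_bytes = buf.getvalue()
```

In a real script the first argument would be sys.stdout.buffer, which is the binary layer underneath sys.stdout in Python 3.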

beerjamin commented 6 years ago

That fixed it! Thanks a lot

iraexfl commented 5 years ago

Thank you!