tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Make Training Errors with Sample Data #50

Closed MrDrProfK closed 5 years ago

MrDrProfK commented 5 years ago

OS: Ubuntu 18.04

What I typed in Terminal: make training

What I received:

python generate_line_box.py -i "data/ground-truth/andreas_fenitschka_1898_0085_025.tif" -t "data/ground-truth/andreas_fenitschka_1898_0085_025.gt.txt" > "data/ground-truth/andreas_fenitschka_1898_0085_025.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 41, in <module>
    print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017f' in position 0: ordinal not in range(128)
Makefile:111: recipe for target 'data/ground-truth/andreas_fenitschka_1898_0085_025.box' failed
make: *** [data/ground-truth/andreas_fenitschka_1898_0085_025.box] Error 1

kba commented 5 years ago

Quick fix: Try running it with python3, i.e. replace 'python' with 'python3'.

Please provide test data if problem persists.

MrDrProfK commented 5 years ago

Issue persists after running the "generate_line_box.py" script with a python3 shebang.

Test data is the zipped data from the master branch (unzipped and copied into the directory specified in the README, of course).

MrDrProfK commented 5 years ago

Technically, the error is slightly different with Python 3 (currently running v3.6.7).

python generate_line_box.py -i "data/ground-truth/andreas_fenitschka_1898_0165_024.tif" -t "data/ground-truth/andreas_fenitschka_1898_0165_024.gt.txt" > "data/ground-truth/andreas_fenitschka_1898_0165_024.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 41, in <module>
    print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201e' in position 0: ordinal not in range(128)
Makefile:111: recipe for target 'data/ground-truth/andreas_fenitschka_1898_0165_024.box' failed
make: *** [data/ground-truth/andreas_fenitschka_1898_0165_024.box] Error 1

trinitybest commented 5 years ago

I think your issue comes from the print call. Because stdout is using the 'ascii' codec, print cannot write that Unicode character (u'\u017f' or u'\u201e' in your case) to it.
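To illustrate the point above, here is a minimal Python 3 sketch of what the codec is being asked to do. The helper name try_encode is made up for this example and is not part of generate_line_box.py. It only shows that the two characters from the tracebacks have no ASCII representation but encode fine as UTF-8.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


LONG_S = "\u017f"      # LATIN SMALL LETTER LONG S, from the first traceback
LOW_QUOTE = "\u201e"   # DOUBLE LOW-9 QUOTATION MARK, from the second traceback

for char in (LONG_S, LOW_QUOTE):
    # ASCII covers only code points 0..127, so both characters are rejected.
    print(repr(char), "ascii:", try_encode(char, "ascii"))
    # UTF-8 can represent every Unicode code point.
    print(repr(char), "utf-8:", try_encode(char, "utf-8"))
```

print goes through the same encoding step with whatever codec stdout was opened with, which is why the failure only shows up when stdout is ASCII.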

Your issue should be the same as: https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20

Please try the following solution:

    import sys
    reload(sys)
    sys.setdefaultencoding('utf8')

Hope this will help:)

MrDrProfK commented 5 years ago

@trinitybest Thanks for the suggestion. Unfortunately, I received the following error when trying to use the reload command:

NameError: name 'reload' is not defined
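For context: in Python 3, reload is no longer a builtin and sys.setdefaultencoding does not exist, so that workaround only applies to Python 2. One possible Python 3 alternative, sketched below under assumptions, is to wrap the binary stdout stream in a UTF-8 text writer. The names utf8_writer and box_line are hypothetical and do not come from generate_line_box.py; only the format string is taken from the traceback.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text writer that always encodes as UTF-8."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        newline="\n",          # write "\n" unchanged on every platform
        line_buffering=True,   # flush after each line
    )


def box_line(char, width, height):
    """Format one box line the same way the print call in the traceback does."""
    return "%s %d %d %d %d 0" % (char, 0, 0, width, height)


# Possible use in a script (assumes sys.stdout has a .buffer attribute,
# which holds for a normal interpreter writing to a terminal, pipe, or file):
#
#     import sys
#     out = utf8_writer(sys.stdout.buffer)
#     print(box_line("\u017f", 100, 40), file=out)
```

The width and height values in the commented usage are placeholders. This approach works on Python 3.6, which the reporter was running; sys.stdout.reconfigure(encoding="utf-8") is a shorter option but requires Python 3.7 or later.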

ameera3 commented 5 years ago

https://github.com/OCR-D/ocrd-train/issues/18

ameera3 commented 5 years ago

https://github.com/OCR-D/ocrd-train/issues/26

wrznr commented 5 years ago

@ameera3 The issues you are referring to do not exist...

torhhu commented 5 years ago

Running export PYTHONIOENCODING=utf-8 before make training fixed this for me. I still get a warning about invalid resolution that I would like to fix.
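A small sketch of how one might check what PYTHONIOENCODING does: it starts a child Python interpreter with the variable set and asks it which encoding its stdout uses. The function name child_stdout_encoding is made up for this check and is not part of tesstrain.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Report sys.stdout.encoding of a child interpreter.

    If io_encoding is given, it is passed to the child via PYTHONIOENCODING;
    otherwise the variable is removed so the child falls back to its default.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding

    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Encoding names are plain ASCII, so decoding as ASCII is safe here.
    return result.stdout.decode("ascii").strip()


print("with PYTHONIOENCODING=utf-8:", child_stdout_encoding("utf-8"))
print("without PYTHONIOENCODING:  ", child_stdout_encoding())
```

The second line of output depends on the locale and the Python version, so it may or may not show an ASCII-based codec on a given machine. Exporting the variable in the shell has the same effect on the python processes that make starts, since they inherit the environment.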

wrznr commented 5 years ago

This is a tesseract (maybe even a leptonica) warning. Can't do very much to fix this. Sry.

ameera3 commented 5 years ago

The issues do exist. The links are broken, I think, because the issues are closed. Go to the closed issues and look for

unicode error python #18

UnicodeEncodeError: 'ascii' codec can't encode character in Python3 #26

to find the solutions.