tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

'ascii' codec can't encode character u'\u0645' in position 0: ordinal not in range(128) #63

Closed by mshakirDr 5 years ago

mshakirDr commented 5 years ago

I am trying to train on Urdu data (using the Noori Nastaleeq font), but make training urd fails with the following:

python generate_line_box.py -i "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.tif" -t "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.gt.txt" > "data/ground-truth/longJameel_Noori_NastaleeqRegular1610.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 41, in <module>
    print(u"%s %d %d %d %d 0" % (prev_char, 0, 0, width, height))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0645' in position 0: ordinal not in range(128)
Makefile:111: recipe for target 'data/ground-truth/longJameel_Noori_NastaleeqRegular1610.box' failed

I have attached the problematic line image with its .gt.txt file. The files were generated on Windows using GDI and .NET, then imported to Linux. Putting urd.traineddata in place beforehand doesn't help either. Attachment: output.zip

mshakirDr commented 5 years ago

Solution: paste the following at the start of generate_line_box.py. Editing the file does not work under the Windows Subsystem for Linux (Permission denied error), so use a virtual machine or a native Linux machine instead.

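# encoding=utf8
# Python 2 only: reload() and sys.setdefaultencoding() do not exist in Python 3.
import sys
reload(sys)  # restores setdefaultencoding, which site.py removes at startup
sys.setdefaultencoding('utf8')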
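
For anyone who would rather not change the interpreter-wide default encoding, here is a minimal sketch of an alternative: write the box lines through an explicit UTF-8 writer instead of a bare print. The helper names utf8_writer and format_box_line are hypothetical and are not part of generate_line_box.py; only the output layout is taken from the print call in the traceback above. Setting the PYTHONIOENCODING environment variable to utf-8 may be another option that needs no code change.

```python
import codecs
import sys


def utf8_writer(stream):
    """Wrap a stream so that text written to it is encoded as UTF-8.

    Hypothetical helper, not part of tesstrain. On Python 3 the bytes go to
    stream.buffer when it exists; on Python 2 the stream itself accepts bytes.
    """
    raw = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf-8")(raw)


def format_box_line(char, width, height):
    """Build one box line in the layout used by the failing print call."""
    return u"%s %d %d %d %d 0" % (char, 0, 0, width, height)
```

In the script, the idea would be to create the writer once with out = utf8_writer(sys.stdout) and then replace each print with out.write(format_box_line(prev_char, width, height) + u"\n"). This way the encoding no longer depends on the locale or on whether stdout is redirected to a file.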