tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Can't encode transcription: '| ঢাকা মেটো-গ |' in language '' #30

Closed: xhuvom closed this issue 5 years ago

xhuvom commented 5 years ago

I am trying to train a new language with ben.traineddata. When I provide sample training data (lprBD-7.gt.txt and the matching .tif image), I get the following error:

Can't encode transcription: '| ঢাকা মেটো-গ |' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff8d ffffffe0 ffffffa6 ffffffb0 ffffffe0 ffffffa7 ffffff8b 2d 20 ffffffe0 ffffffa6 ffffff97 20 7c 20 ffffffe0 ffffffa5 ffffffa4
Can't encode transcription: '\ ঢাকা মেট্রো- গ | ।' in language ''
2 Percent improvement time=0, best error was 7.2 @ 271
At iteration 271/2700/153861, Mean rms=0.211%, delta=0%, char train=0%, word train=0%, skip ratio=5600%, New best char error = 0 wrote best model:data/checkpoints/BigBenww0_271.checkpoint wrote checkpoint.
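The "Failure bytes" in the log look like UTF-8 bytes printed as sign-extended 32-bit hex values (for example `ffffffe0` for `0xe0`). A minimal sketch, assuming that reading is correct, recovers the text by keeping only the low byte of each value. The helper name `decode_failure_bytes` is hypothetical and is not part of Tesseract or tesstrain.

```python
import unicodedata

# Failure bytes copied from the log above.
FAILURE_BYTES = (
    "ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 "
    "ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 "
    "ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff8d ffffffe0 ffffffa6 "
    "ffffffb0 ffffffe0 ffffffa7 ffffff8b 2d 20 ffffffe0 ffffffa6 ffffff97 20 "
    "7c 20 ffffffe0 ffffffa5 ffffffa4"
)


def decode_failure_bytes(dump: str) -> str:
    """Decode a whitespace-separated hex dump into text.

    Assumption: each token is one UTF-8 byte, possibly sign-extended,
    so only the lowest 8 bits of each value are meaningful.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


if __name__ == "__main__":
    text = decode_failure_bytes(FAILURE_BYTES)
    print(repr(text))
    # List every code point so combining marks and punctuation are visible.
    for ch in text:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch, 'UNKNOWN')}")
```

Listing the code points one by one makes it easier to compare them against the characters in the unicharset of the starting model. The decoded text begins partway through the transcription, which suggests (but does not prove) that encoding broke at that position.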

What changes should be made in order to train a new language?

vijayrajasekaran commented 5 years ago

@xhuvom I am also facing the same issue. Were you able to resolve this?