tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Devanagari script box files not being generated #6

Closed: Shreeshrii closed this issue 6 years ago

Shreeshrii commented 6 years ago
~/ocrd-train$ make unicharset
python generate_line_box.py -i "data/train/devatest-0001-010001.tif" -t "data/train/devatest-0001-010001-gt.txt" > "data/train/devatest-0001-010001.box"
Traceback (most recent call last):
  File "generate_line_box.py", line 39, in <module>
    if not unicodedata.combining(line[-1]):
IndexError: string index out of range
Makefile:92: recipe for target 'data/train/devatest-0001-010001.box' failed
make: *** [data/train/devatest-0001-010001.box] Error 1

Shreeshrii commented 6 years ago

I am getting errors while creating the box file for Devanagari script.

I am attaching a zip file with the tif, the gt.txt and the generated box file.

devatest-0001-010001-gt.zip

Shreeshrii commented 6 years ago

I had generated the ground-truth files using tesseract, which appends a form feed (FF) character as a page separator to the OCRed text file. That was the cause of the error.

I changed the command to the following to get rid of the problem:

tesseract --tessdata-dir ../tessdata   "${img_file}" "${img_file%.*}-gt"  --psm 6  --oem 1  -l san -c page_separator=''
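
For ground-truth files that were already generated with the default page separator, a small cleanup script may be easier than re-running tesseract. The following is only a minimal sketch and not part of ocrd-train: the function names strip_form_feeds and clean_gt_files are hypothetical, and the data/train/*-gt.txt pattern is assumed from the paths in the traceback above. It also assumes that lines left empty after removing the form feed can safely be dropped.

```python
import glob


def strip_form_feeds(text):
    """Remove form feed (FF, U+000C) characters and drop lines left empty."""
    lines = text.replace("\x0c", "").split("\n")
    # Keep only lines that still contain text after trimming whitespace.
    kept = [line.strip() for line in lines if line.strip()]
    return "\n".join(kept) + "\n" if kept else ""


def clean_gt_files(pattern="data/train/*-gt.txt"):
    """Rewrite every matching ground-truth file in place (hypothetical helper)."""
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            original = f.read()
        cleaned = strip_form_feeds(original)
        # Only touch files that actually change.
        if cleaned != original:
            with open(path, "w", encoding="utf-8") as f:
                f.write(cleaned)


if __name__ == "__main__":
    clean_gt_files()
```

Because the files are rewritten in place, it is probably worth running this on a copy of the data first.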