Box files generated on the basis of ground truth text.

tesseract-ocr / tesstrain

Train Tesseract LSTM with make

Apache License 2.0


I noticed that when we supply different ground truth text for the same image, we get different box files.

For RTL languages like Arabic, the box file generation script just runs the GT text through bidi.algorithm.get_display and creates a box file with a single WordStr box, so a box file that changes with the GT text is to be expected. IIUC, the character/word boxes are a legacy from Tesseract <= 3; we only need them to create the .lstmf files.
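To illustrate why the box file follows the GT text, here is a minimal sketch of building a single WordStr entry. It is not the actual tesstrain script: the function name wordstr_box, the to_visual parameter, and the exact coordinates on the trailing tab line are assumptions made for illustration.

```python
from typing import Callable


def wordstr_box(
    gt_text: str,
    width: int,
    height: int,
    to_visual: Callable[[str], str] = lambda s: s,
) -> str:
    """Build a two-line WordStr box entry for one line image (hypothetical helper).

    to_visual reorders the text for display; the default leaves it unchanged.
    """
    text = to_visual(gt_text.strip())
    # One box spanning the whole image; the transcription follows the '#'.
    word_line = f"WordStr 0 0 {width} {height} 0 #{text}"
    # Tab entry marking the end of the text line (coordinates are an assumption).
    eol_line = f"\t {width} {height} {width + 1} {height + 1} 0"
    return word_line + "\n" + eol_line


if __name__ == "__main__":
    print(wordstr_box("ground truth A", 200, 40))
    print(wordstr_box("ground truth B", 200, 40))
```

Because the transcription is embedded in the box line, two different GT strings for the same image necessarily give two different box files. For RTL text, one could pass to_visual=bidi.algorithm.get_display from the python-bidi package, as the script described above does.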

Will the amount of space between any two consecutive words matter?

In my experience, no: you should not try to represent a larger gap in the GT with multiple spaces. That experience is based on LTR Latin-based scripts, though; @Shreeshrii knows best about RTL scripts.
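If existing GT files already contain runs of spaces, one way to follow this advice is to collapse them before training. This is only a sketch of a preprocessing step you would run yourself; normalize_gt_spaces is a hypothetical helper, not part of tesstrain.

```python
import re


def normalize_gt_spaces(line: str) -> str:
    """Collapse runs of spaces/tabs into one space and trim both ends."""
    return re.sub(r"[ \t]+", " ", line).strip()


if __name__ == "__main__":
    print(normalize_gt_spaces("wide     gap   here "))
```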


Box files generated on the basis of ground truth text. #181