tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Box files generated on the basis of ground truth text. #181

Closed talha1503 closed 4 years ago

talha1503 commented 4 years ago

Hi! I noticed that when we put different ground truth for the same image, we get different box files. All the characters have the same coordinates, because a word-level string is considered instead of individual characters. So while creating the ground truth, suppose my image is like this: [image 1]

Will the amount of space between any two consecutive words matter? Could you please have a look at my ground truth? I have kept just a single space between the words and numbers. @Shreeshrii sir, can you please help me out? 1.gt.txt

kba commented 4 years ago

> I noticed that when we put different ground truth for the same image, we get different box files.

For RTL languages like Arabic, the box file generation script just runs the GT text through bidi.algorithm.get_display and creates a box file with a single WordStr box, so that is to be expected. IIUC, the character/word boxes are a legacy from Tesseract <= 3; we only need them to create the .lstmf files.
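To make the two box styles concrete, here is a minimal sketch; it is not the actual tesstrain script. The function names `char_boxes` and `wordstr_box` and the `reorder` parameter are hypothetical, and the line layout assumes the usual Tesseract box format (`symbol left bottom right top page`). The real scripts may emit additional lines that this sketch omits. It shows why the box file depends only on the GT text and the image size: every entry gets the rectangle of the whole line.

```python
from typing import Callable, List, Optional


def char_boxes(text: str, width: int, height: int) -> List[str]:
    """Return one box line per character of the GT text.

    Every character is given the same rectangle (the full line image),
    so changing the GT text changes the box file, not the coordinates.
    """
    return [f"{ch} 0 0 {width} {height} 0" for ch in text]


def wordstr_box(
    text: str,
    width: int,
    height: int,
    reorder: Optional[Callable[[str], str]] = None,
) -> List[str]:
    """Return a single WordStr box covering the whole line.

    `reorder` is an optional hook (an assumption of this sketch) that
    converts the text to display order before it is written out.
    """
    if reorder is not None:
        text = reorder(text)
    return [f"WordStr 0 0 {width} {height} 0 #{text}"]


if __name__ == "__main__":
    # Same image size, different GT text: the box lines differ,
    # but the coordinates are identical in every line.
    for line in char_boxes("ab", 100, 30):
        print(line)
    for line in wordstr_box("ab c", 100, 30):
        print(line)
```

For an RTL line, one could pass `bidi.algorithm.get_display` from the python-bidi package as `reorder`, which matches the behaviour described above.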

> Will the amount of space between any two consecutive words matter?

In my experience, no: you should not try to represent a larger gap in the GT with multiple spaces. That experience is based on LTR Latin-based scripts, though; @Shreeshrii knows best about RTL scripts.
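If you want to enforce single spacing before training, a small clean-up step along these lines might help. This is only a sketch: the helper name `normalize_gt_spaces` is hypothetical and not part of tesstrain, and it assumes each `.gt.txt` file holds one line of text.

```python
import re


def normalize_gt_spaces(text: str) -> str:
    """Collapse each run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_gt_spaces("Invoice   No.    1234 "))
```

As noted above, this advice comes from experience with LTR Latin-based scripts, so treat it with care for other scripts.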