tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

[Question] Why do generated .box files contain symbols on top of each other? #142

Closed: Asa-Nisi-Masa closed this issue 4 years ago

Asa-Nisi-Masa commented 4 years ago

Hi. It seems that when .box files are generated, all the characters' coordinates are the same, i.e. the characters are essentially on top of each other. Is this intended, and if so, what am I missing?

wrznr commented 4 years ago

It is intended. The coordinates in the box files correspond to the lines which contain the character. Since version 4, Tesseract does not require coordinates on the character level anymore (which makes life a lot easier).
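
To illustrate why every row ends up with identical coordinates, here is a minimal sketch of how line-level boxes could be produced for one line image and its transcription. It assumes the usual box-file row layout `<symbol> <left> <bottom> <right> <top> <page>`. The helper name `line_level_boxes` and the example dimensions are hypothetical; this is not the script tesstrain itself ships, and it leaves out details such as combining characters and any end-of-line marker.

```python
def line_level_boxes(text, width, height, page=0):
    """Return one box-file row per character of a single text line.

    Every row reuses the bounding box of the whole line image
    (0, 0, width, height), so all characters share the same coordinates.
    """
    rows = []
    for char in text:
        # Assumed row layout: <symbol> <left> <bottom> <right> <top> <page>
        rows.append(f"{char} 0 0 {width} {height} {page}")
    return rows


# Hypothetical line image of 120x32 pixels with the transcription "Hi!"
boxes = line_level_boxes("Hi!", width=120, height=32)
print("\n".join(boxes))
```

Each printed row differs only in its first field, which matches the "symbols on top of each other" observation from the question.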

Asa-Nisi-Masa commented 4 years ago

Thank you for the quick response. One more question: if my problem involves real-world images, does it make sense to train Tesseract on 'messy' real-world images (printed text), or does Tesseract prefer synthetic black-on-white text?

wrznr commented 4 years ago

Many people have trained Tesseract quite successfully on images of real printed material. In fact, going beyond synthetic material was the original purpose of the tesstrain tools. @stweil has documented his approach in great detail in the wiki. I have even trained Tesseract on handwritten text, although the documentation is not yet finished. However, good preprocessing is helpful (or even necessary).