tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

Question: What is a line image? #290

Closed NeilduToit13 closed 2 years ago

NeilduToit13 commented 2 years ago

Regarding: "Place ground truth consisting of line images and transcriptions in the folder data/MODEL_NAME-ground-truth." and "Transcriptions must be single-line plain text"

Is it possible for my line images and single-line plain text to consist of just 1 word? Will this impact training at all? Or do I have to use a line with several words for the ground truth? Thank you!

TheFattestTony commented 2 years ago

Don't know, man. I think in this case the only way to get an answer is to run the full training and then test the result.

kba commented 2 years ago

"Is it possible for my line images and single-line plain text to consist of just 1 word? Will this impact training at all? Or do I have to use a line with several words for the ground truth? Thank you!"

Sure, a line can be a single word or even a single character (like a page number), but the less content there is in a line, the less context is available for the neural network to learn from. A few very short lines don't hurt training AFAICT, but the overall training set should be representative of the data you want to recognize later on.

The only thing really forbidden is ground truth text with newlines in it, i.e. multi-line "line" images.
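To check a ground truth folder against the two points above before training, something like the following Python sketch might help. It assumes the transcriptions are stored as `*.gt.txt` files next to the line images, and that a single trailing newline at the end of a file is harmless; that tolerance is an assumption of this sketch, not something confirmed in this thread. `check_ground_truth` is a hypothetical helper name and is not part of tesstrain.

```python
from collections import Counter
from pathlib import Path


def check_ground_truth(gt_dir):
    """Scan *.gt.txt files in gt_dir.

    Returns a tuple (multiline, words_per_line):
      multiline       sorted names of files whose text spans more than one line
      words_per_line  Counter mapping word count to number of single-line files
    """
    multiline = []
    words_per_line = Counter()
    for path in sorted(Path(gt_dir).glob("*.gt.txt")):
        # Text mode normalizes Windows/Mac line endings to "\n".
        text = path.read_text(encoding="utf-8")
        # Assumption: one trailing newline is tolerated; strip at most one.
        body = text[:-1] if text.endswith("\n") else text
        if "\n" in body:
            # Any remaining line break means a multi-line "line" transcription.
            multiline.append(path.name)
            continue
        # Whitespace-separated tokens; only a rough notion of "word".
        words_per_line[len(body.split())] += 1
    return multiline, words_per_line
```

Calling `check_ground_truth("data/MODEL_NAME-ground-truth")` returns the files that need fixing plus a rough histogram of words per line. If nearly every line has one word but the pages to be recognized later contain full sentences, the set is probably not representative.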

kba commented 2 years ago

"performing the full training followed by tests"

That is always best, obviously, and feel free to share your findings here as well if you do.

stweil commented 2 years ago

@NeilduToit13, I think your question was answered, so I am closing this issue.