tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

generate_line_box.py should bomb out if GT has multiple lines #162

Closed. juliangilbey closed this issue 4 years ago

juliangilbey commented 4 years ago

Hi!

Because the generated box files are useless in the case of a multiline image, the current code is very misleading and breaks training if fed multiline ground truths/images. (I only just realised why my training is not working...)

My suggestion is that the code is modified in two ways:

(1) If len(lines) > 1, then exit with a suitably informative message.

(2) Remove the for loop, and replace "line.strip" with "lines[0].strip" in the normalize line.
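
A minimal sketch of what the two changes above could look like, assuming the script reads the transcription into a list called lines and NFC-normalizes the text before writing boxes. The function name read_single_line_transcription and the wording of the error message are hypothetical and not taken from generate_line_box.py:

```python
import io
import sys
import unicodedata


def read_single_line_transcription(txt_path):
    """Return the NFC-normalized transcription, or exit if it spans several lines.

    Hypothetical helper: the name and the message text are illustrative only.
    """
    with io.open(txt_path, "r", encoding="utf-8") as f:
        # Leading/trailing whitespace (including a final newline) is dropped
        # first, so a single line followed by "\n" still counts as one line.
        lines = f.read().strip().split("\n")

    # Change (1): refuse multi-line ground truth with an informative message.
    if len(lines) > 1:
        sys.exit(
            f"ERROR: {txt_path} contains {len(lines)} lines; "
            "ground truth must be a single line of text."
        )

    # Change (2): no loop over lines, just normalize the only line.
    return unicodedata.normalize("NFC", lines[0].strip())
```

Passing a string to sys.exit prints it to stderr and ends the process with exit status 1, so a make-driven training run would stop at the offending file instead of silently producing an unusable box file.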

Best wishes,

Julian

wrznr commented 4 years ago

@juliangilbey Thanks for the suggestions!

However, I would not say that the code is “misleading”: The documentation clearly states:

Transcriptions must be single-line plain text ...

There are suggestions for multi-line training by @Shreeshrii which are also mentioned in the README.

kba commented 4 years ago

The generate_line_box.py script DOES expect single-line GT though, so an error if there's more than one line in the file would not hurt IMHO.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

juliangilbey commented 4 years ago

Thanks, stale bot! The issue still remains, though: just because the documentation says not to use multi-line ground truth doesn't mean that people will notice that line of the documentation or remember it. Generating an error message would be much more helpful.

wrznr commented 4 years ago

@juliangilbey @kba Agreed.