tesseract-ocr / tesstrain

Train Tesseract LSTM with make
Apache License 2.0

generate_line_box.py fails with empty .txt #334

Closed: jbarth-ubhd closed this issue 1 year ago

jbarth-ubhd commented 1 year ago

Sometimes the test images contain no text, only horizontal dividers and the like, so the ground truth .txt is empty ("no text"). In that case generate_line_box.py fails.

I'd suggest this:

diff --git a/generate_line_box.py b/generate_line_box.py
index fa8057c..e3aa3d2 100755
--- a/generate_line_box.py
+++ b/generate_line_box.py
@@ -32,6 +32,9 @@ with io.open(args.txt, "r", encoding='utf-8') as f:
         raise ValueError("ERROR: %s: Ground truth text file should contain exactly one line, not %s" % (args.txt, len(lines)))
     line = unicodedata.normalize('NFC', lines[0].strip())

+if len(line)==0:
+    line=' '
+
 if line:
     for i in range(1, len(line)):
         char = line[i]

zdenop commented 1 year ago

Can you please provide a valid example of such a case? IMHO an empty txt is an error that should be fixed, not accepted (with a workaround).

jbarth-ubhd commented 1 year ago

For example, this scan (image "grafik") is recognized as " r " (including a space before and after the r).

I first corrected this to "   " (three spaces, assuming the spaces before and after are mandatory).

So images that do not really contain text should not be used for training at all, and the ground truth should have no spaces before or after the letters (?)
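
If that is the conclusion, one way to act on it is to find the affected ground truth files before training, so they can be removed together with their images. The following is only a minimal sketch and not part of tesstrain: the helper names `normalize_ground_truth` and `find_empty_ground_truth` are hypothetical, and the `*.gt.txt` file pattern is an assumption about how the ground truth directory is laid out. The normalization mirrors the strip and NFC steps visible in the diff above.

```python
import unicodedata
from pathlib import Path


def normalize_ground_truth(text: str) -> str:
    """Strip surrounding whitespace and NFC-normalize, as the script does for its one line."""
    return unicodedata.normalize("NFC", text.strip())


def find_empty_ground_truth(directory, pattern="*.gt.txt"):
    """Return the ground truth files (sorted by path) that contain no text after stripping.

    Both helper names and the default pattern are illustrative assumptions.
    """
    empty = []
    for path in sorted(Path(directory).glob(pattern)):
        content = path.read_text(encoding="utf-8")
        if not normalize_ground_truth(content):
            empty.append(path)
    return empty
```

A file holding only spaces (such as the three-space correction above) is reported as empty, while " r " normalizes to "r" and is kept. Whether the reported files should be deleted or corrected by hand is left to whoever curates the training set.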