tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Trailing spaces on line 27 of eng.punc #28

Open juliangilbey opened 4 years ago

juliangilbey commented 4 years ago

I've not yet worked out whether eng.punc is used by the LSTM mode of tesseract, but I discovered that there are two trailing spaces on line 27 of this file, which might cause the occasional problem.
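For anyone who wants to check other files for the same thing, here is a minimal sketch that reports how many trailing spaces or tabs each line carries. The function name and the sample lines are made up for illustration; they are not the real contents of eng.punc.

```python
def find_trailing_whitespace(lines):
    """Return (line_number, trailing_count) for each line that ends in spaces or tabs."""
    hits = []
    for number, line in enumerate(lines, start=1):
        body = line.rstrip("\r\n")       # drop the line terminator only
        stripped = body.rstrip(" \t")    # then drop trailing spaces and tabs
        if len(stripped) != len(body):
            hits.append((number, len(body) - len(stripped)))
    return hits


if __name__ == "__main__":
    # Hypothetical sample data, not taken from eng.punc.
    sample = ["abc\n", "abc  \n", "x \t\r\n", ""]
    for number, count in find_trailing_whitespace(sample):
        print(f"line {number}: {count} trailing whitespace character(s)")
```

To run it on a real file, something like `find_trailing_whitespace(open(path, encoding="utf-8"))` should work, assuming the file is UTF-8 text.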

stweil commented 4 years ago

Which occasional problem are you referring to? If there is a problem, you can create a new traineddata file without those spaces and see whether that fixes the problem.

stweil commented 4 years ago

See line 27 of eng.punc. The trailing spaces are also present in eng.traineddata, where they appear on 17 lines. It looks like other languages have them, too.

stweil commented 4 years ago

LSTM and legacy mode use different punc components from the traineddata file, but both have the trailing spaces.

juliangilbey commented 4 years ago

AFAICT, the space on each line marks where "word characters" (non-punctuation symbols; "alphanumerics" for lack of a better term) are expected to appear. So line 1 consists of a single space, indicating a sequence of [A-Z...] with no punctuation, and other lines have a trailing space to indicate initial punctuation followed by word characters. Except for line 27, every line has precisely one space. I hope that makes sense.
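If that reading is right, the "exactly one space per line" rule is easy to check mechanically. The following is only a sketch under that assumption; the function name and sample patterns are hypothetical and do not come from eng.punc.

```python
def lines_without_single_space(lines):
    """Return (line_number, space_count) for lines that do not contain exactly one space."""
    anomalies = []
    for number, line in enumerate(lines, start=1):
        count = line.rstrip("\r\n").count(" ")  # count spaces, ignoring the line ending
        if count != 1:
            anomalies.append((number, count))
    return anomalies


if __name__ == "__main__":
    # Hypothetical patterns: a lone space, leading punctuation, trailing punctuation,
    # a line with two spaces, and a line with none.
    sample = [" \n", "( \n", " )\n", "(  \n", "x\n"]
    for number, count in lines_without_single_space(sample):
        print(f"line {number}: {count} space(s), expected exactly 1")
```

On a file that follows the pattern described above, this should report nothing; a line with extra trailing spaces would show up with a count greater than 1.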

I haven't detected an actual problem yet, but any such problem would likely be very subtle.