tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Update deu.unicharset #43

Closed by OttoKerner 3 months ago

OttoKerner commented 3 years ago

The character ı (dotless i) is not part of the German alphabet and is not commonly used in German texts. Its only effect is to frequently corrupt OCR results, because it is mistakenly recognized in place of an i.
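Until the model itself changes, the misrecognition described above can be masked after the fact. The following is a minimal sketch of such a post-processing step, not something proposed in this thread: it assumes the OCR output is already available as a Python string, and the function name `replace_dotless_i` is hypothetical. It would also alter legitimate Turkish names, so it only suits text known to be purely German.

```python
# Hypothetical post-processing helper; not part of Tesseract or langdata_lstm.
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def replace_dotless_i(text: str) -> str:
    """Return text with every dotless i replaced by a regular i."""
    return text.replace(DOTLESS_I, "i")


if __name__ == "__main__":
    # Made-up example of OCR output with misrecognized characters.
    print(replace_dotless_i("B\u0131tte h\u0131er kl\u0131cken"))
```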

stweil commented 3 years ago

By now that character is common even in German texts (especially in names); see the file deu.training_text. Updating deu.unicharset won't help as long as the training text adds that character again.

I am afraid your change has to wait until there is a new training run with a different training text for deu. At that point deu.unicharset will be created automatically, so any manual changes would be overwritten anyway.

I wonder why the unicharset files are included in langdata_lstm at all. Maybe we should remove all of them.

OttoKerner commented 3 years ago

Is there any documentation on how these training texts were generated? Even a cursory glance at the German one tells me that Turkish words are clearly over-represented in it.
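The impression of over-representation could be checked with a short script. The sketch below is only an illustration under stated assumptions: it treats the training text as whitespace-separated words, one or more per line, and the function name `words_with_char` is hypothetical. It counts how often each word containing ı occurs.

```python
from collections import Counter

DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def words_with_char(lines, char=DOTLESS_I):
    """Count whitespace-separated words that contain `char`."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            if char in word:
                counts[word] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; real input would be the lines of the training text.
    sample = ["Y\u0131ld\u0131z und Y\u0131ld\u0131z", "kein Treffer hier"]
    for word, n in words_with_char(sample).most_common(10):
        print(word, n)
```

To run it against the real data, pass an open file handle instead of the sample list, for example `open("deu.training_text", encoding="utf-8")`, assuming the file sits in the current directory.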

stweil commented 3 years ago

No, sorry, we don't know the details of the training, which was done by Google. It looks like many training texts were extracted from web pages. Here in Mannheim, Turkish words are very common in my neighborhood.