Closed OttoKerner closed 3 months ago
Meanwhile that character is common even in German texts (especially in names), see file deu.training_text. Updating deu.unicharset won't help as long as the training text adds that character again.
I am afraid your change has to wait until there is a new training with different training text for deu. And then deu.unicharset will be created automatically, so any manual changes are overwritten anyway.
I wonder why the unicharset files are included in langdata_lstm at all. Maybe we should remove all of them.
Is there any documentation on how these training texts are generated? Even a cursory glance at the text tells me that Turkish words are clearly over-represented in it.
No, sorry, we don't know details about the training which was done by Google. It looks like many training texts were extracted from web pages. Here in Mannheim Turkish words are very present in my neighborhood.
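For anyone who wants to check how often ı really occurs in a training text, a rough count could look like the sketch below. It is only an illustration: the helper name `words_with_char` and the sample lines are made up here, and nothing in it comes from the actual training setup.

```python
from collections import Counter

DOTLESS_I = "\u0131"  # ı, LATIN SMALL LETTER DOTLESS I


def words_with_char(lines, ch=DOTLESS_I):
    """Count whitespace-separated words that contain `ch`.

    `lines` can be any iterable of strings, e.g. an open text file.
    """
    counts = Counter()
    for line in lines:
        for word in line.split():
            if ch in word:
                counts[word] += 1
    return counts


# Made-up sample lines, NOT taken from deu.training_text.
sample = [
    "Der Laden Yıldız liegt in Mannheim",
    "Herr Yıldız und Frau Müller",
    "Die Straße ist lang",
]

hits = words_with_char(sample)
# Total number of affected words, then the most frequent ones.
print(sum(hits.values()), hits.most_common(3))
```

To run it on the real file, something like `words_with_char(open("deu.training_text", encoding="utf-8"))` should work, assuming the file is UTF-8 encoded and sits in the current directory.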
The character ı is not part of the German alphabet. It is not commonly used in German texts. All it does is very frequently mess up OCR results, because it is mistakenly recognized instead of an i.