tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models
Apache License 2.0

Russian language #100

Open juhnowski opened 6 years ago

juhnowski commented 6 years ago

The Russian dictionary is of very low quality: obscene words get inserted into the recognized text. Look attentively at the committers of the dictionary and consider whether it is worth continuing cooperation with them.

amitdo commented 6 years ago

> Look attentively at the committers of the dictionary and consider whether it is worth continuing cooperation with them.

We'll fire the bots... :laughing:

juhnowski commented 6 years ago

Do you need help with the shooting?

amitdo commented 6 years ago

From https://github.com/tesseract-ocr/tessdata/issues/62#issuecomment-319839971

theraysmith commented on Aug 3, 2017

FYI: The wordlists are generated files, so it isn't a good idea to modify them, as the modifications will likely get overwritten in a future training. To help prevent the ß/B confusion, the words that you want to lose from the wordlists need to go in langdata/lang/lang.bad_words.

See also page 8 in https://github.com/tesseract-ocr/docs/raw/master/das_tutorial2016/6ModernizationEfforts.pdf.
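
To check locally what a wordlist would look like once the unwanted entries are listed in a bad_words file, something like the following could be used. This is only a rough sketch for previewing, not Tesseract's training pipeline: the function names are made up for illustration, and it assumes both files are UTF-8 with one word per line. It deliberately does not write anything back, since the wordlists are generated files.

```python
from pathlib import Path


def read_words(path):
    """Return the non-empty lines of a UTF-8 word file, one word per line."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def drop_bad_words(words, bad_words):
    """Return `words` in their original order, minus entries listed in `bad_words`."""
    banned = set(bad_words)  # set lookup keeps the filter fast for large lists
    return [word for word in words if word not in banned]


def preview_filtered_wordlist(wordlist_path, bad_words_path):
    """Read-only preview: the wordlist with the bad words removed."""
    return drop_bad_words(read_words(wordlist_path), read_words(bad_words_path))
```

For example, `drop_bad_words(["Straße", "StraBe", "Haus"], ["StraBe"])` keeps only the two correctly spelled words. The paths passed to `preview_filtered_wordlist` would follow the `langdata/lang/lang.bad_words` pattern quoted above; the exact file names for a given language are an assumption to verify against the langdata repository.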