tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Language pack request: Accented Belarusian #299

Closed tryzniak closed 1 year ago

tryzniak commented 1 year ago

Hello. I'd like to do this on my own at first, but I'm a bit unsure how. The idea is similar to what you did in #8, but I want the same for Belarusian. I have a list of accented words; what should I do next? Thank you for any help.

stweil commented 1 year ago

The langdata repository is for legacy models, which are rarely used nowadays. Training such a model basically requires two things: a small training text that contains all desired glyphs (characters), and fonts to render images from that text.
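To check whether a training text really contains all desired glyphs, something like the following could be run against the accented word list. This is only a sketch: the function names and sample words are made up for illustration, and it assumes the accent is written as the combining acute accent (U+0301), with each Unicode code point counted as one glyph.

```python
import unicodedata


def glyphs(text):
    """Return the set of non-whitespace code points in NFC-normalized text."""
    normalized = unicodedata.normalize("NFC", text)
    return {ch for ch in normalized if not ch.isspace()}


def missing_glyphs(desired_words, training_text):
    """Return glyphs used in desired_words that never occur in training_text."""
    wanted = set()
    for word in desired_words:
        wanted |= glyphs(word)
    return sorted(wanted - glyphs(training_text))


# Hypothetical sample data; "\u0301" is the combining acute accent.
words = ["мо\u0301ва", "кні\u0301га"]
training_text = "мова кніга"  # same words, but without accents

missing = missing_glyphs(words, training_text)
for ch in missing:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}")
```

With the sample data above, the only glyph reported as missing is the combining accent itself, so the training text would need accented words added before it covers the word list.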

For "modern" models which use the neural network engine, I suggest using tesstrain with text scanned from printed books. That requires much more work.
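As far as I understand the tesstrain README, ground truth consists of line images paired with single-line transcriptions that share the image's base name and end in `.gt.txt`. A rough way to catch unpaired files before training might look like this. It is a sketch under that assumption: the helper name and the demo file names are hypothetical, and it only handles plain `.tif` and `.png` images.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = (".tif", ".png")  # assumption: plain .tif/.png line images only
TEXT_SUFFIX = ".gt.txt"


def unpaired_lines(ground_truth_dir):
    """Return (images without a transcription, transcriptions without an image)."""
    images, texts = set(), set()
    for path in Path(ground_truth_dir).iterdir():
        name = path.name
        if name.endswith(TEXT_SUFFIX):
            texts.add(name[: -len(TEXT_SUFFIX)])
        elif path.suffix in IMAGE_SUFFIXES:
            images.add(name[: -len(path.suffix)])
    return sorted(images - texts), sorted(texts - images)


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("line1.tif", "line1.gt.txt", "line2.png", "line3.gt.txt"):
        (root / name).touch()
    no_text, no_image = unpaired_lines(root)

print("images without transcription:", no_text)
print("transcriptions without image:", no_image)
```

In the demo, `line2` has an image but no transcription and `line3` has a transcription but no image, so both would need fixing before a training run.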

tryzniak commented 1 year ago

Thank you for the response! I'll try to do it as you suggested. GL