Open yurytch opened 6 years ago
Which language traineddata are you using currently?
I'm using 'rus' from tessdata_best. I tried adding 'bul' and 'srp', to no avail. It would be great if there were an additional data file just for recognizing those glyphs, including cursive forms (yat!). Does tesseract work like this?
Try with 'rus' from tessdata_fast and see if that is better.
Try the 'pluschar' training using 'rus' from tessdata_best as the continue_from model. Add at least 15 occurrences of each Old Russian / Church Slavonic glyph that you want to add, so that they get picked up in the unicharset.
Also try with script/Cyrillic (or another appropriate script used for Russian).
Please share about 150 lines of training text which has the added glyphs for testing.
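The "at least 15 occurrences" suggestion above can be checked mechanically before training. The following is only a minimal sketch: the helper names `glyph_counts` and `underrepresented` are hypothetical, and the threshold of 15 is simply the number quoted in this thread, not a documented Tesseract limit.

```python
from collections import Counter

# Code points named in this issue: yat, fita, izhitsa (capital and small).
TARGET_GLYPHS = "\u0462\u0463\u0472\u0473\u0474\u0475"
MIN_OCCURRENCES = 15  # figure suggested in the thread, not an official limit


def glyph_counts(text, glyphs=TARGET_GLYPHS):
    """Return how often each target glyph occurs in the training text."""
    counts = Counter(ch for ch in text if ch in glyphs)
    return {g: counts.get(g, 0) for g in glyphs}


def underrepresented(text, glyphs=TARGET_GLYPHS, minimum=MIN_OCCURRENCES):
    """Return the glyphs that occur fewer than `minimum` times."""
    return sorted(g for g, n in glyph_counts(text, glyphs).items() if n < minimum)
```

Running `underrepresented(open("training_text.txt", encoding="utf-8").read())` on a candidate text (the file name is a placeholder) lists the glyphs that still need more examples.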
@Shreeshrii While I'm trying to make sense of that plus-training procedure (your point 2): your pt. 1 doesn't work (more OCR errors with 'rus' from *_fast), and I don't understand your pt. 3: 'rus' is Cyrillic anyway, and 'yat' etc. are Cyrillic, too. Regarding pt. 4: do you mean training text, as for inclusion in the 'rus' training dataset? But wouldn't you want images with real typeset glyphs for that, too?
Ray has trained models for individual languages (e.g. eng, rus) and also for scripts in which several languages are written (e.g. the Latin script for English, French, German, etc.).
My suggestion was for you to use script/Cyrillic and compare the results with rus. If the letters you want to add occur in one of the other languages, they might be recognised.
Re. 4: yes. Along with the training text, you also need a font that renders those glyphs correctly.
Please review the following files:
https://github.com/tesseract-ocr/langdata/tree/master/rus https://github.com/tesseract-ocr/langdata/blob/master/rus/desired_characters
https://github.com/tesseract-ocr/langdata/blob/master/Cyrillic.unicharset
Adding these glyphs will require changes in the langdata repo for rus, e.g. adding these glyphs to the desired_characters file.
Does anybody know of any progress on this subject, i.e. Old Russian support for tesseract?
@maxirmx, maybe you can contribute by reviewing the files named above?
@stweil, thank you.
https://github.com/tesseract-ocr/langdata/tree/master/rus is 'modern Russian'.
I have asked about older Russian, which included three letters that were made obsolete in 1917/1918. They were mentioned at the start of this thread: 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475).
I would imagine additional complications as well, such as a different paragraph sign and the different fonts used at that time.
It is somewhat clear what to do, but I do not want to repeat others' work that might already have been done.
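For reference, the six code points listed in this thread can be turned into characters and official Unicode names with Python's standard `unicodedata` module. This is a small sketch; the `describe` helper is a hypothetical name used only for illustration.

```python
import unicodedata

# Code points quoted in this thread (capital and small forms).
OBSOLETE_CODE_POINTS = [0x0462, 0x0463, 0x0472, 0x0473, 0x0474, 0x0475]


def describe(code_points):
    """Return (U+XXXX label, character, Unicode name) for each code point."""
    return [
        (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp)))
        for cp in code_points
    ]


for label, char, name in describe(OBSOLETE_CODE_POINTS):
    print(label, char, name)
```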
Okay, "Ѣ" and maybe the other older glyphs are also missing in https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Cyrillic/Cyrillic.unicharset.
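Whether a given glyph is present in a unicharset file can be checked with a few lines of Python. This is a rough sketch under a stated assumption about the file layout: the first line holds the entry count and every following line starts with the symbol, followed by a space and its properties. The helper names are hypothetical, and the property fields in the usage example are made up.

```python
def unicharset_symbols(unicharset_text):
    """Collect the leading symbol of every entry line.

    Assumption: line 1 is the entry count; each later line begins with
    the symbol, followed by a space and the symbol's properties.
    """
    lines = unicharset_text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_glyphs(unicharset_text, wanted):
    """Return the wanted glyphs that have no entry in the unicharset."""
    present = unicharset_symbols(unicharset_text)
    return [g for g in wanted if g not in present]


# Tiny made-up sample, not a real Tesseract file.
sample = "3\nNULL 0 Common 0\n\u0430 3 Cyrillic 1\n\u0463 3 Cyrillic 2\n"
print(missing_glyphs(sample, ["\u0463", "\u0462"]))
```

To test a real file, read a downloaded copy of Cyrillic.unicharset with `encoding="utf-8"` and pass its contents as `unicharset_text`.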
So you will need ground truth data to train a new model based on rus.traineddata or Cyrillic.traineddata, but with the additional glyphs. As soon as you have line images with text transcriptions, this process is supported pretty well with tesstrain.
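Before starting a tesstrain run, it can help to verify that every line image has a transcription. The sketch below assumes the usual tesstrain convention of pairing each line image (here `.png` or `.tif`) with a `.gt.txt` file of the same base name in one ground-truth directory. The function name and the directory path in the usage note are hypothetical.

```python
from pathlib import Path

# Assumed image types; adjust to match the files actually used.
IMAGE_SUFFIXES = (".png", ".tif")


def unpaired_line_images(ground_truth_dir):
    """Return names of line images that lack a matching .gt.txt transcription."""
    root = Path(ground_truth_dir)
    missing = []
    for image in sorted(root.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            # line_0001.png is expected to pair with line_0001.gt.txt
            transcript = image.with_suffix(".gt.txt")
            if not transcript.is_file():
                missing.append(image.name)
    return missing
```

For example, `unpaired_line_images("data/oldrus-ground-truth")` (a placeholder path) returns an empty list when every image has its transcription.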
See also issue https://github.com/tesseract-ocr/langdata_lstm/issues/3 which looks like a duplicate. Maybe you can join efforts.
@stweil are there any requirements for the training words/text (apart from the aforementioned 150 lines)? For example, how many times should each new character appear in the training set? Should there be at least one capital and one lowercase instance of each letter, or something like that?
Arbitrary texts in Old Russian can be obtained from ru.wikisource.org, for example. One such text: https://ru.wikisource.org/wiki/%D0%91%D0%BE%D0%B6%D0%B5%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BA%D0%BE%D0%BC%D0%B5%D0%B4%D0%B8%D1%8F_(%D0%94%D0%B0%D0%BD%D1%82%D0%B5;_%D0%9C%D0%B8%D0%BD)/%D0%94%D0%9E.
I still fail to comprehend the process well enough. But I guess I understand why a glyph can't just be 'added' to an existing dataset -- because of how deep learning works, right? But retraining the complete set is rather beyond my resources, in terms of computing power and time.
Here's a thought/question: would it be useful to train a separate (small) set consisting of those missing glyphs and glyphs that look like the missing ones, i.e. consisting of 'YAT's and 'HARD SIGN's? Then one could use it in a set of languages: rus+yat. Would this work at all?
Is it possible to add support for the Old Russian / Church Slavonic glyphs, at least for 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475)?