tesseract-ocr / tessdata_best

Best (most accurate) trained LSTM models.
Apache License 2.0

old russian / church slavonic glyphs? #24

Open yurytch opened 6 years ago

yurytch commented 6 years ago

Is it possible to add support for the Old Russian / Church Slavonic glyphs, at least for 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475)?
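For reference, the six code points named in the request can be listed with Python's standard `unicodedata` module. This is only a small sketch for identifying the characters; the constant and function names are illustrative and not part of Tesseract.

```python
import unicodedata

# Code points quoted in the request above.
REQUESTED_CODE_POINTS = [0x0462, 0x0463, 0x0472, 0x0473, 0x0474, 0x0475]


def describe(code_points):
    """Return (U+XXXX label, character, Unicode name) for each code point."""
    return [
        (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp)))
        for cp in code_points
    ]


if __name__ == "__main__":
    for label, char, name in describe(REQUESTED_CODE_POINTS):
        print(label, char, name)
```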

Shreeshrii commented 6 years ago

Which language traineddata are you using currently?

yurytch commented 6 years ago

I'm using 'rus' from tessdata_best. I tried adding 'bul' and 'srp', to no avail. It would be great if there were an additional data file just for recognizing those glyphs, including in cursive (yat!). Does tesseract work like this?

Shreeshrii commented 6 years ago
  1. Try with 'rus' from tessdata_fast and see if that is better.

  2. Try the 'pluschar' training using 'rus' from tessdata_best as the continue_from model. Add at least 15 occurrences of the Old Russian / Church Slavonic glyphs that you want to add so that they get picked up in the unicharset.

  3. Also try with script/Cyrillic (or another appropriate script used for Russian).

  4. Please share about 150 lines of training text which has the added glyphs for testing.
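Before starting the training in point 2, a candidate training text can be checked against the "at least 15 occurrences" suggestion. The following is a minimal sketch that assumes the threshold applies to each glyph separately (the thread does not say so explicitly); the glyph list and function name are illustrative.

```python
from collections import Counter

# Glyphs requested in this thread: yat, fita, izhitsa (capital and small).
REQUIRED_GLYPHS = "ѢѣѲѳѴѵ"
# Threshold from point 2 above; applying it per glyph is an assumption.
MIN_OCCURRENCES = 15


def underrepresented(text, glyphs=REQUIRED_GLYPHS, minimum=MIN_OCCURRENCES):
    """Return {glyph: count} for every glyph seen fewer than `minimum` times."""
    counts = Counter(text)
    return {g: counts[g] for g in glyphs if counts[g] < minimum}
```

An empty result means every listed glyph reaches the threshold. To check a file, pass its contents, for example `underrepresented(open("training_text.txt", encoding="utf-8").read())`, where the file name is hypothetical.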

yurytch commented 6 years ago

@Shreeshrii While I'm trying to make sense of that plus-training procedure (your point 2): your pt. 1 doesn't work (more OCR errors with 'rus' from *_fast), and I don't understand your pt. 3: 'rus' is Cyrillic anyway, and 'yat' etc. are Cyrillic, too. Regarding pt. 4: do you mean the training text, as for inclusion in the 'rus' training dataset? But wouldn't you want the graphics with real typeset glyphs for that, too?

Shreeshrii commented 6 years ago

Ray has trained models for languages, e.g. eng and rus, and also for the scripts in which various languages are written, e.g. the Latin script for English, French, German, etc.

My suggestion was for you to use script/Cyrillic to compare results with rus. If the letters you want to add are in one of the other languages, they might be recognised.

Re. 4: yes, along with the training text, you also need a font that will render those glyphs correctly.

Shreeshrii commented 6 years ago

Please review the following files:

https://github.com/tesseract-ocr/langdata/tree/master/rus

https://github.com/tesseract-ocr/langdata/blob/master/rus/desired_characters

https://github.com/tesseract-ocr/langdata/blob/master/Cyrillic.unicharset

Adding these glyphs will require changes in the langdata repo for rus, e.g. adding these glyphs to the desired_characters file.
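When reviewing a unicharset file, one quick check is whether the glyphs appear in it at all. The sketch below assumes the usual text layout of a unicharset (first line is the entry count, and each following line starts with the character, followed by a space and its properties); the function names are illustrative and not part of Tesseract.

```python
def unichars_in(unicharset_text):
    """Collect the first token of each entry line in a unicharset file.

    Assumption: line 1 holds the number of entries and every later line
    begins with the unichar, separated from its properties by a space.
    """
    lines = unicharset_text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_glyphs(unicharset_text, glyphs):
    """Return the glyphs that have no entry in the unicharset, in input order."""
    present = unichars_in(unicharset_text)
    return [g for g in glyphs if g not in present]
```

For example, `missing_glyphs(text, "ѢѣѲѳѴѵ")`, with `text` read from a downloaded copy of `Cyrillic.unicharset`, lists which of the six requested characters are absent.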

maxirmx commented 3 years ago

Does anybody know of any progress on this subject, i.e. Old Russian support for tesseract?

stweil commented 3 years ago

@maxirmx, maybe you can contribute by reviewing the files named above?

maxirmx commented 3 years ago

@stweil, thank you.
https://github.com/tesseract-ocr/langdata/tree/master/rus is 'modern Russian'.
I have asked about older Russian, which included three letters that were made obsolete in 1917/1918. They were mentioned at the start of this thread: 'yat' (U+0462, U+0463), 'fita' (U+0472, U+0473), and 'izhitsa' (U+0474, U+0475). I would imagine additional complications as well, such as a different paragraph sign and the different fonts used at that time.

It is somewhat clear what to do, but I do not want to repeat others' work that might have been done already.

stweil commented 3 years ago

Okay, "Ѣ" and maybe the other older glyphs are also missing in https://github.com/tesseract-ocr/langdata_lstm/blob/master/script/Cyrillic/Cyrillic.unicharset.

So you will need ground truth data to train a new model based on rus.traineddata or Cyrillic.traineddata, but with the additional glyphs. As soon as you have line images with text transcriptions, this process is supported pretty well with tesstrain.
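Ground truth in this sense means pairs of a line image and its transcription. As a sketch of a sanity check before training, the code below assumes the naming convention in which each image (`.png` or `.tif`) sits next to a transcription with the same base name and a `.gt.txt` extension; that convention and the function name are assumptions here and should be checked against the tesstrain README.

```python
from pathlib import Path

# Assumed image extensions for line images.
IMAGE_SUFFIXES = {".png", ".tif"}


def unpaired_images(ground_truth_dir):
    """Return names of line images that lack a matching '<name>.gt.txt' file."""
    root = Path(ground_truth_dir)
    missing = []
    for image in sorted(root.iterdir()):
        if image.suffix.lower() in IMAGE_SUFFIXES:
            transcript = image.with_name(image.stem + ".gt.txt")
            if not transcript.is_file():
                missing.append(image.name)
    return missing
```

An empty list means every line image in the directory has a transcription beside it.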

stweil commented 3 years ago

See also issue https://github.com/tesseract-ocr/langdata_lstm/issues/3 which looks like a duplicate. Maybe you can join efforts.

dvrogozh commented 3 years ago

@stweil are there any requirements for the training words/text (apart from the aforementioned 150 lines)? For example, how many times should each new character occur in the training set? Should there be at least one capital and one lowercase letter, or something like that?

Arbitrary texts in Old Russian can be obtained from ru.wikisource.org. For example, https://ru.wikisource.org/wiki/%D0%91%D0%BE%D0%B6%D0%B5%D1%81%D1%82%D0%B2%D0%B5%D0%BD%D0%BD%D0%B0%D1%8F_%D0%BA%D0%BE%D0%BC%D0%B5%D0%B4%D0%B8%D1%8F_(%D0%94%D0%B0%D0%BD%D1%82%D0%B5;_%D0%9C%D0%B8%D0%BD)/%D0%94%D0%9E.

yurytch commented 3 years ago

I still fail to comprehend the process well enough. But I guess I understand why a glyph can't simply be 'added' to an existing dataset: because of how deep learning works, right? But retraining the complete set is rather beyond my resources, in terms of computing power and time.

Here's a thought/question: would it be useful to train a separate (small) set consisting of those missing glyphs and glyphs that look like the missing ones, i.e. consisting of 'YAT's and 'HARD SIGN's? Then one could use it in a set of languages: rus+yat. Would this work at all?