tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

Is it possible to add few pre-1918 Russian characters to RUS language files? #3

Open alexei-kouprianov opened 5 years ago

alexei-kouprianov commented 5 years ago

In 1917-1918, Russian orthography was reformed in many ways, including but not limited to the elimination of four letters: I-decimal (now known as "Byelorussian-Ukrainian I"), Yat, Fita, and Izhitsa. The need to OCR texts published in Russia from 1708 through 1918 (and somewhat later) is widely recognised among scholars, but they are largely unfamiliar with the ways Tesseract can be trained to recognise these missing characters (and, I have to confess, the vast majority of ordinary people will be absolutely unable to train Tesseract even if they read the instructions [ https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#fine-tuning-for--a-few-characters ]). See also: https://en.wikipedia.org/wiki/Russian_alphabet#Letters_eliminated_in_1918

Would it be possible to add the following glyphs to the desired characters list for Russian (langdata_lstm/rus/desired_characters)?

§ : Section sign ; Unicode number: U+00A7

І : Cyrillic Capital Letter Byelorussian-Ukrainian I ; Unicode number: U+0406
і : Cyrillic Small Letter Byelorussian-Ukrainian I ; Unicode number: U+0456
Ѣ : Cyrillic Capital Letter Yat ; Unicode number: U+0462
ѣ : Cyrillic Small Letter Yat ; Unicode number: U+0463
Ѳ : Cyrillic Capital Letter Fita ; Unicode number: U+0472
ѳ : Cyrillic Small Letter Fita ; Unicode number: U+0473
Ѵ : Cyrillic Capital Letter Izhitsa ; Unicode number: U+0474
ѵ : Cyrillic Small Letter Izhitsa ; Unicode number: U+0475
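As a rough illustration of the request, the sketch below checks which of the nine glyphs are absent from a given character list. It assumes the contents of langdata_lstm/rus/desired_characters can be read as plain UTF-8 text; the helper names `missing_characters` and `describe` and the sample string are hypothetical, not part of Tesseract or langdata_lstm.

```python
import unicodedata

# Code points requested in this issue, written as escapes to avoid
# confusion with look-alike Latin letters (e.g. U+0456 vs. Latin "i").
REQUESTED = "\u00a7\u0406\u0456\u0462\u0463\u0472\u0473\u0474\u0475"


def missing_characters(desired_text: str) -> list[str]:
    """Return the requested characters that do not occur in desired_text."""
    present = set(desired_text)
    return [ch for ch in REQUESTED if ch not in present]


def describe(ch: str) -> str:
    """Format a character as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Hypothetical sample standing in for the real file contents.
    sample = "абвгд\u00a7"
    for ch in missing_characters(sample):
        print(describe(ch))
```

To run it against the real list, the sample string could be replaced with the text read from the file, opened with `encoding="utf-8"`.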

What else should be provided to add these few characters? A list of words containing these letters? How long should that list be? I am currently working on a project that processes lots of geographic names in pre-1918 Russian (and some other texts), so I can provide a word list of considerable length. For now, I have to resort to OCRing pre-1918 text as if it were post-1918 text and inserting the four missing characters manually (mostly two of them, as Fita and, especially, Izhitsa were much less frequent).
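A word list of the kind mentioned above could be collected from already corrected text with a few lines of Python. This is only a sketch under assumptions: the function name `old_orthography_words` and the sample sentence are made up for illustration, and whether a plain word list is sufficient training input is exactly the open question here.

```python
import re
from collections import Counter

# Lowercase forms of the four eliminated letters:
# U+0456 (i-decimal), U+0463 (yat), U+0473 (fita), U+0475 (izhitsa).
OLD_LETTERS = set("\u0456\u0463\u0473\u0475")


def old_orthography_words(text: str) -> Counter:
    """Count lowercased words that contain at least one eliminated letter."""
    words = (w.lower() for w in re.findall(r"\w+", text))
    return Counter(w for w in words if OLD_LETTERS & set(w))


if __name__ == "__main__":
    # Made-up sample; real input would be proofread pre-1918 text.
    sample = "К\u0456евъ и Москва. Хл\u0463бъ, хл\u0463бъ!"
    for word, count in old_orthography_words(sample).most_common():
        print(count, word)
```

Counting occurrences rather than just collecting a set keeps frequency information, which makes it easy to see how rare Fita and Izhitsa are in a given corpus.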

Or would this rather require a much larger effort, like creating a special rus-old model?

stweil commented 5 years ago

Yes, this is possible. I think the resulting model should not replace rus, but be a new rus_old, because otherwise Tesseract might "recognize" the old characters in modern texts, too.

I assume that the missing section sign will be needed for both rus and rus_old. The Tesseract wiki explains how fixed or new models can be created based on the existing model.

amitdo commented 5 years ago

Your first step should be finding/making ground truth text from images of pre-1918 Russian books and/or newspapers.

stweil commented 5 years ago

The Byelorussian-Ukrainian I (upper and lower case) is included in scripts/Cyrillic.traineddata: I see it in the unicharset file.
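A check like this can be scripted. The sketch below assumes the unicharset has already been extracted to a text file (for example with `combine_tessdata -u`) and that, after the first line holding the entry count, every line starts with the character followed by whitespace; the function names are hypothetical.

```python
def unicharset_symbols(lines: list[str]) -> set[str]:
    """Collect the first whitespace-separated field of each entry line.

    Assumes lines[0] is the entry count and is therefore skipped.
    """
    entries = [ln for ln in lines[1:] if ln.strip()]
    return {ln.split()[0] for ln in entries}


def contains_all(lines: list[str], wanted: str) -> dict[str, bool]:
    """Map each wanted character to whether the unicharset lists it."""
    symbols = unicharset_symbols(lines)
    return {ch: ch in symbols for ch in wanted}


if __name__ == "__main__":
    # Made-up, heavily shortened stand-in for a real unicharset file.
    fake = ["3", "NULL 0 Common 0", "\u0406 5 Cyrillic 1", "\u0456 3 Cyrillic 2"]
    print(contains_all(fake, "\u0406\u0456\u0462"))
```

With a real file, the lines could be obtained via `pathlib.Path(path).read_text(encoding="utf-8").splitlines()`, where `path` points at the extracted unicharset.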

alexei-kouprianov commented 5 years ago

@stweil and @amitdo, thank you for the comments. As I now understand, a new rus_old model is the better solution. I shall try to prepare a set of words in pre-1918 Russian for training and come back to the issue after that. I am not sure I will be able to decipher the training instructions on my own, but they are of no use anyway without a good deal of text to apply them to.