kcobra / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Remove archaic letters from Georgian training_text #1376

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Following up on this thread: 
https://groups.google.com/forum/#!topic/tesseract-ocr/e_zB5KpRODI

The file 
https://code.google.com/p/tesseract-ocr/source/browse/kat/kat.training_text?repo=langdata
contains letters from two different Georgian scripts: Asomtavruli and 
Mkhedruli. Asomtavruli is an archaic script that is no longer used in modern 
written Georgian except in limited situations such as church iconography; 
Mkhedruli is the only script that is used in modern written Georgian.

Examples of Asomtavruli from kat.training_text are: 
ႱႠႫႠႰႧႠႪႨ, ႫႨႫႠႰႧႥႤႡႨ, and 
ႠႰႠႱႠႮႠႲႨႭႣ.
Examples of Mkhedruli are: საქართველო, 
რეჟისორი, and სხვადასხვა

More precisely, Asomtavruli occupies Unicode code points U+10A0 through U+10CD. 
Mkhedruli is U+10D0 through U+10F0 (excluding archaic letters and extra 
letters from other languages that use the Mkhedruli script). See 
http://www.unicode.org/charts/PDF/U10A0.pdf
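
As a rough illustration of the ranges above, here is a minimal Python sketch that reports which of the two scripts a string contains. The function names are hypothetical, and the ranges are simply the ones quoted in this post; the Unicode chart remains the authority on which code points are actually assigned.

```python
# Ranges as quoted in the post above (inclusive). These are an approximation:
# check the Unicode chart for the exact set of assigned code points.
ASOMTAVRULI = range(0x10A0, 0x10CD + 1)
MKHEDRULI = range(0x10D0, 0x10F0 + 1)


def georgian_script(ch):
    """Return 'asomtavruli', 'mkhedruli', or 'other' for a single character."""
    cp = ord(ch)
    if cp in ASOMTAVRULI:
        return "asomtavruli"
    if cp in MKHEDRULI:
        return "mkhedruli"
    return "other"


def scripts_in(text):
    """Collect the Georgian scripts that occur anywhere in a string."""
    return {georgian_script(ch) for ch in text} - {"other"}
```

Running `scripts_in` over each line of kat.training_text would be one way to find the lines that mix in Asomtavruli.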

Despite the description of the Asomtavruli alphabet as "capital" letters, these 
letters are not used in modern written Georgian, which is caseless. I suggest 
removing the Asomtavruli letters or switching to a different training text 
which doesn't include them, since their inclusion may reduce accuracy in 
Georgian for most texts.

I'm happy to help generate or edit any training files necessary. I have 
previously trained Tesseract on Georgian, and my training files are available at 
https://dl.dropboxusercontent.com/u/11840441/kat_train.zip . One known issue 
with that training data is that the numeral 4 is missing, as well as a lot of 
punctuation. However, the results are acceptable otherwise, and it contains 
some training files that are currently missing from the repo, such as 
unicharambigs. The text I used is not in the public domain; a good source for 
public-domain Georgian text is The Knight in the Panther's Skin, available 
here: http://www.georgianweb.com/language/geo/shota/shesavali.html

A final note about Georgian: there are two letters whose shapes differ 
drastically between fonts. These are ლ and ჯ, whose alternate forms are shown at 
http://en.wikipedia.org/wiki/File:Lasi_(other_form).svg and 
http://en.wikipedia.org/wiki/File:ჯ_(other_form).png , respectively. 
Tesseract needs to be able to recognize these forms as well.

Let me know what I can do to help; I'd love to see high-quality Georgian 
recognition in Tesseract!

Original issue reported on code.google.com by doh...@gmail.com on 8 Nov 2014 at 4:11

GoogleCodeExporter commented 9 years ago
Thanks for the detailed information. This is very helpful and explains why my 
training has been failing for Georgian recently.
It is easy to split the training data automatically into two separate 
languages, kat and kat_old, using the Unicode ranges that you gave.
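
A minimal sketch of what such a split could look like, in Python. This is not the actual langdata tooling; `split_training_text` is a hypothetical helper, and splitting line by line (any line containing a character in 10A0-10CD goes to kat_old) is an assumption made for illustration.

```python
# Asomtavruli range as quoted earlier in this thread (inclusive).
ASOMTAVRULI_FIRST = 0x10A0
ASOMTAVRULI_LAST = 0x10CD


def has_asomtavruli(line):
    """True if any character of the line falls in the Asomtavruli range."""
    return any(ASOMTAVRULI_FIRST <= ord(ch) <= ASOMTAVRULI_LAST for ch in line)


def split_training_text(lines):
    """Split lines into (kat, kat_old) lists.

    Assumption: a line with at least one Asomtavruli character belongs
    entirely to kat_old; every other line stays in kat.
    """
    kat, kat_old = [], []
    for line in lines:
        (kat_old if has_asomtavruli(line) else kat).append(line)
    return kat, kat_old
```

A word-level split would keep more of the modern text in kat when a line mixes both scripts, at the cost of breaking up sentences.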

Looking at the Unicode chart raises some more questions, though:
What digit characters should modern Georgian include? The Latin digits perhaps?
What punctuation characters and other non-letters? (Only one, 10fb, is 
explicitly listed for Georgian.)

Same questions for "old" Georgian.
Presumably old Georgian should include the lowercase letters in the supplement 
at 2d00-2d2f.
Also, is it worth adding in the archaic letters 10f1-10f6?

Provided the fonts cover them (and I have quite a lot of fonts), the two 
different letter shapes should be no problem.

Original comment by theraysm...@gmail.com on 9 Nov 2014 at 6:49

GoogleCodeExporter commented 9 years ago
I'll try to answer your questions as best I can based on my experience with 
Georgian (I lived in Georgia for three years):

First though, some background: I'm not a linguist, but my understanding is that 
"modern" Georgian was standardized in the late 19th century by Ilia 
Chavchavadze. Therefore, written Georgian can be roughly classified into three 
groups: 1) Modern, Post-Chavchavadze (Mkhedruli), 2) Old, Pre-Chavchavadze 
(Mkhedruli), and 3) Archaic (Asomtavruli / Nuskhuri). I'll reference these 
groups when answering the questions below.

>> What digit characters should modern Georgian include? The Latin digits perhaps?
Definitely Latin. Although Georgian used letters to represent numbers in the 
past (groups 2 and 3), and most Georgians probably learned the numeric meanings 
in school, modern Georgian (group 1) uses Latin numerals. I downloaded a scan 
of an 1877 newspaper published by Chavchavadze from the Georgian National 
Library's website and confirmed that it uses Latin numerals. I have scans of 
Georgian manuscripts which don't contain Latin numerals (although I haven't 
read them closely enough to see whether they use Georgian letters for the 
numerals).

>> What punctuation characters and other non-letters?
This is for modern Georgian only (group 1): Georgian uses drop quotes for 
quoting, e.g. „ივერია“. Roman numerals are frequently used for 
ordinal numbers, so it would be good to include at least I, V, and X. The 
character № is often used as an abbreviation for "number"; this appears to be 
an import from Russian. Otherwise, punctuation in the modern era seems to be 
roughly in line with English / other Western languages.

>> Also is it worth adding in the archaic letters 10f1-10f6?
For modern Georgian (group 1) probably not; Chavchavadze's newspaper doesn't 
use them.

>> Same questions for "old" Georgian.
Group 2 (pre-modern Mkhedruli):
>> Digit characters?
Georgian letters used as numerals; no Latin numerals.

>> Punctuation?
10fb, and the manuscripts I have use ":" or "·" (00b7 - middle dot) as word 
separators, and I see some commas too.

>> Archaic letters 10f1-10f6?
Yes.

Group 3:
>> Digits?
Georgian letters used as numerals; no Latin numerals.

>> Punctuation?
10fb, ":" and "·" for word separators.

>> Presumably old Georgian should include the lowercase letters in the supplement at 2d00-2d2f
Correct.
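
To make the answers above easier to act on, here is a rough Python sketch that collects them into one character set per group. The names (`CHARSETS`, `cps`, the group keys) are hypothetical, the ranges are the ones quoted in this thread, and ordinary Western punctuation for group 1 is deliberately left out and would need to be added.

```python
def cps(first, last):
    """Set of characters for an inclusive code point range."""
    return {chr(cp) for cp in range(first, last + 1)}


MKHEDRULI = cps(0x10D0, 0x10F0)          # modern letters
ARCHAIC_MKHEDRULI = cps(0x10F1, 0x10F6)  # archaic letters asked about above
ASOMTAVRULI = cps(0x10A0, 0x10CD)
NUSKHURI = cps(0x2D00, 0x2D2F)           # lowercase supplement
OLD_PUNCT = {"\u10fb", ":", "\u00b7"}    # 10fb, colon, middle dot

CHARSETS = {
    # Group 1: Latin digits, Roman numerals I/V/X, low-high quotes, numero sign.
    "group1_modern": MKHEDRULI | set("0123456789") | set("IVX")
    | {"\u201e", "\u201c", "\u2116"},
    # Group 2: letters double as numerals, so no Latin digits; some commas seen.
    "group2_old": MKHEDRULI | ARCHAIC_MKHEDRULI | OLD_PUNCT | {","},
    # Group 3: Asomtavruli / Nuskhuri with the same word separators.
    "group3_archaic": ASOMTAVRULI | NUSKHURI | OLD_PUNCT,
}
```

The ranges include a few unassigned code points, so a real unicharset would want to filter against the Unicode charts or the fonts in use.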

Having said all that, it seems like there are a few possibilities for how to 
organize tesseract training:
1) Three trainings: kat, kat_early, kat_old corresponding to Group 1, 2, and 3 
respectively. This would probably give the best accuracy across all groups, but 
kat_early and kat_old would probably go unused by the majority of users.
2) Two trainings: kat, and kat_old, corresponding to Group 1, and (Groups 2 + 
3) respectively. This would reduce the number of trainings, provide the best 
accuracy on modern text, but probably reduce accuracy for older texts.
3) Two trainings: kat, and kat_old, corresponding to (Groups 1 + 2) and Group 
3, respectively. This might slightly reduce accuracy on modern text, but would 
bring the benefit of being able to recognize texts written prior to the 1870s.

I don't have much Georgia-specific knowledge to bring to bear on this question, 
except to say that from a user experience perspective, someone trying to 
recognize Georgian texts from the last 130+ years would be surprised to see 
archaic letters showing up if Tesseract mis-recognized something. So my 
preference would be to keep all archaic letters out of the main kat training. A 
kat_old that included all three scripts with all archaic letters would probably 
be the most useful for scholars dealing with old documents, since they would be 
more likely to be dealing with hand-written or degraded documents where custom 
training of Tesseract would be useful. kat_old would basically just serve as a 
decent first pass that would allow such users to generate custom models for 
their specific corpus.

I'm attaching some examples of each type of script.

Original comment by doh...@gmail.com on 9 Nov 2014 at 4:41
