Open GoogleCodeExporter opened 9 years ago
Thanks for the detailed information. This is very helpful and explains why my
training has been failing for Georgian recently.
It is easy to split the training data automatically into two separate
languages, kat and kat_old, using the Unicode ranges that you gave.
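A minimal sketch of such a split, for illustration only. The ranges below are
taken from the Unicode charts rather than from this thread, and the rule that
any archaic, Asomtavruli, or Nuskhuri letter sends a line to kat_old is an
assumption; classify_line and split_training_text are hypothetical helper
names, not part of Tesseract.

```python
# Assumed code point ranges, read off the Unicode charts (not from this thread).
MKHEDRULI_MODERN = range(0x10D0, 0x10F1)   # U+10D0..U+10F0, the 33 modern letters
MKHEDRULI_ARCHAIC = range(0x10F1, 0x10F7)  # U+10F1..U+10F6, archaic letters
ASOMTAVRULI = range(0x10A0, 0x10C6)        # U+10A0..U+10C5
NUSKHURI_BLOCK = range(0x2D00, 0x2D30)     # Georgian Supplement, U+2D00..U+2D2F

# Assumption: any of these marks a line as "old" Georgian.
OLD_RANGES = (MKHEDRULI_ARCHAIC, ASOMTAVRULI, NUSKHURI_BLOCK)


def is_old_letter(ch):
    """True if ch falls in one of the ranges assumed to mean kat_old."""
    cp = ord(ch)
    return any(cp in r for r in OLD_RANGES)


def classify_line(line):
    """Return 'kat_old', 'kat', or None (no Georgian letters) for one line."""
    if any(is_old_letter(ch) for ch in line):
        return "kat_old"
    if any(ord(ch) in MKHEDRULI_MODERN for ch in line):
        return "kat"
    return None


def split_training_text(lines):
    """Sort lines of training text into per-language buckets."""
    buckets = {"kat": [], "kat_old": []}
    for line in lines:
        label = classify_line(line)
        if label is not None:
            buckets[label].append(line)
    return buckets
```

Lines with no Georgian letters at all are dropped here; whether they should be
kept for one or both languages is a separate decision.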
Looking at the Unicode chart, though, raises some more questions:
What digit characters should modern Georgian include? The Latin digits perhaps?
What punctuation characters and other non-letters? (Only one, 10fb, is
explicitly listed for Georgian.)
Same questions for "old" Georgian.
Presumably old Georgian should include the lowercase letters in the supplement
at 2d00-2d2f?
Also, is it worth adding in the archaic letters 10f1-10f6?
Provided the fonts cover them (and I have quite a lot of fonts), the two
different shapes should be no problem.
Original comment by theraysm...@gmail.com
on 9 Nov 2014 at 6:49
I'll try to answer your questions as best I can based on my experience with
Georgian (I lived in Georgia for three years):
First though, some background: I'm not a linguist, but my understanding is that
"modern" Georgian was standardized in the late 19th century by Ilia
Chavchavadze. Therefore, written Georgian can be roughly classified into three
groups: 1) Modern, Post-Chavchavadze (Mkhedruli), 2) Old, Pre-Chavchavadze
(Mkhedruli), and 3) Archaic (Asomtavruli / Nuskhuri). I'll refer to these
groups when answering the questions below.
>> What digit characters should modern Georgian include? The Latin digits
perhaps?
Definitely Latin. Although Georgian used letters to represent numbers in the
past (groups 2 and 3), and most Georgians probably learned the numeric meanings
in school, modern Georgian (group 1) uses Latin numerals. I downloaded a scan
of an 1877 newspaper published by Chavchavadze from the Georgian National
Library's website and confirmed that it uses Latin numerals. I have scans of
Georgian manuscripts which don't contain Latin numerals (although I haven't
read them closely enough to see whether they use Georgian letters for the
numerals).
>> What punctuation characters and other non-letters?
This is for modern Georgian only (group 1): Georgian uses drop quotes for
quoting, e.g. „ივერია“. Roman numerals are frequently used for
ordinal numbers, so it would be good to include at least I, V, and X. The
character № is often used as an abbreviation for "number"; this appears to be
an import from Russian. Otherwise, punctuation in the modern era seems to be
roughly in line with English / other Western languages.
>> Also is it worth adding in the archaic letters 10f1-10f6?
For modern Georgian (group 1) probably not; Chavchavadze's newspaper doesn't
use them.
>> Same questions for "old" Georgian.
Group 2 (pre-modern Mkhedruli):
>> Digit characters?
Georgian letters used as numerals; no Latin numerals.
>> Punctuation?
10fb, and the manuscripts I have use ":" or "·" (00b7 - middle dot) as word
separators, and I see some commas too.
>> Archaic letters 10f1-10f6?
Yes.
Group 3:
>> Digits?
Georgian letters used as numerals; no Latin numerals.
>> Punctuation?
10fb, ":" and "·" for word separators.
>> Presumably old Georgian should include the lowercase letters in the
supplement at 2d00-2d2f
Correct.
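The answers above can be summarized as one character inventory per group. This
is a rough sketch only: the group names and the unexpected_chars helper are
hypothetical, the Asomtavruli range 10a0-10c5 and the modern Mkhedruli range
10d0-10f0 come from the Unicode charts rather than from this thread, and
ordinary Western punctuation for group 1 is left out because it was not
enumerated.

```python
def chars(first, last):
    """Set of characters for an inclusive code point range."""
    return {chr(cp) for cp in range(first, last + 1)}


# Assumption: U+10D0..U+10F0 are the modern Mkhedruli letters (Unicode chart).
MODERN_LETTERS = chars(0x10D0, 0x10F0)

GROUPS = {
    # Group 1: Latin digits, Roman numerals I/V/X, low/high quotes, numero sign.
    "group1_modern": MODERN_LETTERS
    | set("0123456789")
    | set("IVX")
    | {"\u201e", "\u201c", "\u2116"},
    # Group 2: archaic letters 10f1-10f6, 10fb, colon, middle dot, comma.
    "group2_old_mkhedruli": MODERN_LETTERS
    | chars(0x10F1, 0x10F6)
    | {"\u10fb", ":", "\u00b7", ","},
    # Group 3: Asomtavruli (assumed 10a0-10c5) plus the supplement 2d00-2d2f.
    "group3_archaic": chars(0x10A0, 0x10C5)
    | chars(0x2D00, 0x2D2F)
    | {"\u10fb", ":", "\u00b7"},
}


def unexpected_chars(text, group):
    """Return the sorted non-space characters of text not allowed in group."""
    allowed = GROUPS[group]
    return sorted({ch for ch in text if ch not in allowed and not ch.isspace()})
```

A check like this could flag ground-truth text that contains characters outside
the intended set for a given training, for example an archaic letter in text
meant for modern kat.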
Having said all that, it seems like there are a few possibilities for how to
organize tesseract training:
1) Three trainings: kat, kat_early, and kat_old, corresponding to Groups 1, 2,
and 3 respectively. This would probably give the best accuracy across all
groups, but kat_early and kat_old would probably go unused by the majority of
users.
2) Two trainings: kat and kat_old, corresponding to Group 1 and (Groups 2 + 3)
respectively. This would reduce the number of trainings and provide the best
accuracy on modern text, but would probably reduce accuracy for older texts.
3) Two trainings: kat and kat_old, corresponding to (Groups 1 + 2) and Group 3
respectively. This might slightly reduce accuracy on modern text, but would
bring the benefit of being able to recognize texts written prior to the 1870s.
I don't have much Georgia-specific knowledge to bring to bear on this question,
except to say that from a user experience perspective, someone trying to
recognize Georgian texts from the last 130+ years would be surprised to see
archaic letters showing up if Tesseract mis-recognized something. So my
preference would be to keep all archaic letters out of the main kat training. A
kat_old that included all three scripts with all archaic letters would probably
be the most useful for scholars dealing with old documents, since they would be
more likely to be dealing with hand-written or degraded documents where custom
training of Tesseract would be useful. kat_old would basically just serve as a
decent first pass that would allow such users to generate custom models for
their specific corpus.
I'm attaching some examples of each type of script.
Original comment by doh...@gmail.com
on 9 Nov 2014 at 4:41
Original issue reported on code.google.com by
doh...@gmail.com
on 8 Nov 2014 at 4:11