Open Shreeshrii opened 7 years ago
Devanagari - Vedic Accents
https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.cpp#L178
bool Validator::IsVedicAccent(char32 unicode) {
  return 0x1cd0 <= unicode && unicode < 0x1d00;
}
Please see
The following should also be included as Vedic Accents.
U+0951..U+0954 are a set of combining marks used in transcription of Sanskrit texts.
Vedic tone marks
0951 $॑ DEVANAGARI STRESS SIGN UDATTA
= Vedic tone svarita
• mostly used for svarita, with rare use for udatta
• used also in Vedic texts written in other scripts
→ 1CDA $᳚ vedic tone double svarita
0952 $॒ DEVANAGARI STRESS SIGN ANUDATTA
= Vedic tone anudatta
• used also in Vedic texts written in other scripts
→ 1CDC $᳜ vedic tone kathaka anudatta
Possibly also
Accent marks
0953 $॓ DEVANAGARI GRAVE ACCENT
→ 0300 $̀ combining grave accent
0954 $॔ DEVANAGARI ACUTE ACCENT
→ 0301 $́ combining acute accent
Devanagari Extended: U+A8E0–U+A8FF This block of characters is used chiefly for Vedic Sanskrit, although many of the characters are generic and can be used by other Indic scripts. The block includes a set of combining digits, letters, and avagraha which is used as a system of cantillation marks in the early Vedic Sanskrit texts. The Devanagari Extended block also includes nasalization marks (candrabindu), and a number of editorial marks.
Also include the ranges
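For discussion, here is a minimal sketch of how the check could be widened to cover the ranges proposed above. It is not the project's implementation: `IsVedicAccentSketch`, `CodeRange`, and `kVedicAccentRanges` are hypothetical names, `char32` is a stand-in typedef, and limiting Devanagari Extended to the combining marks U+A8E0..U+A8F1 is an assumption (the rest of the block holds non-combining signs and editorial marks, so the exact cut-off is open to review).

```cpp
#include <cstdint>

// Assumption: stand-in for the char32 type used in the quoted Validator code.
using char32 = std::int32_t;

// Inclusive range of code points (hypothetical helper type).
struct CodeRange {
  char32 first;
  char32 last;
};

constexpr CodeRange kVedicAccentRanges[] = {
    {0x0951, 0x0954},  // Devanagari stress signs and accent marks
    {0x1CD0, 0x1CFF},  // Vedic Extensions, same span as the existing check
    {0xA8E0, 0xA8F1},  // assumed: combining marks of Devanagari Extended
};

// Returns true if `unicode` falls in any of the ranges listed above.
bool IsVedicAccentSketch(char32 unicode) {
  for (const CodeRange& range : kVedicAccentRanges) {
    if (range.first <= unicode && unicode <= range.last) return true;
  }
  return false;
}
```

Keeping the ranges in a table makes it easy to add or drop a range once the script experts agree on the list.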
Devanagari - Words cannot begin with
Various signs
0900 $ऀ DEVANAGARI SIGN INVERTED CANDRABINDU
= vaidika adhomukha candrabindu
0901 $ँ DEVANAGARI SIGN CANDRABINDU
= anunasika
→ 0310 $̐ combining candrabindu
0902 $ं DEVANAGARI SIGN ANUSVARA
= bindu
0903 $ः DEVANAGARI SIGN VISARGA
Various signs
093C $़ DEVANAGARI SIGN NUKTA
• for extending the alphabet to new letters
093D ऽ DEVANAGARI SIGN AVAGRAHA
and the various dependent vowel signs.
Most of these may be covered by https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.cpp#L55
if (char_type == U_NON_SPACING_MARK || char_type == U_ENCLOSING_MARK ||
    char_type == U_COMBINING_SPACING_MARK || ch == kZeroWidthNonJoiner ||
    ch == kZeroWidthJoiner)
  return CharClass::kCombiner;
Please check about Avagraha - 093D.
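To make the Avagraha question concrete: U+093D has the Unicode general category Lo (a letter, not a mark), so a check based only on mark categories would not flag it. The sketch below hard-codes the code points listed above and leaves the Avagraha decision to the caller. `CannotBeginDevanagariWordSketch` and its flag are hypothetical names, `char32` is a stand-in typedef, and U+093E..U+094C is assumed to be the span of dependent vowel signs meant here.

```cpp
#include <cstdint>

// Assumption: stand-in for the char32 type used in the training code.
using char32 = std::int32_t;

constexpr char32 kDevanagariNukta = 0x093C;
constexpr char32 kDevanagariAvagraha = 0x093D;

// Returns true if `ch` should not appear as the first character of a word.
// Whether Avagraha belongs in this set is the open question, so it is a flag.
bool CannotBeginDevanagariWordSketch(char32 ch,
                                     bool treat_avagraha_as_non_initial) {
  if (0x0900 <= ch && ch <= 0x0903) return true;  // candrabindu .. visarga
  if (ch == kDevanagariNukta) return true;
  if (0x093E <= ch && ch <= 0x094C) return true;  // dependent vowel signs
  if (ch == kDevanagariAvagraha) return treat_avagraha_as_non_initial;
  return false;
}
```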
Devanagari - Eyelash Ra for Marathi
R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonant
RAd precedes zero width joiner, then the half-consonant form RAh , depicted as
eyelash-RA, is used instead of RAsup .
Page 13 in http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf
Removal of ZWJ in this case will lead to incorrect results.
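As an illustration of the concern, here is a hedged sketch of a cleanup pass that drops ZWJ everywhere except directly after a dead RA (RA + virama), where it selects the eyelash form. `StripZwjKeepEyelashRa` is a hypothetical function, not something in the Tesseract sources, and the blanket "strip all other ZWJ" policy is assumed only to show the contrast.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kDevanagariRa = 0x0930;
constexpr char32_t kDevanagariVirama = 0x094D;
constexpr char32_t kZeroWidthJoinerSketch = 0x200D;

// Removes ZWJ from `in`, except where it follows RA + virama (eyelash RA).
std::u32string StripZwjKeepEyelashRa(const std::u32string& in) {
  std::u32string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == kZeroWidthJoinerSketch) {
      const bool after_dead_ra = i >= 2 && in[i - 2] == kDevanagariRa &&
                                 in[i - 1] == kDevanagariVirama;
      if (!after_dead_ra) continue;  // drop a ZWJ that is not part of eyelash RA
    }
    out.push_back(in[i]);
  }
  return out;
}
```

With this rule, RA + virama + ZWJ + YA passes through unchanged, while a ZWJ after KA + virama is removed.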
@zdenop Please label as help wanted
@zdenop I can support the Kannada language. Please assign me any task. I am a native Kannada speaker and have thorough knowledge of Unicode for the Kannada script.
so have a look at above from @Shreeshrii
@Shreeshrii @zdenop I am writing up a couple of cases for Kannada. Please review them; if this is what you are looking for, I will compile the full list.
In Unicode:
a. This is achieved by adding ZWJ (U+200D) between Ra and halant to stop the repha formation.
b. This is achieved by the shaping engine; no special characters are added. Ra + halant
c. This is achieved by the font; no special characters are added. Ra + halant + Ra + any vowel sign
d. This is achieved by adding ZWJ (U+200D) between Ra and halant to stop the repha formation.
For OCR, to map the printed text to Unicode correctly:
a. When the printed text has Ra with a subscript (consonant conjunct), the output shall be Ra + ZWJ + halant + any consonant but Ra.
b. Ra + halant; no special characters are to be inserted.
c. When the printed text has Ra with Ra as its subscript (consonant conjunct), the output shall be Ra + halant + Ra + any vowel sign.
d. Same as case "a": Ra + ZWJ + halant + any consonant but Ra.
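A small sketch of how cases "a" and "d" might be recognised in a code point sequence. `IsRaZwjConjunctAt` and `IsKannadaConsonantSketch` are hypothetical helpers, and treating U+0C95..U+0CB9 as the consonant range is an approximation assumed for illustration.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kKannadaRa = 0x0CB0;
constexpr char32_t kKannadaHalant = 0x0CCD;
constexpr char32_t kZwj = 0x200D;

// Assumption: U+0C95..U+0CB9 approximates the Kannada consonants.
bool IsKannadaConsonantSketch(char32_t c) {
  return 0x0C95 <= c && c <= 0x0CB9;
}

// True if text[i..i+3] is Ra + ZWJ + halant + a consonant other than Ra.
bool IsRaZwjConjunctAt(const std::u32string& text, std::size_t i) {
  return i + 3 < text.size() && text[i] == kKannadaRa &&
         text[i + 1] == kZwj && text[i + 2] == kKannadaHalant &&
         IsKannadaConsonantSketch(text[i + 3]) &&
         text[i + 3] != kKannadaRa;
}
```

A validator could use a predicate like this to accept ZWJ in exactly this position instead of rejecting or stripping it.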
Usage of zero width non-joiner, ZWNJ (U+200C)
Text: ರಾಜ್‌ಕುಮಾರ್
Unicode: U+0CB0 U+0CBE U+0C9C U+0CCD U+200C U+0C95 U+0CC1 U+0CAE U+0CBE U+0CB0 U+0CCD
Exception: In a sequence of "any Kannada consonant", "halant (U+0CCD)", "any Kannada consonant", ZWNJ (U+200C) is added after the halant (U+0CCD) to avoid forming a subscript (consonant conjunct). If ZWNJ is not present, the above text will be rendered as ರಾಜ್ಕುಮಾರ್.
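Because ZWNJ is invisible in rendered text, it can help to build the example from its code points and test for the pattern directly. The sketch below counts consonant + halant + ZWNJ + consonant sequences. `CountExplicitHalants` and `IsKannadaConsonantSketch` are hypothetical names, and U+0C95..U+0CB9 is an assumed approximation of the consonant range.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kKannadaHalant = 0x0CCD;
constexpr char32_t kZwnj = 0x200C;

// Assumption: U+0C95..U+0CB9 approximates the Kannada consonants.
bool IsKannadaConsonantSketch(char32_t c) {
  return 0x0C95 <= c && c <= 0x0CB9;
}

// Counts occurrences of consonant + halant + ZWNJ + consonant in `text`.
std::size_t CountExplicitHalants(const std::u32string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i + 3 < text.size(); ++i) {
    if (IsKannadaConsonantSketch(text[i]) && text[i + 1] == kKannadaHalant &&
        text[i + 2] == kZwnj && IsKannadaConsonantSketch(text[i + 3])) {
      ++count;
    }
  }
  return count;
}

// The example word from the comment above, with and without the ZWNJ.
const std::u32string kWithZwnj =
    U"\u0CB0\u0CBE\u0C9C\u0CCD\u200C\u0C95\u0CC1\u0CAE\u0CBE\u0CB0\u0CCD";
const std::u32string kWithoutZwnj =
    U"\u0CB0\u0CBE\u0C9C\u0CCD\u0C95\u0CC1\u0CAE\u0CBE\u0CB0\u0CCD";
```

A cleanup rule that removes ZWNJ unconditionally would turn `kWithZwnj` into `kWithoutZwnj` and change how the word renders.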
Nukta U+0CBC
Text: ಜಫ಼ಾರ್ ಜಫ಼್ಲರ್ ಜಫ್ಫ಼ರ್ ಜ಼ಾರ್ ಜ್ಜ಼ಾರ್
Image:
Nukta can be attached to the base letter or to the subscript. In Unicode, nukta is always placed immediately after the consonant, before the vowel matra or halant. When the nukta belongs to the subscript, it is placed after the consonant that forms the subscript.
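The placement rule can be expressed as a simple check that every nukta directly follows a consonant. This is only a sketch: `NuktaPlacementLooksValid` and `IsKannadaConsonantSketch` are hypothetical names, and U+0C95..U+0CB9 is an assumed approximation of the consonant range.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kKannadaNukta = 0x0CBC;

// Assumption: U+0C95..U+0CB9 approximates the Kannada consonants.
bool IsKannadaConsonantSketch(char32_t c) {
  return 0x0C95 <= c && c <= 0x0CB9;
}

// True if every nukta in `text` immediately follows a consonant.
bool NuktaPlacementLooksValid(const std::u32string& text) {
  for (std::size_t i = 0; i < text.size(); ++i) {
    if (text[i] != kKannadaNukta) continue;
    if (i == 0 || !IsKannadaConsonantSketch(text[i - 1])) return false;
  }
  return true;
}
```

Both a nukta on the base letter (consonant + nukta + matra) and a nukta on the subscript (consonant + halant + consonant + nukta) pass this check, while a nukta placed after a matra does not.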
@MayuraVerma Thanks for your detailed input.
Please see the rules at https://github.com/tesseract-ocr/tesseract/blob/7cc97c25ca32a9e8e7e991587064abae51b22f65/src/training/validate_indic.cpp#L96
and check whether they are OK for Kannada.
If you can help for a particular script, please comment below.
Comments from Ray - copied from https://github.com/tesseract-ocr/tesseract/issues/995 Read the thread for full context.
it would be useful to have any experts in any of the following scripts review the new corpus cleanup code, and make comments:
Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer.
There are script-specific cleanup rules in there. Since I plan to commit new copies of the training data (unicharsets, wordlists, training text, etc.), at that point they will match.
E.g., the code determines what makes a valid or invalid sequence of Unicode code points in the script; for instance, is it allowed to have two matras in a row? It gets more difficult with questions over which category the additional characters belong to.
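To illustrate the kind of rule being asked about, here is a sketch that flags two dependent vowel signs in a row in Devanagari text. It is not taken from the validator sources: `HasConsecutiveMatras` and `IsDevanagariMatraSketch` are hypothetical names, U+093E..U+094C is assumed as the matra range, and whether such a sequence is really invalid is exactly what the script experts are asked to confirm.

```cpp
#include <cstddef>
#include <string>

// Assumption: U+093E..U+094C covers the Devanagari dependent vowel signs.
bool IsDevanagariMatraSketch(char32_t c) {
  return 0x093E <= c && c <= 0x094C;
}

// True if `text` contains two dependent vowel signs next to each other.
bool HasConsecutiveMatras(const std::u32string& text) {
  for (std::size_t i = 1; i < text.size(); ++i) {
    if (IsDevanagariMatraSketch(text[i - 1]) &&
        IsDevanagariMatraSketch(text[i])) {
      return true;
    }
  }
  return false;
}
```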
Major new normalization/text cleanup code in training/validat* The best help with this would be expertise in the various scripts, as previously discussed.
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.h
https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.h