Need Indic scripts experts to review cleanup code

Shreeshrii commented 7 years ago

If you can help for a particular script, please comment below.

Comments from Ray, copied from https://github.com/tesseract-ocr/tesseract/issues/995. Read that thread for full context.

It would be useful to have experts in any of the following scripts review the new corpus cleanup code and comment on it:

Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Myanmar, Khmer.

There are script-specific cleanup rules in there. I plan to commit new copies of the training data (unicharsets, wordlists, training text, etc.), and at that point the data and the cleanup rules will match.

E.g., the code determines what makes a valid or invalid sequence of Unicode code points in the script. For instance, is it allowed to have two matras in a row? It gets more difficult with questions about which category the additional characters belong to.
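
To make the "two matras in a row" question concrete, here is a minimal sketch of such a check. The helper names are hypothetical and this is not the code in training/validate_indic.cpp. It assumes that U+093E..U+094C covers the main Devanagari dependent vowel signs, and it only detects the pattern; whether the pattern is actually invalid is exactly what script experts are being asked.

```cpp
#include <cstddef>
#include <string>

// Assumption: U+093E..U+094C are treated as the Devanagari matras.
// Other dependent signs outside this range are ignored in this sketch.
bool IsDevanagariMatra(char32_t ch) { return 0x093e <= ch && ch <= 0x094c; }

// Hypothetical helper: reports whether any two adjacent code points are
// both matras. It does not decide whether that is valid in the script.
bool HasConsecutiveMatras(const std::u32string& text) {
  for (std::size_t i = 1; i < text.size(); ++i) {
    if (IsDevanagariMatra(text[i - 1]) && IsDevanagariMatra(text[i])) {
      return true;
    }
  }
  return false;
}
```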

Major new normalization/text cleanup code is in training/validat*. The best help with this would be expertise in the various scripts, as previously discussed.

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_indic.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_khmer.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_myanmar.h

https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.cpp https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.h

Shreeshrii commented 7 years ago

Devanagari - Vedic Accents

https://github.com/tesseract-ocr/tesseract/blob/master/training/validator.cpp#L178

bool Validator::IsVedicAccent(char32 unicode) {
  return 0x1cd0 <= unicode && unicode < 0x1d00;
}

Please see

The following should also be included as Vedic Accents.

U+0951..U+0954 are a set of combining marks used in transcription of Sanskrit texts.

Vedic tone marks
0951 $॑ DEVANAGARI STRESS SIGN UDATTA
= Vedic tone svarita
• mostly used for svarita, with rare use for udatta
• used also in Vedic texts written in other scripts
→ 1CDA $᳚  vedic tone double svarita
0952 $॒ DEVANAGARI STRESS SIGN ANUDATTA
= Vedic tone anudatta
• used also in Vedic texts written in other scripts
→ 1CDC $᳜  vedic tone kathaka anudatta

Possibly also

Accent marks
0953 $॓ DEVANAGARI GRAVE ACCENT
→ 0300 $̀  combining grave accent
0954 $॔ DEVANAGARI ACUTE ACCENT
→ 0301 $́  combining acute accent

Devanagari Extended: U+A8E0–U+A8FF This block of characters is used chiefly for Vedic Sanskrit, although many of the characters are generic and can be used by other Indic scripts. The block includes a set of combining digits, letters, and avagraha which is used as a system of cantillation marks in the early Vedic Sanskrit texts. The Devanagari Extended block also includes nasalization marks (candrabindu), and a number of editorial marks.

Also include the ranges

A8E0-A8F1 Combining Marks
A8F2-A8F7 Marks of Nasalization
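
A minimal sketch of what the widened check could look like, using only the ranges proposed in this comment. The function name is hypothetical and this is not a patch against validator.cpp. Whether all of these characters belong in one class (for example, the nasalization marks) is an open question for reviewers.

```cpp
// Hypothetical stand-in for Validator::IsVedicAccent, not the Tesseract source.
// All ranges below are the ones listed in this comment.
constexpr bool IsVedicAccentExtended(char32_t unicode) {
  return (0x1cd0 <= unicode && unicode < 0x1d00) ||   // existing check
         (0x0951 <= unicode && unicode <= 0x0954) ||  // stress signs and accents
         (0xa8e0 <= unicode && unicode <= 0xa8f1) ||  // combining marks
         (0xa8f2 <= unicode && unicode <= 0xa8f7);    // marks of nasalization
}
```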

Shreeshrii commented 7 years ago

Devanagari - Words cannot begin with

Various signs
0900 $ऀ DEVANAGARI SIGN INVERTED CANDRABINDU
= vaidika adhomukha candrabindu
0901 $ँ DEVANAGARI SIGN CANDRABINDU
= anunasika
→ 0310 $̐  combining candrabindu
0902 $ं DEVANAGARI SIGN ANUSVARA
= bindu
0903 $ः DEVANAGARI SIGN VISARGA

Various signs
093C $़ DEVANAGARI SIGN NUKTA
• for extending the alphabet to new letters
093D ऽ DEVANAGARI SIGN AVAGRAHA

and the various dependent vowel signs.

Most of these may be covered by https://github.com/tesseract-ocr/tesseract/blob/master/training/validate_grapheme.cpp#L55

  if (char_type == U_NON_SPACING_MARK || char_type == U_ENCLOSING_MARK ||
      char_type == U_COMBINING_SPACING_MARK || ch == kZeroWidthNonJoiner ||
      ch == kZeroWidthJoiner)
    return CharClass::kCombiner;

Please check about Avagraha - 093D.
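
Avagraha (U+093D) has Unicode general category Lo (a letter), so a test based only on the mark categories would not classify it as a combiner. The sketch below shows one way to make that question explicit. The function names and the `allow_avagraha_first` flag are hypothetical, and the ranges cover only the signs listed in this comment.

```cpp
#include <string>

constexpr char32_t kDevanagariAvagraha = 0x093d;

// Hypothetical helper covering only the signs listed in this comment.
bool IsDevanagariDependentSign(char32_t ch) {
  return (0x0900 <= ch && ch <= 0x0903) ||  // candrabindus, anusvara, visarga
         ch == 0x093c ||                    // nukta
         (0x093e <= ch && ch <= 0x094c);    // main dependent vowel signs
}

// Whether avagraha may start a word is left as a caller decision, because
// that is the open question here.
bool CanStartDevanagariWord(const std::u32string& word,
                            bool allow_avagraha_first) {
  if (word.empty()) return false;
  const char32_t first = word[0];
  if (IsDevanagariDependentSign(first)) return false;
  if (first == kDevanagariAvagraha && !allow_avagraha_first) return false;
  return true;
}
```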

Shreeshrii commented 7 years ago

Devanagari - Eyelash Ra for Marathi

R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonant RA_d precedes zero width joiner, then the half-consonant form RA_h, depicted as eyelash-RA, is used instead of RA_sup.

Page 13 in http://www.unicode.org/versions/Unicode10.0.0/ch12.pdf

Removal of ZWJ in this case will lead to incorrect results.
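
As an illustration only, a cleanup pass could keep the ZWJ in the RA + virama + ZWJ sequence described by rule R5a and drop it elsewhere. The function name is hypothetical and this is not how the Tesseract validator is written. Whether other uses of ZWJ should also survive is outside this sketch.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kDevanagariRa = 0x0930;
constexpr char32_t kDevanagariVirama = 0x094d;
constexpr char32_t kZeroWidthJoiner = 0x200d;

// Hypothetical cleanup pass: removes every ZWJ except one that directly
// follows RA + VIRAMA (the eyelash-RA sequence from rule R5a).
std::u32string StripZwjKeepEyelashRa(const std::u32string& in) {
  std::u32string out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    const bool after_ra_virama = i >= 2 && in[i - 2] == kDevanagariRa &&
                                 in[i - 1] == kDevanagariVirama;
    if (in[i] == kZeroWidthJoiner && !after_ra_virama) continue;
    out.push_back(in[i]);
  }
  return out;
}
```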

Shreeshrii commented 5 years ago

@zdenop Please label as help wanted.

MayuraVerma commented 5 years ago

@zdenop I can help with the Kannada language. Please assign me any task. I am a native Kannada speaker and have full knowledge of Unicode for the Kannada script.

zdenop commented 5 years ago

So have a look at the comments above from @Shreeshrii.

MayuraVerma commented 5 years ago

@Shreeshrii @zdenop I am writing up a couple of cases for Kannada. Please review them; if this is what you are looking for, I will compile the full list.

Ra (U+0CB0) with halant/virama (U+0CCD) forms a repha in Unicode. However, in Kannada script, repha formation follows particular rules:
a. Repha can't be formed when Ra+halant is at the beginning of the word (in other words, when it is not preceded by another Kannada letter). Example: ರ‍್ಯಂಕ್
b. Repha is not formed at the end of the word or with just Ra+halant. Example: ಆರ್ , ರ್
c. Repha is not formed when it is followed by Ra. Example: ರ‍್ರ + any vowel sign
d. The user can avoid the repha form. Example: ಸೂರ್ಯ ಸೂರ‍್ಯ
All printed Kannada text follows the above rules.

In Unicode:
a. This is achieved by adding ZWJ (U+200D) between Ra and halant to stop the repha formation.
b. This is achieved by the shaping engine; no special characters are added. Ra+halant
c. This is achieved by the font; no special characters are added. Ra+halant+Ra + any vowel sign
d. This is achieved by adding ZWJ (U+200D) between Ra and halant to stop the repha formation.

For OCR, to map the printed text to Unicode correctly:
a. When the printed text has Ra with a subscript (consonant conjunct), the output shall be Ra + ZWJ + halant + any consonant but Ra.
b. Ra + halant: no special characters are to be inserted.
c. When the printed text has Ra with Ra as its subscript (consonant conjunct), the output shall be Ra + halant + Ra + any vowel sign.
d. Same as case "a" above: Ra + ZWJ + halant + any consonant but Ra.
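
The output rules for cases a, c and d can be sketched as a small encoder. This is a hypothetical helper for illustration, not part of Tesseract. It assumes the recognizer already knows that the printed form is Ra carrying a subscript consonant rather than a repha.

```cpp
#include <string>

constexpr char32_t kKannadaRa = 0x0cb0;
constexpr char32_t kKannadaHalant = 0x0ccd;
constexpr char32_t kZeroWidthJoiner = 0x200d;

// Hypothetical encoder for printed Ra with `subscript` below it (no repha).
// Cases a and d: Ra + ZWJ + halant + consonant. Case c: Ra + halant + Ra.
std::u32string EncodeRaWithSubscript(char32_t subscript) {
  std::u32string out;
  out.push_back(kKannadaRa);
  if (subscript != kKannadaRa) {
    out.push_back(kZeroWidthJoiner);  // stops the repha formation
  }
  out.push_back(kKannadaHalant);
  out.push_back(subscript);
  return out;
}
```

For example, passing U+0CAF (Ya) yields U+0CB0 U+200D U+0CCD U+0CAF, while passing U+0CB0 (Ra) yields U+0CB0 U+0CCD U+0CB0.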

Usage of zero width non-joiner, ZWNJ (U+200C)
Text: ರಾಜ್‌ಕುಮಾರ್
Unicode: U+0CB0 U+0CBE U+0C9C U+0CCD U+200C U+0C95 U+0CC1 U+0CAE U+0CBE U+0CB0 U+0CCD
Exception: In the sequence "any Kannada consonant" "halant (U+0CCD)" "any Kannada consonant", ZWNJ (U+200C) is added after the halant (U+0CCD) to avoid a subscript (consonant conjunct).
If ZWNJ is not present, the above text will be ರಾಜ್ಕುಮಾರ್.
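
A minimal sketch of a check for this exception, so that a cleanup pass could recognize a meaningful ZWNJ instead of stripping it. The helper names are hypothetical. As a simplifying assumption, U+0C95..U+0CB9 is treated as the set of Kannada consonants, and a nukta between the consonant and the halant is not handled.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kKannadaHalant = 0x0ccd;
constexpr char32_t kZeroWidthNonJoiner = 0x200c;

// Simplifying assumption: U+0C95..U+0CB9 stands in for the Kannada consonants.
bool IsKannadaConsonant(char32_t ch) { return 0x0c95 <= ch && ch <= 0x0cb9; }

// Hypothetical check: true if s[i] is a ZWNJ in the pattern
// consonant + halant + ZWNJ + consonant.
bool IsConjunctBlockingZwnj(const std::u32string& s, std::size_t i) {
  return i >= 2 && i + 1 < s.size() && s[i] == kZeroWidthNonJoiner &&
         IsKannadaConsonant(s[i - 2]) && s[i - 1] == kKannadaHalant &&
         IsKannadaConsonant(s[i + 1]);
}
```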

Nukta (U+0CBC)

Text: ಜಫ಼ಾರ್ ಜಫ಼್ಲರ್ ಜಫ್ಫ಼ರ್ ಜ಼ಾರ್ ಜ್ಜ಼ಾರ್

Nukta can be with the base letter or the subscript. In Unicode, nukta is always placed immediately after the consonant, before the vowel matra or halant. When the nukta is on the subscript, it is placed after the consonant that forms the subscript.
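
The placement rule can be sketched as a simple validity check. The helper names are hypothetical, and U+0C95..U+0CB9 is again assumed to stand in for the Kannada consonants.

```cpp
#include <cstddef>
#include <string>

constexpr char32_t kKannadaNukta = 0x0cbc;

// Simplifying assumption: U+0C95..U+0CB9 stands in for the Kannada consonants.
bool IsKannadaConsonantForNukta(char32_t ch) {
  return 0x0c95 <= ch && ch <= 0x0cb9;
}

// Hypothetical check: every nukta must directly follow a consonant, whether
// that consonant is the base letter or the one forming the subscript.
bool NuktaFollowsConsonant(const std::u32string& s) {
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] != kKannadaNukta) continue;
    if (i == 0 || !IsKannadaConsonantForNukta(s[i - 1])) return false;
  }
  return true;
}
```

With this check, U+0C9C U+0CCD U+0C9C U+0CBC U+0CBE (nukta on the subscript) passes, while U+0CAB U+0CBE U+0CBC (nukta after the vowel sign) is rejected.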

Shreeshrii commented 5 years ago

@MayuraVerma Thanks for your detailed input.

Please see the rules at https://github.com/tesseract-ocr/tesseract/blob/7cc97c25ca32a9e8e7e991587064abae51b22f65/src/training/validate_indic.cpp#L96

and check whether they are OK for Kannada.
