tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add U+02BC to Devanagari.unicharset #34

Open Shreeshrii opened 7 years ago

Shreeshrii commented 7 years ago

Some languages of India make use of U+02BC “ ’ ” modifier letter apostrophe, either as a tone mark or as a length mark in their texts written in Devanagari script.

e.g. ख’ल्ल ित’लकना दख’ना खर’ कत’ पड़ा’ गेल’?

Shreeshrii commented 7 years ago

See https://github.com/tesseract-ocr/tesseract/issues/561 for a list of fonts, with links to the ttf files, that can be used for Devanagari training.

theraysmith commented 7 years ago

The examples you give are all U+2019, so which is it? U+2019 or U+02BC?
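
One way to check which apostrophe a sample actually contains is to list the code points of the apostrophe-like characters in it. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `apostrophe_codepoints` and the sample string are illustrative assumptions, not part of langdata or Tesseract.

```python
import unicodedata

# Characters commonly confused with one another:
# U+0027 APOSTROPHE, U+2019 RIGHT SINGLE QUOTATION MARK,
# U+02BC MODIFIER LETTER APOSTROPHE.
APOSTROPHE_LIKE = {"\u0027", "\u2019", "\u02bc"}

def apostrophe_codepoints(text):
    """Return (code point, Unicode name) for each apostrophe-like character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if ch in APOSTROPHE_LIKE
    ]

# Hypothetical sample: Devanagari KHA, then U+2019, then LA + VIRAMA + LA.
sample = "\u0916\u2019\u0932\u094d\u0932"
print(apostrophe_codepoints(sample))
```

Running this over text copied from a web page or PDF shows whether the source used U+2019 or U+02BC, which are usually indistinguishable on screen.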

Shreeshrii commented 7 years ago

As per http://www.unicode.org/versions/Unicode9.0.0/ch12.pdf page 21, it is U+02BC.

http://www.fileformat.info/info/unicode/char/02bc/index.htm quotes the following regarding U+02BC:

"Comments
apostrophe
glottal stop, glottalization, ejective
many languages use this as a letter of their alphabets
used as a tone marker in Bodo, Dogri, and Maithili
U+2019 is the preferred character for a punctuation apostrophe"

In terms of Tesseract, it would apply to the 'bih' traineddata, as the Bihari group of languages written in Devanagari script includes Maithili.

It is quite possible that the examples that I had copied used the wrong apostrophe.
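
If copied training text does turn out to use U+2019 where the tone or length mark was intended, it could be normalized before training. The sketch below is only a heuristic under a stated assumption: it treats a U+2019 that directly follows a character in the Devanagari block (U+0900 to U+097F) as a modifier letter apostrophe. The function name `to_modifier_apostrophe` is hypothetical, and a genuine closing quotation mark after a Devanagari word would be converted too, so the output would need review.

```python
import re

# Assumption: U+2019 immediately after a Devanagari character (U+0900-U+097F)
# is meant to be U+02BC MODIFIER LETTER APOSTROPHE.
_AFTER_DEVANAGARI = re.compile(r"(?<=[\u0900-\u097F])\u2019")

def to_modifier_apostrophe(text):
    """Replace U+2019 with U+02BC when it follows a Devanagari character."""
    return _AFTER_DEVANAGARI.sub("\u02bc", text)

# Hypothetical sample: KHA + U+2019 + LA becomes KHA + U+02BC + LA.
converted = to_modifier_apostrophe("\u0916\u2019\u0932")
print([f"U+{ord(ch):04X}" for ch in converted])
```

Text in other scripts is left alone, so an English "don’t" in the same file keeps its U+2019.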