Open Shreeshrii opened 7 years ago
See https://github.com/tesseract-ocr/tesseract/issues/561 for a list of fonts, with links to the TTF files, that can be used for Devanagari training.
The examples you give are all U+2019, so which is it? 2019 or 2bc?
As per http://www.unicode.org/versions/Unicode9.0.0/ch12.pdf page 21, it is U+02BC.
http://www.fileformat.info/info/unicode/char/02bc/index.htm quotes the following regarding U+02BC:
"Comments
apostrophe
glottal stop, glottalization, ejective
many languages use this as a letter of their alphabets
used as a tone marker in Bodo, Dogri, and Maithili
U+2019 is the preferred character for a punctuation apostrophe"
In terms of Tesseract, it would apply to the 'bih' traineddata, as the Bihari group of languages written in Devanagari script includes Maithili.
It is quite possible that the examples that I had copied used the wrong apostrophe.
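One way to settle which apostrophe a copied sample actually contains is to print the code point of each apostrophe-like character. The following is a minimal sketch using only Python's standard `unicodedata` module; the function name `apostrophe_report` and the sample string are illustrative and not part of Tesseract.

```python
import unicodedata

# Characters that are easily confused with one another:
# U+0027 APOSTROPHE, U+2019 RIGHT SINGLE QUOTATION MARK,
# U+02BC MODIFIER LETTER APOSTROPHE.
CANDIDATES = {"\u0027", "\u2019", "\u02BC"}

def apostrophe_report(text):
    """Return (index, code point, Unicode name) for each apostrophe-like character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in CANDIDATES
    ]

# Hypothetical sample: a Devanagari word typed with U+2019.
sample = "ख\u2019ल्ल"
for entry in apostrophe_report(sample):
    print(entry)
```

Running this on text pasted from a web page or PDF shows whether the source used U+2019 or U+02BC, regardless of how the font renders the glyph.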
Some languages of India make use of U+02BC “ ’ ” modifier letter apostrophe, either as a tone mark or as a length mark in their texts written in Devanagari script.
e.g. ख’ल्ल ित’लकना दख’ना खर’ कत’ पड़ा’ गेल’?
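If training text turns out to use U+2019 where U+02BC is intended, it could be normalized before training. The sketch below is only an assumption about how one might do that: it replaces U+2019 with U+02BC when the preceding character is in the Devanagari block (U+0900 to U+097F). The helper name `to_modifier_apostrophe` is hypothetical, and the rule is a heuristic; it would also convert a genuine closing quotation mark that directly follows a Devanagari word, so results should be reviewed.

```python
import re

# Match U+2019 only when it directly follows a Devanagari character.
# The lookbehind is fixed-width (one character), which Python's re supports.
_AFTER_DEVANAGARI = re.compile("(?<=[\u0900-\u097F])\u2019")

def to_modifier_apostrophe(text):
    """Replace U+2019 with U+02BC after Devanagari letters or signs (heuristic)."""
    return _AFTER_DEVANAGARI.sub("\u02BC", text)

# Hypothetical usage: apostrophes after Devanagari are converted,
# while U+2019 in Latin text is left alone.
converted = to_modifier_apostrophe("खर\u2019 कत\u2019 don\u2019t")
print([f"U+{ord(ch):04X}" for ch in converted if ch in "\u2019\u02BC"])
```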