vedicsociety / sanskrit-OCR-Feedback

Devanagari + Latin + Cyrillic #1

Open gasyoun opened 3 years ago

gasyoun commented 3 years ago

I have been submitting errors to https://www.sanskrit-lexicon.uni-koeln.de/, the main source of Sanskrit dictionaries on the net, since 2007. In 2014 I launched https://github.com/sanskrit-lexicon/ to make error submission public, and we keep adding new dictionaries. Now I want to add a few Sanskrit-Russian dictionaries, but they intermix languages, and Google OCR handles that even worse than ABBYY FineReader 12 (published in 2013; all later versions have even weaker algorithms and noisier output). In 2013 I wrote up why Hellwig's Devanagari OCR (1.0.0.9 beta) failed at batch recognition of Sanskrit (see http://samskrtam.ru/hellwigs-devanagari-ocr/).

Knauer's whole dictionary can be seen at http://samskrtam.ru/sanskrit-lexicon/knauer/. The original book was scanned at 600 dpi and the print is clear. Still, the output is worse than what I got 7-10 years ago with desktop software (where I was able to train and edit patterns myself).

01002

Output:

а са епсі, и, также, даже, ибо, а, же,
въ стихахъ иногда expl., съ арі (иногда и безъ него) далѣе, также, съ Thйуаs (и безъ) еще; са — са и — и (какъ — такъ и, съ одной - съ другой стороны, хотя — но), при отрицаніи ни — ни; са — па са — tu N. 3, 16 хотя — но не — а; сäiva (ca eva) = са или нѣсколько выразительнѣе; cа, обыкн. сапа (са па) при вопросит. мѣстоим. и нарѣч. = нибудь. — [те, лат. que, гот. -1].
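One way to triage output like this is to flag tokens that mix scripts inside a single word, since a word such as "Thйуаs" (Latin and Cyrillic letters together) is almost certainly an OCR confusion. The following is only a minimal sketch using Python's standard `unicodedata` module; the function names `script_of` and `mixed_script_tokens` are hypothetical helpers, not part of any OCR tool mentioned here.

```python
import re
import unicodedata


def script_of(ch):
    """Return a coarse script label derived from the Unicode character name.

    Non-letters (digits, underscore, punctuation) return None.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    for label in ("CYRILLIC", "LATIN", "DEVANAGARI"):
        if name.startswith(label):
            return label
    return "OTHER"


def mixed_script_tokens(text):
    """List (token, scripts) pairs for tokens whose letters span several scripts.

    Tokens are runs of word characters; combining marks split tokens,
    which is acceptable for this rough check.
    """
    flagged = []
    for token in re.findall(r"\w+", text):
        scripts = {script_of(ch) for ch in token} - {None}
        if len(scripts) > 1:
            flagged.append((token, sorted(scripts)))
    return flagged
```

Running `mixed_script_tokens` over each OCR line gives a list of suspicious words to review by hand. It will not catch words that were recognized entirely in the wrong script (for example a Latin transliteration rendered fully in Cyrillic lookalikes).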

Issues:

gasyoun commented 3 years ago

1) Devanagari lost (replaced by "Лha"); 2) italics lost; 3) hyphen at end of line lost.

gurutva

Лha гуру-тва, ср. (-твам), 1) тяжесть, — вѣсъ, важность, — достоинство, уваженіе внушаемое лѣтами и нравственнымъ достоинствомъ, — достоинство наставника; — 2) тя гости, горе.
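A lost Devanagari headword like the one above can be detected mechanically. The sketch below assumes (this is my assumption, not something the OCR tools guarantee) that every dictionary entry should begin with a Devanagari headword, and checks the first character against the Unicode Devanagari block U+0900 to U+097F. The function name `starts_with_devanagari` is hypothetical.

```python
def starts_with_devanagari(entry: str) -> bool:
    """True if the first non-space character lies in the Devanagari block.

    Assumption: each entry is expected to open with a Devanagari headword.
    """
    stripped = entry.lstrip()
    return bool(stripped) and "\u0900" <= stripped[0] <= "\u097f"
```

Entries for which this returns `False` are candidates for re-recognition or manual correction of the headword.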

gurutva2