tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Letters with diacritics both above and below not recognized #1791

Closed Shreeshrii closed 6 years ago

Shreeshrii commented 6 years ago

Letters with diacritics both above and below not recognized / trained.

e.g.

    ḹ 3 2,30,214,255,54,136,0,32,47,173 Latin 168 0 168 ḹ # ḹ [1e39 ]a
    Ṝ 5 2,30,241,255,94,189,0,27,117,200 Latin 148 0 147 Ṝ # Ṝ [1e5c ]A
    ṝ 3 2,30,209,247,63,180,0,35,76,173 Latin 147 0 148 ṝ # ṝ [1e5d ]a

It may be that the code allows a diacritic only above or only below a letter. If so, it should be changed to allow both at once.
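To make "both above and below" concrete, here is a minimal Python sketch (not Tesseract code) that decomposes the three letters into a base letter plus combining marks and labels each mark by its Unicode canonical combining class. The helper name `describe_marks` is made up for illustration.

```python
import unicodedata

def describe_marks(ch):
    """Return (base, [(mark_name, position), ...]) for one precomposed letter."""
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    described = []
    for mark in marks:
        ccc = unicodedata.combining(mark)
        # Canonical combining class 220 means "below", 230 means "above".
        position = {220: "below", 230: "above"}.get(ccc, "other")
        described.append((unicodedata.name(mark), position))
    return base, described

# U+1E39, U+1E5C, U+1E5D: the three letters from the unicharset lines.
for ch in "\u1e39\u1e5c\u1e5d":
    base, marks = describe_marks(ch)
    print(f"U+{ord(ch):04X} {ch}: base {base!r}, marks {marks}")
```

Each of the three letters decomposes into a base letter, a COMBINING DOT BELOW and a COMBINING MACRON, so any step that keeps only one of the two marks would lose part of the glyph.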

https://github.com/tesseract-ocr/tesseract/blob/master/src/textord/tordmain.cpp#L824

    // Above/below refer to word position relative to diacritic. Since some
    // scripts eg Kannada/Telugu habitually put diacritics below words, and
    // others eg Thai/Vietnamese/Latin put most diacritics above words, try
    // for both if there isn't much in it.
    WordWithBox* best_above_word = nullptr;
    WordWithBox* best_below_word = nullptr;

amitdo commented 6 years ago

Did you try plus-minus training with one of these letters?

textord is the layout analysis stage, not the text recognizer stage.

The question is whether textord can correctly segment a line containing one or more of these characters.

Shreeshrii commented 6 years ago

Yes, I have tried training with those characters in the training text and in the unicharset. It probably needs more iterations, if it is not blocked by the code.
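A quick way to double-check that the characters really made it into the unicharset is sketched below. It assumes the usual text layout of a unicharset file, where the first line holds the entry count and each later line starts with the unichar followed by a space. The function names are hypothetical, and the sample deliberately contains only two of the three entries quoted earlier.

```python
import unicodedata

def unichars_in(unicharset_text):
    """Collect the first field of every entry line in a unicharset dump."""
    lines = unicharset_text.splitlines()
    entries = set()
    # Assumption: line 1 is the entry count; every later line begins with
    # the unichar, then a space, then its properties.
    for line in lines[1:]:
        if line.strip():
            first_field = line.split(" ", 1)[0]
            entries.add(unicodedata.normalize("NFC", first_field))
    return entries

def missing_unichars(required, unicharset_text):
    """Return the required characters that have no entry of their own."""
    present = unichars_in(unicharset_text)
    return [c for c in required if unicodedata.normalize("NFC", c) not in present]

# Two of the three entries from the issue; the third is left out on purpose.
sample = "\n".join([
    "2",
    "ḹ 3 2,30,214,255,54,136,0,32,47,173 Latin 168 0 168 ḹ # ḹ [1e39 ]a",
    "Ṝ 5 2,30,241,255,94,189,0,27,117,200 Latin 148 0 147 Ṝ # Ṝ [1e5c ]A",
])

print(missing_unichars(["ḹ", "Ṝ", "ṝ"], sample))
```

Both sides are normalized to NFC so that a decomposed spelling in either the file or the query does not produce a false "missing" report.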

Shreeshrii commented 6 years ago

It seems to work now after adding that one character with plus-minus training. I will test later with multiple added characters and with add-a-layer training.

    lstmeval --model ../tesstutorial/trainplusminus/plusminus_checkpoint \
      --traineddata ./tesstutorial/trainplusminus/eng/eng.traineddata \
      --eval_listfile ./tesstutorial/evalplusminus/eng.training_files.txt \
      --verbosity 2 2>&1 | grep ṝ

Truth:Services 12 for Business way PDT the Inc., « bhṝatṝa Mr. use to (3) could ‘the
OCR  :Services 12 for Business way PDT the Inc., « bhṝatṝa Mr. use to (3) could 'the
Truth:found me I very by Articles as a pitṝi Swann Harika matṝu ODBC Query quay
OCR  :found me I very by Articles as a pitṝi Swann Harika matṝu ODBC Query quay
Truth:insider_guru different New Articles page 23 a To patṝa ~~ a details DC that don't
OCR  :insider_guru different New Articles page 23 a To patṝa ~~ a details DC that don't
Truth:in few ṝna be is 24, find with 3 ” University you ṝnm now! good CAPECOD
OCR  :in few ṝna be is 24, find with 3 " University you ṝnm now! good CAPECOD
Truth:also because - my more 24 like TV need matṝi site May May 3. group 28 long any
OCR  :also because - my more 24 like TV need matṝi site May May 3. group 28 long any
Truth:any ṝtu be including # Profile back SUBDUED Coloured GreenBiz.com® Madox
OCR  :any ṝtu be including # Profile back SUBDUED Coloured GreenBiz.com® Madox
Truth:make AUXILIARY 31 which bhartṝnam DVDs out group April including just
OCR  :make AUXILIARY 31 which bhartṝnam DVDs out group April including just
Truth:baṝset :break CEMETERY This! bot Spokeswoman Hands-on ṝta COBLITZ
OCR  :baṝset :break CEMETERY This! bot Spokeswoman Hands-on ṝta COBLITZ
Truth:very . ID will 23, one 2008 | ** ṝte Use Community 17 In Views: look very with
OCR  :very . ID will 23, one 2008 | ** ṝte Use Community 17 In Views: look very with
Truth:between its ṝshi (€ was Travel would (Cited) Help (Fig. NEWS Shopping not not 15¢
OCR  :between its ṝshi (€ was Travel would (Cited) Help (Fig. NEWS Shopping not not 15¢
Truth:m:ṝ RESURFACE Constitution's double diacritics Carey SUBSTRATUM bhṝatṝa
OCR  :m:ṝ RESURFACE Constitution's double diacritics Carey SUBSTRATUM bhṝatṝa
Truth:even dṝshta Coloured Category:Health Carignan Harassing Guillén pitṝi
OCR  :even dṝshta Coloured Category:Health Carignan Harassing Guillén pitṝi
Truth:as your satṝu than to Email our to Advertise and Home message Services it, ṛte
OCR  :as your satṝu than to Email our to Advertise and Home message Services it, rte
Truth:look very with after click because ¢ réalisateurs Consumables Quincy mṝinal
OCR  :look very with after click because ¢ réalisateurs Consumables Quincy mṝinal
Truth:link reflect work 26 kartṝa Support 19 could Size Help Project ṝnamukta
OCR  :link reflect work 26 kartṝa Support 19 could Size Help Project ṝnamukta
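Reading Truth/OCR pairs by eye is error-prone, so here is a rough Python sketch that counts one character in each pair and lists the pairs where the counts differ. It assumes the log consists of alternating `Truth:` and `OCR  :` lines as shown above. The function name is hypothetical, and the comparison is a plain per-line count, not a character alignment.

```python
import unicodedata

def count_char_in_eval_log(log_text, target):
    """Return (truth_count, ocr_count, differing_pairs) for `target`."""
    target = unicodedata.normalize("NFC", target)
    truth_count = ocr_count = 0
    differing = []
    truth = None
    for raw in log_text.splitlines():
        line = unicodedata.normalize("NFC", raw)
        if line.startswith("Truth:"):
            truth = line[len("Truth:"):]
        elif line.startswith("OCR") and truth is not None:
            # Everything after the first colon is the recognized text.
            ocr = line.partition(":")[2]
            t, o = truth.count(target), ocr.count(target)
            truth_count += t
            ocr_count += o
            if t != o:
                differing.append((truth, ocr))
            truth = None
    return truth_count, ocr_count, differing

# Two pairs copied from the output above.
sample_log = "\n".join([
    "Truth:found me I very by Articles as a pitṝi Swann Harika matṝu ODBC Query quay",
    "OCR  :found me I very by Articles as a pitṝi Swann Harika matṝu ODBC Query quay",
    "Truth:as your satṝu than to Email our to Advertise and Home message Services it, ṛte",
    "OCR  :as your satṝu than to Email our to Advertise and Home message Services it, rte",
])

print(count_char_in_eval_log(sample_log, "\u1e5d")[:2])  # U+1E5D, dot below and macron
print(count_char_in_eval_log(sample_log, "\u1e5b")[:2])  # U+1E5B, dot below only
```

On these two pairs the counts for ṝ agree between Truth and OCR, while ṛ (dot below only, no macron) appears in the truth text but comes out as plain r in the OCR line.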