Shreeshrii closed this issue 6 years ago.
Did you try plus-minus with one of these letters?
textord is the layout analysis stage, not the text recognizer stage.
The question is if textord is able to correctly segment a line with one or more of these characters.
Yes, I have tried training with those characters in the training text and in the unicharset. It probably needs more iterations, unless it is blocked by the code.
On Thu 19 Jul, 2018, 11:36 PM Amit D. wrote:
> Did you try plus-minus with one of these letters?
> textord is the layout analysis stage, not the text recognizer stage.
> The question is if textord is able to correctly segment a line with one or more of these characters.
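A quick way to confirm that the characters really are present in both the training text and the unicharset is sketched below. This is only an illustration: the helper names `unicharset_symbols` and `coverage` are made up for this example, and it assumes that the first whitespace-separated field of each unicharset entry is the symbol and that a leading line holding only a number is the entry count. It also assumes the text uses the precomposed (NFC) code points.

```python
from collections import Counter

# The three letters discussed in this issue, as precomposed code points.
TARGETS = ["\u1e39", "\u1e5c", "\u1e5d"]  # ḹ, Ṝ, ṝ


def unicharset_symbols(lines):
    """Collect the symbols named in unicharset-style lines.

    Assumption: field 1 of every entry is the symbol itself, and a first
    line that contains only digits is the entry count, not an entry.
    """
    symbols = set()
    for i, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue
        if i == 0 and len(fields) == 1 and fields[0].isdigit():
            continue  # header line with the number of entries
        symbols.add(fields[0])
    return symbols


def coverage(training_text, unicharset_lines, targets=TARGETS):
    """Map each target to (occurrences in the text, listed in unicharset?)."""
    counts = Counter(training_text)  # counts single code points
    symbols = unicharset_symbols(unicharset_lines)
    return {ch: (counts[ch], ch in symbols) for ch in targets}


if __name__ == "__main__":
    sample_unicharset = [
        "2",
        "\u1e5c 5 2,30,241,255,94,189,0,27,117,200 Latin 148 0 147 \u1e5c",
        "\u1e5d 3 2,30,209,247,63,180,0,35,76,173 Latin 147 0 148 \u1e5d",
    ]
    sample_text = "bh\u1e5dat\u1e5da pit\u1e5di"
    for ch, (count, listed) in coverage(sample_text, sample_unicharset).items():
        print(f"U+{ord(ch):04X} count={count} in_unicharset={listed}")
```

A target that appears only a handful of times in the training text is consistent with the guess above that more iterations (or more samples) are needed.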
It seems to work now after adding that one character with plus-minus training. I will test later with multiple added characters and with add-a-layer training.
lstmeval --model ../tesstutorial/trainplusminus/plusminus_checkpoint --traineddata ./tesstutorial/trainplusminus/eng/eng.traineddata --eval_listfile ./tesstutorial/evalplusminus/eng.training_files.txt --verbosity 2 2>&1 | grep ṝ
Truth:Services 12 for Business way PDT the Inc., « bhṝatṝa Mr. use to (3) could ‘the
OCR :Services 12 for Business way PDT the Inc., « bhṝatṝa Mr. use to (3) could 'the
Truth:found me I very by Articles as a pitṝi Swann Harika matṝu ODBC Query quay
OCR :found me I very by Articles as a pitṝi Swann Harika matṝu ODBC Query quay
Truth:insider_guru different New Articles page 23 a To patṝa ~~ a details DC that don't
OCR :insider_guru different New Articles page 23 a To patṝa ~~ a details DC that don't
Truth:in few ṝna be is 24, find with 3 ” University you ṝnm now! good CAPECOD
OCR :in few ṝna be is 24, find with 3 " University you ṝnm now! good CAPECOD
Truth:also because - my more 24 like TV need matṝi site May May 3. group 28 long any
OCR :also because - my more 24 like TV need matṝi site May May 3. group 28 long any
Truth:any ṝtu be including # Profile back SUBDUED Coloured GreenBiz.com® Madox
OCR :any ṝtu be including # Profile back SUBDUED Coloured GreenBiz.com® Madox
Truth:make AUXILIARY 31 which bhartṝnam DVDs out group April including just
OCR :make AUXILIARY 31 which bhartṝnam DVDs out group April including just
Truth:baṝset :break CEMETERY This! bot Spokeswoman Hands-on ṝta COBLITZ
OCR :baṝset :break CEMETERY This! bot Spokeswoman Hands-on ṝta COBLITZ
Truth:very . ID will 23, one 2008 | ** ṝte Use Community 17 In Views: look very with
OCR :very . ID will 23, one 2008 | ** ṝte Use Community 17 In Views: look very with
Truth:between its ṝshi (€ was Travel would (Cited) Help (Fig. NEWS Shopping not not 15¢
OCR :between its ṝshi (€ was Travel would (Cited) Help (Fig. NEWS Shopping not not 15¢
Truth:m:ṝ RESURFACE Constitution's double diacritics Carey SUBSTRATUM bhṝatṝa
OCR :m:ṝ RESURFACE Constitution's double diacritics Carey SUBSTRATUM bhṝatṝa
Truth:even dṝshta Coloured Category:Health Carignan Harassing Guillén pitṝi
OCR :even dṝshta Coloured Category:Health Carignan Harassing Guillén pitṝi
Truth:as your satṝu than to Email our to Advertise and Home message Services it, ṛte
OCR :as your satṝu than to Email our to Advertise and Home message Services it, rte
Truth:look very with after click because ¢ réalisateurs Consumables Quincy mṝinal
OCR :look very with after click because ¢ réalisateurs Consumables Quincy mṝinal
Truth:link reflect work 26 kartṝa Support 19 could Size Help Project ṝnamukta
OCR :link reflect work 26 kartṝa Support 19 could Size Help Project ṝnamukta
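Differences in output like the above are hard to spot by eye. The sketch below pairs each Truth line with the OCR line that follows it and lists the words that differ. It assumes the line format shown above ("Truth:" and "OCR :" prefixes); the function names are hypothetical, and the word-by-word comparison is only meaningful when both lines split into the same number of words.

```python
from itertools import zip_longest


def parse_pairs(lines):
    """Pair every 'Truth:' line with the next 'OCR :' line."""
    pairs = []
    truth = None
    for line in lines:
        if line.startswith("Truth:"):
            truth = line.partition(":")[2]
        elif line.startswith("OCR") and truth is not None:
            pairs.append((truth, line.partition(":")[2]))
            truth = None
    return pairs


def word_mismatches(truth, ocr):
    """Return (truth_word, ocr_word) for each position where they differ."""
    return [
        (t, o)
        for t, o in zip_longest(truth.split(), ocr.split(), fillvalue="")
        if t != o
    ]


if __name__ == "__main__":
    sample = [
        "Truth:Home message Services it, \u1e5bte",
        "OCR :Home message Services it, rte",
        "Truth:Mr. use to (3) could \u2018the",
        "OCR :Mr. use to (3) could 'the",
    ]
    for truth, ocr in parse_pairs(sample):
        for t, o in word_mismatches(truth, ocr):
            print(f"{t!r} -> {o!r}")
```

On the output above, this kind of comparison would flag the typographic quote substitutions and the one line where the truth has ṛ (U+1E5B, dot below only) but the OCR result has a plain r.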
Letters with diacritics both above and below not recognized / trained.
e.g.
ḹ 3 2,30,214,255,54,136,0,32,47,173 Latin 168 0 168 ḹ # ḹ [1e39 ]a
Ṝ 5 2,30,241,255,94,189,0,27,117,200 Latin 148 0 147 Ṝ # Ṝ [1e5c ]A
ṝ 3 2,30,209,247,63,180,0,35,76,173 Latin 147 0 148 ṝ # ṝ [1e5d ]a
It may be that the code allows for a diacritic either above or below, but not both; if so, it should be changed to allow for both.
https://github.com/tesseract-ocr/tesseract/blob/master/src/textord/tordmain.cpp#L824
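To make concrete what "both above and below" means for the code points listed above (1e39, 1e5c, 1e5d), the sketch below decomposes each one with Python's standard `unicodedata` module and labels every part by its Unicode canonical combining class (220 is "below", 230 is "above"). It says nothing about how textord handles these shapes; it only shows the Unicode structure of the letters. The helper name `describe` is made up for this example.

```python
import unicodedata

# Canonical combining classes relevant here; anything else is shown as-is.
POSITIONS = {0: "base", 220: "below", 230: "above"}


def describe(ch):
    """Return (code point, Unicode name, position) for each NFD component."""
    parts = []
    for c in unicodedata.normalize("NFD", ch):
        cc = unicodedata.combining(c)
        parts.append(
            (f"U+{ord(c):04X}", unicodedata.name(c), POSITIONS.get(cc, f"class {cc}"))
        )
    return parts


for cp in (0x1E39, 0x1E5C, 0x1E5D):
    print(f"U+{cp:04X}:")
    for code, name, position in describe(chr(cp)):
        print(f"  {code} {name} ({position})")
```

Each of the three letters decomposes into a base letter plus one mark below (dot below) and one mark above (macron), which is the case the issue suspects is not handled.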