Open alevillard opened 6 years ago
I have the same problem with the Arabic language.
Run Tesseract for Training [K:\train tesseract\jTessBoxEditor\tesseract-ocr/tesseract, ara.mylotus.exp0.tif, ara.mylotus.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
row xheight=23, but median xheight = 30.5
APPLY_BOXES: boxfile line 6/ق ((2324,3143),(2338,3173)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 7/ع ((2303,3119),(2334,3157)): FAILURE! Couldn't find a matching blob
...
APPLY_BOXES:
Boxes read from boxfile: 888
Boxes failed resegmentation: 176
For Arabic, you will get better results using tesseract 4.0alpha.
ShreeDevi
On Mon, Oct 9, 2017 at 10:26 PM, idrisalshikh notifications@github.com wrote:
Actually, I'm already using v4, as shown in the training log: Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
I have the same problem. In my case I am trying to train digits from a display.
tesseract day_2_60_0_G3.cfont1.exp0.tif day_2_60_0_G3.cfont1.exp0 -l dianoche2 -psm 7 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 90/. ((1079,11),(1081,17)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 141/. ((1758,2),(1762,16)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 189
Boxes failed resegmentation: 2
Found 187 good blobs.
Leaving 1 unlabelled blobs in 0 words.
TRAINING ... Font name = cfont1
Generated training data for 60 words
day_2_60_0_G3.cfont1.exp0.txt
I attached the box file in .txt format because I couldn't attach it in .box format.
Could anyone help me? Thanks.
These errors have existed for a long time. I think it is a problem with how tesseract segments the page and finds lines. If you only have a couple of these errors, I would say ignore them and proceed to the next step.
Hi, did you solve it?
If I remember correctly, training goes well if you train only the characters that had the box segmentation problem. Then, for the final training, I supply both separate files to create a single dictionary.
@alevillard Can you give a little more detail? What do you mean by 'with box segmentation'? Do I need to specify some arguments?
Hi, I'll try to answer.
I mean the segmentation problem where tesseract cannot find a matching blob during training:
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
..
Boxes failed resegmentation: 2
If you isolate those characters in a separate .box and .tif file, tesseract sometimes succeeds in training them. So, when you have file1.box and file1.tif with one set of characters and file2.box and file2.tif in the same folder, jTessBoxEditor can join the character sets of both files into a single dictionary.
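To isolate the failing characters, you first need to know which rows of the box file they are. The "boxfile line N/" part of each failure message gives that row number. The following is a minimal sketch for collecting those numbers from a saved log; the function name failed_box_lines is hypothetical, and the message format is assumed to be exactly the one shown in this thread.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the box-file line numbers that APPLY_BOXES
// reported as failed. Assumes the message format seen in this thread:
//   APPLY_BOXES: boxfile line 45/<char> ((x1,y1),(x2,y2)): FAILURE! ...
// The whole log is scanned, so it also works if line breaks were lost.
std::vector<int> failed_box_lines(const std::string &log) {
  static const std::regex pattern(R"(boxfile line (\d+)/)");
  std::vector<int> lines;
  for (std::sregex_iterator it(log.begin(), log.end(), pattern), end;
       it != end; ++it) {
    lines.push_back(std::stoi((*it)[1].str()));
  }
  return lines;
}
```

The rows with these numbers are the ones to move, together with the matching image regions, into the second .box/.tif pair.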
@amitdo Sir, after our team traced through the source code, we found a logic bug when generating the *.tr file with the command "tesseract chi.font.exp0.tif chi.font.exp0 nobatch box.train".
The program flow is main() -> ProcessPages() -> ProcessPageInternal() -> ProcessPage() -> Recognize() -> ApplyBoxes() -> ResegmentCharBox(). We found the logic bug in the ResegmentCharBox() function.
There, bounding_box().major_overlap() is called to judge whether a box (from the box file) is reasonable. Here is the code:
inline bool TBOX::major_overlap(  // Do boxes overlap more that half.
    const TBOX &box) const {
  int overlap = MIN(box.top_right.x(), top_right.x());
  overlap -= MAX(box.bot_left.x(), bot_left.x());
  overlap += overlap;
  if (overlap < MIN(box.width(), width()))
    return false;
  overlap = MIN(box.top_right.y(), top_right.y());
  overlap -= MAX(box.bot_left.y(), bot_left.y());
  overlap += overlap;
  if (overlap < MIN(box.height(), height()))
    return false;
  return true;
}
Don't you think this step is unnecessary? Since we have already prepared a good *.box file (checked and corrected in jTessBoxEditor), this step filters out useful box information. Moreover, we guess you get the "blob_box" through the third-party Leptonica library, but in our tests it could not guarantee good results.
The attached zip contains the test image and box file.
Run the command: tesseract temp.tif temp nobatch box.train
You can see that many blobs are missing.
@GitHubGS,
I'm not a core developer, and I have no answer to your question, sorry.
@amitdo anyway, many thanks to you and your team for your brilliant work!
Duplicates
I suggest that we close the older issues since this has the most discussion.
@GitHubGS: Hi, I have encountered a similar problem while training tesseract. In the code that you quoted, I understand that the parameters with the prefix 'box' belong to the box as defined in the box file.
MIN(box.top_right.x(), top_right.x())
For example, here the first parameter is the x-coordinate of the box's top right corner. But what is top_right.x()?
Is it for the detected blob?
Best Regards
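On the question above: major_overlap() is a member function, so the unprefixed top_right and bot_left belong to the object it is called on. In the call quoted elsewhere in this thread, word_res->box_word->bounding_box().major_overlap(box), that object is the bounding box of the word tesseract segmented, and box is the box from the box file. The following is a self-contained sketch of the same test, not tesseract's actual class: the struct name Rect and its field names are hypothetical stand-ins for TBOX.

```cpp
#include <algorithm>

// Hypothetical stand-in for tesseract's TBOX, holding integer corners.
struct Rect {
  int left, bottom, right, top;
  int width() const { return right - left; }
  int height() const { return top - bottom; }

  // Same test as the quoted major_overlap(): on each axis, twice the overlap
  // must be at least the smaller of the two extents. Members without a
  // prefix belong to *this (the segmented word's box in ResegmentCharBox);
  // `box` is the other rectangle (the box-file box).
  bool major_overlap(const Rect &box) const {
    int overlap = std::min(box.right, right) - std::max(box.left, left);
    if (2 * overlap < std::min(box.width(), width())) return false;
    overlap = std::min(box.top, top) - std::max(box.bottom, bottom);
    if (2 * overlap < std::min(box.height(), height())) return false;
    return true;
  }
};
```

For example, take the box-file box ((1061,3024),(1124,3082)) from the log, which is 63 wide. A segmented box spanning x = 1100 to 1200 overlaps it by only 24 in x; doubled, that is 48, which is less than 63, so the test fails and the box is skipped. The segmented coordinates here are made up for illustration.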
same problem
Can anyone tell me what tile width and height should be given when running openalpr-utils-prepcharsfortraining?
same problem
The problem is still pressing.
same problem 😢
same problem 😢
You could add -l chi_tra to resolve it.
same problem
Hello, I am new to tesseract. I am working on the Bengali language [Kalpurush font] and I get lots of errors when I make the .tr files. My workflow is as follows. First I create a text file in UTF-8 format containing some Bengali words, set in the Kalpurush font. Then I create the box files and .tif files with the help of jTessBoxEditor. When I then execute the command [ tesseract ben.kalpurush.exp0.tif ben.kalpurush.exp0 box.train ], it gives me errors like "could not find a matching blob" and "boxes failed resegmentation". For example, if my file contains 600 words, it finds only 300 good blobs. I attached a screenshot. Do I have to change any config for the Bengali language? Can anyone suggest what to do? I can't find any way to resolve this problem.
I have the same problem.
Try the attached traineddata finetuned for Kalpurush font with latest version of tesseract and let me know of its results.
On Wed, Jan 20, 2021 at 12:50 AM FarhanAhmed8 notifications@github.com wrote:
@FarhanAhmed8
For Bengali, you need to train the LSTM model. Legacy model training won't work.
Email did not post the file. Attaching again.
I think I found one solution. The cause is that tesseract performs unnecessary page analysis. Modify two places so that only the box information from the box file is used, then rebuild tesseract.exe:
- Set tesseract::PageSegMode pagesegmode = tesseract::PSM_SINGLE_BLOCK; in main() of tesseractmain.cpp.
- Comment out these 2 lines in ResegmentCharBox() of applybox.cpp:
  // if (!word_res->box_word->bounding_box().major_overlap(box))
  //   continue;
Good luck.
Hello, I am sorry but I couldn't find the tesseractmain.cpp, where can I find it?
./src/api
I have the same problem. It's driving me crazy.
@amitdo I suggest that you also add a legacy tag for issues related to the old non-neural-network tesseract engine.
Sounds like a good solution, but I can't find the tesseract source files (.cpp) on my computer (I don't have a src folder). I just installed Tesseract with sudo apt-get install tesseract-ocr, so I never built an executable myself.
It succeeds with --psm 6, thank you!
@sinall Did you manage to solve the issue concerning the "FAILURE! Couldn't find a matching blob" error?
Thank you, it solved my problem.
same problem and the mentioned "tesseract::PSM_SINGLE_BLOCK" method doesn't help...
I am building OCR for Konkani. While creating .tr files I am getting the following errors:
APPLY_BOXES: boxfile line 391/र ((156,1174),(166,1197)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 392/े ((146,1174),(193,1197)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 393/र ((175,1174),(185,1197)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 397/ं ((204,1173),(249,1195)): FAILURE! Couldn't find a matching blob
Please guide me.
Hmm, it seems like everyone is stuck here like me with no solution!! Praying that some Messiah will come to rescue us.
If you see this error:
The Messiah has come:
Hi, I'm trying to train a new tesseract Chinese dictionary using jTessBoxEditor. The tool creates all the files necessary to train tesseract. I have 273 characters to train. During training I get this error for only two of them:
Moving generated traineddata file to tessdata folder
Training Completed
Run Tesseract for Training [C:\Users\allvilardi\Downloads\jTessBoxEditorFX-2.0-Beta\jTessBoxEditorFX\tesseract-ocr/tesseract, CT_calibri.calibri.exp0.tif, CT_calibri.calibri.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 125/盤 ((1092,2680),(1164,2751)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 273
Boxes failed resegmentation: 2
Found 271 good blobs.
Generated training data for 28 words
I have also adjusted the boxes manually for those two characters, but without success. In the box GUI, the boxes seem to be fine. Does anyone know how to fix this problem? P.S. I also get this error on Korean characters, for all the characters.
These are the graphical boxes for those characters:
Could anyone help me? Thanks.