tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Training tesseract, APPLY_BOXES: ... FAILURE! Couldn't find a matching blob #1166

Open alevillard opened 6 years ago

alevillard commented 6 years ago

Hi, I'm trying to train a new tesseract Chinese dictionary using jTessBoxEditor. The tool creates all the files necessary to train tesseract. I have 273 characters to train. During the training I get this error for only two of them:

Moving generated traineddata file to tessdata folder
Training Completed
Run Tesseract for Training
[C:\Users\allvilardi\Downloads\jTessBoxEditorFX-2.0-Beta\jTessBoxEditorFX\tesseract-ocr/tesseract, CT_calibri.calibri.exp0.tif, CT_calibri.calibri.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
APPLY_BOXES: boxfile line 125/盤 ((1092,2680),(1164,2751)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 273
Boxes failed resegmentation: 2
Found 271 good blobs.
Generated training data for 28 words

I have also adjusted the boxes manually for those two characters, but without success. In the box editor GUI, the boxes seem to be fine. Does anyone know how to fix this problem? P.S. I also get this error with Korean characters, for all of the characters.

These are the graphical boxes for those characters:

image

Could anyone help me? Thanks.

idrisalshikh commented 6 years ago

I have the same problem with the Arabic language:

Run Tesseract for Training
[K:\train tesseract\jTessBoxEditor\tesseract-ocr/tesseract, ara.mylotus.exp0.tif, ara.mylotus.exp0, box.train]
Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
row xheight=23, but median xheight = 30.5
APPLY_BOXES: boxfile line 6/ق ((2324,3143),(2338,3173)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 7/ع ((2303,3119),(2334,3157)): FAILURE! Couldn't find a matching blob
...
APPLY_BOXES:
Boxes read from boxfile: 888
Boxes failed resegmentation: 176

Shreeshrii commented 6 years ago

For Arabic, you will get better results using tesseract 4.0alpha.


idrisalshikh commented 6 years ago

Actually, I'm already using v4, as shown in the training log: Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica

Shreeshrii commented 6 years ago

Same issue as

https://github.com/tesseract-ocr/tesseract/issues/436
https://github.com/tesseract-ocr/tesseract/issues/445
https://github.com/tesseract-ocr/tesseract/issues/1033

iareizaga commented 6 years ago

I have the same problem. In my case I am trying to train digits from a display.

tesseract day_2_60_0_G3.cfont1.exp0.tif day_2_60_0_G3.cfont1.exp0 -l dianoche2 -psm 7 nobatch box.train

Tesseract Open Source OCR Engine v3.02 with Leptonica

FAIL!
APPLY_BOXES: boxfile line 90/. ((1079,11),(1081,17)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 141/. ((1758,2),(1762,16)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 189
Boxes failed resegmentation: 2
Found 187 good blobs.
Leaving 1 unlabelled blobs in 0 words.
TRAINING ... Font name = cfont1
Generated training data for 60 words
day_2_60_0_g3 cfont1 exp0

day_2_60_0_G3.cfont1.exp0.txt

I am attaching the box file in .txt format because I couldn't attach it in .box format.

Could anyone help me? Thanks.

Shreeshrii commented 6 years ago

These errors have existed for a long time. I think it is a problem with how tesseract segments the page and finds lines. If you only have a couple of these errors, I would say ignore them and proceed to the next step.

gbolin commented 6 years ago

Hi, did you solve it?

alevillard commented 6 years ago

If I remember correctly, training works if you train only the characters that have the box segmentation problem on their own. Then, for the final training, I supply both separate sets of files to create a single dictionary.

gbolin commented 6 years ago

@alevillard Can you give me a little more detail? What do you mean by "with box segmentation"? Do I need to specify some arguments?

alevillard commented 6 years ago

Hi, I'll try to answer.

I mean the segmentation problem where tesseract cannot find a matching blob during training:

Tesseract Open Source OCR Engine v4.0.0-alpha.20170804 with Leptonica
Page 1
FAIL!
APPLY_BOXES: boxfile line 45/四 ((1061,3024),(1124,3082)): FAILURE! Couldn't find a matching blob
FAIL!
..
Boxes failed resegmentation: 2

If you isolate those characters in separate .box and .tif files, tesseract sometimes succeeds in training them. So, when you have file1.box and file1.tif with one set of characters and file2.box and file2.tif in the same folder, jTessBoxEditor can join the character sets of both files into a single dictionary.
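
The workaround above starts from knowing which box-file entries failed. As a rough sketch only, assuming the training output has been captured as text, the following pulls the reported box-file line numbers out of the `APPLY_BOXES` failure messages shown in this thread. `failed_box_lines` is a hypothetical helper written for illustration; it is not part of Tesseract or jTessBoxEditor, and it relies only on the `boxfile line N/` wording visible in the logs above.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the box-file line numbers that the training
// log reports in its "boxfile line N/<char> ... FAILURE!" messages.
std::vector<int> failed_box_lines(const std::string& log) {
    // Matches the "boxfile line 45/" prefix of each failure message.
    static const std::regex pattern(R"(boxfile line (\d+)/)");
    std::vector<int> lines;
    for (std::sregex_iterator it(log.begin(), log.end(), pattern), end;
         it != end; ++it) {
        assert(it->size() == 2);  // whole match plus one capture group
        lines.push_back(std::stoi((*it)[1].str()));
    }
    return lines;
}
```

The entries at those line numbers are the candidates to move into a second .box/.tif pair, as described above. Whether retraining them in isolation succeeds is not guaranteed; the thread only reports that it sometimes does.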

gbolin commented 6 years ago

@amitdo Sir, after our team traced the source code, we found a logical bug in the step that generates the *.tr file when running the command "tesseract chi.font.exp0.tif chi.font.exp0 nobatch box.train".

The program flow is main() -> ProcessPages() -> ProcessPageInternal() -> ProcessPage() -> Recognize() -> ApplyBoxes() -> ResegmentCharBox(), and we found the "logical bug" in the ResegmentCharBox() function. It calls bounding_box().major_overlap() to judge whether a box (from the box file) is reasonable or not. Here is the code:

    inline bool TBOX::major_overlap(  // Do boxes overlap more that half.
        const TBOX &box) const {
      int overlap = MIN(box.top_right.x(), top_right.x());
      overlap -= MAX(box.bot_left.x(), bot_left.x());
      overlap += overlap;
      if (overlap < MIN(box.width(), width()))
        return false;
      overlap = MIN(box.top_right.y(), top_right.y());
      overlap -= MAX(box.bot_left.y(), bot_left.y());
      overlap += overlap;
      if (overlap < MIN(box.height(), height()))
        return false;
      return true;
    }

Don't you think this step is unnecessary? Since we have already prepared a good *.box file (checked and modified with jTessBoxEditor), this step filters out useful box information. In addition, we guess that you get the "blob_box" through the third-party Leptonica library, but as far as we have tested, it cannot guarantee a good result. The attached zip contains the test image and box file. Run the command "tesseract temp.tif temp nobatch box.train" and you can see that many blobs are missing.
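
To make the quoted check easier to experiment with, here is a minimal self-contained sketch of the same arithmetic. `Box` is a hypothetical stand-in for `TBOX` (it is not Tesseract's class), and the free function takes both boxes as arguments instead of being a member. Judging from the call site quoted elsewhere in this thread, the box the method is called on appears to be the one Tesseract found on the page and the argument is the entry from the box file, but that reading is an assumption.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for TBOX: integer corners, using the
// ((left,bottom),(right,top)) layout seen in the APPLY_BOXES messages.
struct Box {
    int left, bottom, right, top;
    int width() const { return right - left; }
    int height() const { return top - bottom; }
};

// True when, on both axes, twice the overlap is at least the smaller of the
// two extents, i.e. the boxes overlap by at least half of the smaller box.
// Same arithmetic as the quoted major_overlap().
bool major_overlap(const Box& a, const Box& b) {
    assert(a.left <= a.right && a.bottom <= a.top);
    assert(b.left <= b.right && b.bottom <= b.top);
    int overlap_x = std::min(a.right, b.right) - std::max(a.left, b.left);
    if (2 * overlap_x < std::min(a.width(), b.width())) return false;
    int overlap_y = std::min(a.top, b.top) - std::max(a.bottom, b.bottom);
    if (2 * overlap_y < std::min(a.height(), b.height())) return false;
    return true;
}
```

For example, with the box-file entry `Box{1061, 3024, 1124, 3082}` from the log above, a made-up detected box `Box{1060, 3020, 1125, 3085}` passes the check, while a made-up box shifted sideways to `Box{1100, 3024, 1163, 3082}` fails it because the horizontal overlap (24 px) is less than half of the 63 px width. The detected-box coordinates here are invented for illustration; the thread does not show what Tesseract actually found.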

Archive.zip

amitdo commented 6 years ago

@GitHubGS,

I'm not a core developer, and I have no answer to your question, sorry.

gbolin commented 6 years ago


@amitdo Anyway, many thanks to you and your team for your brilliant work!

Shreeshrii commented 6 years ago

Duplicates: #436, #445, #1033

I suggest that we close the older issues since this has the most discussion.

MehulBhardwaj91 commented 6 years ago

@GitHubGS: Hi, I have encountered a similar problem while training tesseract. In the code you posted, I understand that the parameters prefixed with 'box' refer to the box as defined in the box file. For example, in MIN(box.top_right.x(), top_right.x()) the first argument is the x-coordinate of the box's top-right corner. But what is top_right.x()? Does it belong to the detected blob? Best regards

iterateself commented 5 years ago

same problem

aniethomas commented 5 years ago

Can anyone tell me what tile width and height should be given when running openalpr-utils-prepcharsfortraining?

bambooj commented 5 years ago

same problem

Simakvokka commented 5 years ago

The problem is still acute.

guohao commented 5 years ago

same problem 😢

a1094426901 commented 4 years ago

same problem 😢

jinchenxiangdan commented 4 years ago

You could try adding -l chi_tra to resolve it.

CharlieGit commented 4 years ago

same problem

jswlovers commented 3 years ago

I think I found one solution. The cause is that tesseract performs unnecessary page analysis. Modify two places so that only the box information from the box file is used.

Rebuild tesseract.exe with the following changes:

  1. Set tesseract::PageSegMode pagesegmode = tesseract::PSM_SINGLE_BLOCK; in main() of tesseractmain.cpp.
  2. Comment out these two lines in ResegmentCharBox() of applybox.cpp:
     //if (!word_res->box_word->bounding_box().major_overlap(box))
     //  continue;

Good Luck

FarhanAhmed8 commented 3 years ago

Hello, I am new to tesseract. I am working on the Bengali language [Kalpurush font] and I get lots of errors when I create the .tr files. My workflow is as follows. First I create text files in UTF-8 format containing some Bengali words in the Kalpurush font. Then I create the box files and tif files with the help of jTessBoxEditor. When I then run the command [ tesseract ben.kalpurush.exp0.tif ben.kalpurush.exp0 box.train ], it gives me errors like "could not find a matching blob" and "box failed resegmentation". For example, if my file has 600 words, it finds only 300 good blobs. I have attached a screenshot. Do I have to change any config for the Bengali language? Can anyone suggest what to do? I can't find any way to resolve this problem.

Screenshot_162

cediy2088 commented 3 years ago

I have the same problem

Shreeshrii commented 3 years ago

Try the attached traineddata finetuned for Kalpurush font with latest version of tesseract and let me know of its results.


Shreeshrii commented 3 years ago

@FarhanAhmed8

For Bengali, you need to train the LSTM model. Legacy model training won't work.

Try the attached traineddata finetuned for Kalpurush font with latest version of tesseract and let me know of its results.

Email did not post the file. Attaching again.

benKalpurush_0.049_1720_17200.zip

ahmedelbehery99 commented 3 years ago

(Quoting jswlovers's two-step fix above: PSM_SINGLE_BLOCK in main() of tesseractmain.cpp, and the major_overlap check commented out in ResegmentCharBox() of applybox.cpp.)

Hello, I am sorry, but I couldn't find tesseractmain.cpp. Where can I find it?

mtclaw commented 3 years ago

(Quoting ahmedelbehery99's question above about where to find tesseractmain.cpp.)

./src/api

xyzhuofeng commented 3 years ago

I have the same problem. It's driving me crazy.

Shreeshrii commented 3 years ago

@amitdo I suggest that you also add a legacy tag for issues related to the old non-neural network tesseract engine.

Duboislo commented 3 years ago

(Quoting jswlovers's two-step fix above: PSM_SINGLE_BLOCK in main() of tesseractmain.cpp, and the major_overlap check commented out in ResegmentCharBox() of applybox.cpp.)

This sounds like a good solution, but I can't find the tesseract source files (.cpp) on my computer (I don't have a src folder). I just installed Tesseract with sudo apt-get install tesseract-ocr, so I never built an executable myself.

sinall commented 3 years ago

(Quoting jswlovers's two-step fix above: PSM_SINGLE_BLOCK in main() of tesseractmain.cpp, and the major_overlap check commented out in ResegmentCharBox() of applybox.cpp.)

It succeeds with --psm 6, thank you!

wolfassi123 commented 2 years ago

@sinall Did you manage to solve the issue concerning the "FAILURE! Couldn't find a matching blob" error?

yourskc commented 2 years ago

(Quoting jswlovers's two-step fix above: PSM_SINGLE_BLOCK in main() of tesseractmain.cpp, and the major_overlap check commented out in ResegmentCharBox() of applybox.cpp.)

Thank you, it solved my problem.

SpaceView commented 2 years ago

Same problem, and the "tesseract::PSM_SINGLE_BLOCK" method mentioned above doesn't help...

tejakundaikar commented 1 year ago

I am building OCR for Konkani. While creating the .tr files I get the following errors:

APPLY_BOXES: boxfile line 391/र ((156,1174),(166,1197)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 392/े ((146,1174),(193,1197)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 393/र ((175,1174),(185,1197)): FAILURE! Couldn't find a matching blob
APPLY_BOXES: boxfile line 397/ं ((204,1173),(249,1195)): FAILURE! Couldn't find a matching blob

Please guide me.

nandlalkumar commented 12 months ago

Hmm, it seems like everyone is stuck here like me with no solution!! Praying that some Messiah will come to rescue us.

zdenop commented 12 months ago

If you see this error:

The Messiah has come: