tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

equation: detection result disappeared #2204

Open johnght opened 5 years ago

johnght commented 5 years ago

As you can see in the images below, Zongyi (Joe) Liu (joeliu@google.com) did a great job in ccmain/equationdetect.cpp. Equation detection is disabled by default, but it can be enabled by setting two variables:

api->SetVariable("equationdetect_save_merged_image", "T");
api->SetVariable("textord_equation_detect", "T");

This outputs the debug image shown in 'Expected Behavior' below.

However, as shown in 'Current Behavior' below, the detected equations are not reflected in the final ScrollView output by default. Going through the API with the equ language does not help either, e.g. api->Init(NULL, "osd+eng+equ"); // with equ.traineddata under tessdata

So here is the bug: equation detection itself works, but the correct result seems to get filtered out afterwards?

Environment

Current Behavior:

[images: equ_sv, equ]

Expected Behavior:

[image: merged]

Suggested Fix:

Somewhere in the function FindBlocks in textord/colfind.cpp.

amitdo commented 5 years ago

Try with oem 0. Set the lang to 'eng'.

johnght commented 5 years ago

As suggested, api->Init(NULL, "eng", OEM_TESSERACT_ONLY);

[image: equ_null]

The result is better: at least the left-hand side of the first equation is preserved, the last line of the second equation is no longer mistaken for part of the paragraph, and the third equation is no longer treated as a paragraph. However, some of the pieces from the 'Expected Behavior' image are still missing. In fact, the result is the same as with a NULL initialization, api->Init(NULL, NULL);

What's happening? Is the result proposed by the equation detection module competing with results from other modules and finally getting rejected? Sorry, I can only trace the problem as far as FindBlocks in textord/colfind.cpp. Digging further into the internals seems to be a job for a veteran.

amitdo commented 5 years ago

> However, some of the pieces from the 'Expected Behavior' image are still missing. In fact, the result is the same as with a NULL initialization, api->Init(NULL, NULL);

With NULL as the language parameter, this function uses 'eng'.

johnght commented 5 years ago

Thanks for the clarification. There may be glitches in my test program, so let's focus on the ScrollView output. As suggested:

$ tesseract --oem 0 -l eng fullview.jpg segdb segdemo inter

[image: equ_sv_blk]

Another debug image, produced by api->SetVariable("equationdetect_save_spt_image", "T");

[image: spt]

The Tesseract main program ignores obvious things that the equation detection module catches, for example summation signs (in some cases) and equation boxes that do not collide with any other box. This looks like a bug.

amitdo commented 5 years ago

Do the block bboxes in the hOCR output match the image you get in 'Expected Behavior'?

Please upload the original image.

Also, upload the output of:

tesseract in.png out --oem 0 -c textord_equation_detect=1 txt hocr

tesseract in.png out --oem 1 -c textord_equation_detect=1 txt hocr
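The bbox question above can also be checked programmatically rather than by eye. The following is only a minimal sketch, not part of Tesseract: the helper names (Box, ExtractBoxes, Iou, HasMatchingBlock) are hypothetical, it assumes block-level areas appear in the hOCR output as elements with class 'ocr_carea' and a title of the form "bbox x0 y0 x1 y1", and the expected equation box has to be read by hand from the debug image.

```cpp
#include <algorithm>
#include <regex>
#include <string>
#include <vector>

// Axis-aligned box in image pixels, as hOCR writes it: "bbox x0 y0 x1 y1".
struct Box {
  int x0, y0, x1, y1;
};

// Collects the bbox of every element whose class is `cls` (e.g. "ocr_carea").
// Assumption: the class attribute precedes the title attribute in the tag,
// and `cls` contains no regex metacharacters.
std::vector<Box> ExtractBoxes(const std::string& hocr, const std::string& cls) {
  const std::regex tag("class=['\"]" + cls +
                       "['\"][^>]*title=['\"][^'\"]*bbox (\\d+) (\\d+) (\\d+) (\\d+)");
  std::vector<Box> boxes;
  for (std::sregex_iterator it(hocr.begin(), hocr.end(), tag), end; it != end; ++it) {
    const std::smatch& m = *it;
    boxes.push_back({std::stoi(m[1].str()), std::stoi(m[2].str()),
                     std::stoi(m[3].str()), std::stoi(m[4].str())});
  }
  return boxes;
}

// Intersection over union of two boxes; 0.0 when they do not overlap.
double Iou(const Box& a, const Box& b) {
  const int iw = std::min(a.x1, b.x1) - std::max(a.x0, b.x0);
  const int ih = std::min(a.y1, b.y1) - std::max(a.y0, b.y0);
  if (iw <= 0 || ih <= 0) return 0.0;
  const double inter = static_cast<double>(iw) * ih;
  const double area_a = static_cast<double>(a.x1 - a.x0) * (a.y1 - a.y0);
  const double area_b = static_cast<double>(b.x1 - b.x0) * (b.y1 - b.y0);
  return inter / (area_a + area_b - inter);
}

// True if some hOCR block overlaps `expected` with IoU >= `threshold`.
bool HasMatchingBlock(const std::vector<Box>& blocks, const Box& expected,
                      double threshold) {
  return std::any_of(blocks.begin(), blocks.end(), [&](const Box& b) {
    return Iou(b, expected) >= threshold;
  });
}
```

Feeding the contents of the generated .hocr file to ExtractBoxes and calling HasMatchingBlock once per equation box from the debug image would show which detected equations never make it into the final block list. The IoU threshold is a free choice; 0.5 is a common starting point.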

johnght commented 5 years ago

No, the bboxes in the hOCR output still match 'Current Behavior'. The only difference between oem 0 and oem 1 is the minor improvement mentioned above: https://github.com/tesseract-ocr/tesseract/issues/2204#issuecomment-458118844

Original image, from the book CLRS:

[image: fullview]

tesseract in.png out --oem 0 -c textord_equation_detect=1 txt hocr
Output: out0.txt, out0.hocr.txt

tesseract in.png out --oem 1 -c textord_equation_detect=1 txt hocr
Output: out1.txt, out1.hocr.txt

weslleyt commented 4 years ago

Dear @johnght,

Did you manage to fix it? I need to detect equations too, but Tesseract does not seem to be working properly for this. I can't find any clear documentation on how to "unlock" this feature in Tesseract.