TerryZH commented 7 years ago

Environment

Tesseract Version: v4.00.00dev-692-gad5ee18 with Leptonica
Commit Number: ad5ee18
Platform: MAC OS 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64

Current Behavior:

Line 1, unexpected '__' recognized between 1941 and Ritter, with bbox as the entire page.

sample Corresponding HOCR line: GS 1 2,261,002 Oct. 28,1941 __ Ritter 760 $FO

Expected Behavior:

'__' is not supposed to be recognized in the first place. If the false positive recognition is inevitable, the bbox information should be accurate.

Suggested Fix:

n/a

napasa commented 6 years ago

When does this will be fixed?

Shreeshrii commented 6 years ago

Comparing best and fast OCRed text

While best does not add __ for the vertical bar in table to the OCRed txt fast did not recognize , as part of the number string.

Best

&% 1 2261002 Oct.28,1941 Ritter 260 SFO
(2 2 2211378 Jan. 27,1942 Searle ZB 2%

Fast

GS 1 2,261,002 Oct. 28,1941 __ Ritter 760 $FO
CG (2 2,271,378 Jan. 27,1942 Searle Ke \ 22

@zdenop Please label accuracy

devendrasr commented 6 years ago

@Shreeshrii How can we overcome this issue?

willaaam commented 6 years ago

This is much more than an accuracy error - even for completely accurate words the bounding boxes make no sense at all, whether using TSV or HOCR output. This is pretty big for me.

The weird part is - somewhere internally it seems to know the coordinates as the word_ids are in order LTR and follow up.

Does anyone have any suggestion on where to start looking? I'm happy to hunt this down with some rusty C skills but I really need a pointer, completely unfamiliar with the tesseract codebase.

It happens with the LSTM engine, haven't been able to test with the legacy engine as tesseract won't recognize the retro traineddata file in the tessdata folder. I'll update this post when I get tesseract 4 --oem 0 working.

Still occurring on the very latest master build (e4b9cff)

Example of the error:

   <div class='ocr_carea' id='block_1_15' title="bbox 192 481 2422 1117">
    <p class='ocr_par' id='par_1_16' lang='ita' title="bbox 192 481 2422 1117">
     <span class='ocr_line' id='line_1_19' title="bbox 674 481 1494 532; baseline 0.002 -14; x_size 52; x_descenders 13; x_ascenders 11">
      <span class='ocrx_word' id='word_1_46' title='bbox 674 483 861 532; x_wconf 91'>Sottoposto</span>
      <span class='ocrx_word' id='word_1_47' title='bbox 0 0 2485 3508; x_wconf 96'>a</span>
      <span class='ocrx_word' id='word_1_48' title='bbox 863 481 1494 521; x_wconf 95'>condizione</span>
      <span class='ocrx_word' id='word_1_49' title='bbox 0 0 2485 3508; x_wconf 96'>risolutiva</span>

Also goes wrong when printing the character level info:

c 2485 0 2485 0 0
a 2485 0 2485 0 0
p 2485 0 2485 0 0
i 2485 0 2485 0 0
t 736 1948 742 1954 0
a 2485 0 2485 0 0
l 2485 0 2485 0 0
e 789 1916 795 1956 0
s 2485 0 2485 0 0
o 2485 0 2485 0 0
c 2485 0 2485 0 0
i 928 1950 932 1956 0
a 932 1950 934 1956 0
l 967 1917 969 1957 0
e 969 1917 973 1957 0

Sintun commented 6 years ago

@willaaam I am not sure if it's a problem of the tesseract console program or at the api level. If it's independet from the console program it's probably a variation of #1712. I also started looking at this problem. A pointer in order to track it down: After an OCR run the result information can be extracted through a result Iterator unique_ptr<tesseract::ResultIterator> ri( tess->GetIterator() ); The bounding box of different detail level (for example tesseract::RIL_PARA, tesseract::RIL_TEXTLINE, tesseract::RIL_WORD, tesseract::RIL_SYMBOL also known as paragraph, text line, word, character) can be obtained through

if( ri )
{
  do
  {
    int x1, y1, x2, y2;
    ri->BoundingBox( level, &x1, &y1, &x2, &y2 );
    if( ri->IsAtFinalElement( higher_level, level ) )
      break;
  }
  while( ri->Next( level ) )
}

Now the bounding boxes on all levels are consistent, in some cases consistently false. So I would start with a minimal failing image by tracking down where the information from BoundingBox originates and where it goes wrong. https://tesseract-ocr.github.io/4.0.0/a02399.html#aae57ed588b6bffae18c15bc02fbe4f68

Doing that is also on my ToDo list, but unfortunately i havn't found the time yet. And our Codebase "found" a temporary solution that lead to beautiful function names like

void tesseractBugFixingCharSizePlausibilityCheck();

willaaam commented 6 years ago

Thanks, I appreciate the pointer, let's see if I can make some time to track this down, hopefully with a friend of mine. This bug breaks all analytics applications that come after tesseract.

And I agree, also crossposted in #1712 so we can nip this one in the bud.

willaaam commented 6 years ago

Quick update - we spent some time on this last night and the bug is definitely at the API level unfortunately.

Using the code below we notice that already using BoundingBoxInternal (code below was snapshotted an hour earlier and still using BoundingBox - but same results) we get whole-page coordinates for the boxes.

Inside the BoundingBoxInternal structure, at least for our sample code, cblob_it is always null, so thats where we are going to resume the hunt and check out the BlobBox.

     case RIL_SYMBOL:
       if (cblob_it_ == NULL)
         box = it_->word()->box_word->BlobBox(blob_index_);
       else
         box = cblob_it_->data()->bounding_box();

Sample API test code below:

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "ita", tesseract::OcrEngineMode::OEM_LSTM_ONLY)) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("/home/user/000001_nonconf.page.png");
    api->SetPageSegMode(tesseract::PSM_AUTO);
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    //hai
  tesseract::ResultIterator* ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
  if (ri != 0) {
    do {
      const char* word = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("word: '%s';  \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
               word, conf, x1, y1, x2, y2);
      delete[] word;
    } while (ri->Next(level));
  }

    // Destroy used object and release memory
    api->End();
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

Sintun commented 6 years ago

Hey there, I also followed the path and got down the following lane: BoundingBox -> BoundingBoxInternal -> restricted_bounding_box -> true_bounding_box -> TBOX WERD::bounding_box And tracked every usage of WERD::bounding_box,WERD::restricted_bounding_box I saw, that the the bounding boxes are fine until the code reaches Tesseract::RetryWithLanguage https://tesseract-ocr.github.io/4.0.0/a02479.html#a8952ab340e0f5e61992109e85cb1619c

Within this function the recognizer (this->*recognizer)(word_data, in_word, &new_words); (which uses LSTMRecognizeWord) https://tesseract-ocr.github.io/4.0.0/a01743.html#ac50ad7dad904ed14e81cd29a3bfdb82d https://tesseract-ocr.github.io/4.0.0/a02479.html#a0478ee100b826566b0b9ea048eee636e is applied. Before it's application the word and character positions are good. (They are initialized before the lstm runs, so that sane regions can be fed into the lstm) After this the resulting new word rectangles can have negative values & probably loose / gain characters at the word start / end.

Now there seem to be two possibilites

The LSTM itself produces crappy output based on the provided neuronal net.
There exists a post processing issue in or near the LSTM code.

Considering, that

word_data.lang_words[ max_index + 1 ]->word->bounding_box(); results into a bounding box with Points containing +/- 32767 Which usage would result in a boundingBox containing the whole image, after it got cropped down to the image borders,
single character boxes moved one to the left or to the right.

I would hope, that this is a +/- 1 error on some pointer in the LSTM bounding box / blob index post processing.

Next time i will continue tracking the issue within LSTMRecognizeWord .

Update: I'm still closing in on this, reached ExtractBestPathAsWords .

FrkBo commented 6 years ago

Don't know if the following is of any help... If you would comment out the following lines in ccstruct/pageres.cpp on lines 1311-1313 if (blob_it.at_first()) blob_it.set_to_list(next_word_blobs); blob_end = (blob_box.right() + blob_it.data()->bounding_box().left()) / 2;

You would end up with a lot more lines where the bounding box is equal to the entire page. Maybe the previous if-statement >> if (!blob_it.at_first() || next_word_blobs != nullptr) << does not cover all applicable cases?

Update Disabling the following line in pageres.cpp (line 1375) seems to 'solve' the issue or at least give better output for the incorrect bounding boxes, but previous 'correct' bounding boxes are changed (and not for the better...) // Delete the fake blobs on the current word. word_w->word->cblob_list()->clear();

Sintun commented 6 years ago

Hey there, I'm still working on this and traced the issue down to the character positions computed from the LSTM output. Unfortunately it seems to be more than an "one off" error. For now i reached the function RecodeBeamSearch::ExtractBestPaths in recodebeam.cpp Debug output from RecodeBeamSearch::ExtractPathAsUnicharIds shows, that best_nodes[i]->duplicate and best_nodes[i]->unichar_id are wrong, and off by more than one. Using the image test_3 the letters u and s are attributed to the position of u and one of the L s gets the position of both L s. Next time i will test the ExtractBestPaths function and follow the source of the wrong values, it shouldn't be far away. I hope that i find the bug source before reaching the LSTM computations.

willaaam commented 6 years ago

Hey there, I'm still working on this and traced the issue down to the character positions computed from the LSTM output. Unfortunately it seems to be more than an "one off" error. For now i reached the function RecodeBeamSearch::ExtractBestPaths in recodebeam.cpp Debug output from RecodeBeamSearch::ExtractPathAsUnicharIds shows, that best_nodes[i]->duplicate and best_nodes[i]->unichar_id are wrong, and off by more than one. Using the image the letters u and s are attributed to the position of u and one of the L s gets the position of both L s. Next time i will test the ExtractBestPaths function and follow the source of the wrong values, it shouldn't be far away. I hope that i find the bug source before reaching the LSTM computations.

Appreciated Sintun! I've had a bit of a rough week (pet passed away) but I hope to be able to help you out somewhat between tomorrow and wednesday. On thursday I have a 14 hour flight ahead of me so I might be able to help out a bit there. But TBH - we're getting deeper in the code than I would've hoped when starting on this, lol.

FrkBo commented 6 years ago

Hi Sintun,

Maybe there are multiple issues here? If I use your image, the bounding boxes are not correct, but do not span the entire page.

Here's my image: 1545-14489-a4-35

In this example (it is a pretty bad OCR btw, so don't mind the spelling mistakes) the TSV output will show the following incorrect output 4 1 4 1 2 0 208 92 209 9 -1
5 1 4 1 2 1 0 0 612 792 11 PLEASE 5 1 4 1 2 2 0 0 612 792 15 EYFCUTE 5 1 4 1 2 3 0 0 612 792 42 THREE 5 1 4 1 2 4 208 92 209 9 20 ORQUAL

Your image produces the following output level page_num block_num par_num line_num word_num left top width height conf text 1 1 0 0 0 0 0 0 600 70 -1
2 1 1 0 0 0 13 22 314 30 -1
3 1 1 1 0 0 13 22 314 30 -1
4 1 1 1 1 0 13 22 314 30 -1
5 1 1 1 1 1 13 22 290 30 91 thousand 5 1 1 1 1 2 305 29 22 23 90 Billion

Which is also weird because Tesseract did recognize the words thousand and Billion as two separate words. capture

Sintun commented 6 years ago

Hi FrkBo,

it is true, that these are two different Bugs. I'm primarily working on #1712 . I moved my work to this bug, because willaaam theorized, that these two bugs have the same origin. And indeed, i found that an out of bound access to PointerVector< WERD_RES > words like words[ <out_of_bound_index> ]->word->bounding_box(); results into a bounding box with Points containing +/- 32767. Such a bounding box would subsequently be cropped down to the image borders and result into a character having the whole image as it's bounding box. The source of such an out of bound access is very likely the same that creates the bounding box bugs on "thousand Billion".

Nonetheless you are right, since it is a different issue i will keep on posting in #1712 . If i find a solution i will test your image against it and report back.

FrkBo commented 6 years ago

Did coincidently find a 'hacky solution' to your issue...

In the function ComputeBlobEnds change the line for (int b = 0; b < length; ++b) { to for (int b = 0; b < length - 1; ++b) {

Other test output seems much better, but still not perfect.

Shreeshrii commented 6 years ago

@FrkBo Please submit a PR if you have a proposed solution.

current code is:

https://github.com/tesseract-ocr/tesseract/blame/5fdaa479da2c52526dac1281871db5c4bdaff359/src/ccstruct/pageres.cpp#L1303

for (int b = 1; b < length; ++b) {

jbreiden commented 6 years ago

I confirmed the problem with current GitHub code on the "thousand Billion" example. However, the copy of Tesseract inside Google does not have the problem. I can think of two possibilities. One is the GitHub copy had a regression; to check this try building an older version of GitHub LSTM Tesseract and see if has the same trouble. The other possibility is some bugfix from Google did not make it to GitHub. I took a look and have found a couple candidates. Two are listed below, and are good to incorporate into GitHub Tesseract no matter what. If none of this helps, I will spend more time investigating.

jbreiden commented 6 years ago

Fix signed integer overflow. If the left isn't left of the right, make the width -1. Caught with ASAN.

--- tesseract/textord/colpartitiongrid.cpp  2017-07-14 07:32:13.000000000 -0700
+++ tesseract/textord/colpartitiongrid.cpp  2017-11-27 11:17:31.000000000 -0800
@@ -1254,7 +1254,7 @@
   const TBOX& box = part->bounding_box();
   int left = part->median_left();
   int right = part->median_right();
-  int width = right - left;
+  int width = right >= left ? right - left : -1;
   int mid_x = (left + right) / 2;
   ColPartitionGridSearch hsearch(this);
   // Search left for neighbour to_the_left

jbreiden commented 6 years ago

Setting line_size to 1, if a part has median_height/median_weight = 0. UBSan reports undefined behavior when inf is being cast into int. The expression has denominator equal to zero.

--- tesseract/textord/colpartitiongrid.cpp  2017-11-27 11:17:31.000000000 -0800
+++ tesseract/textord/colpartitiongrid.cpp  2018-01-26 13:23:35.000000000 -0800
@@ -722,6 +722,7 @@
         to_block->line_spacing = static_cast<float>(box.height());
         to_block->max_blob_size = static_cast<float>(box.height() + 1);
       }
+      if (to_block->line_size == 0) to_block->line_size = 1;
       block_it.add_to_end(block);
       to_block_it.add_to_end(to_block);
     } else {

Shreeshrii commented 6 years ago

@jbreiden Thank you for finding these fixes.

@zdenop can you please create a PR and commit these patches?

zdenop commented 6 years ago

done.

Zdenko

po 17. 9. 2018 o 19:38 Shreeshrii notifications@github.com napísal(a):

@jbreiden https://github.com/jbreiden Thank you for finding these bug fixes.

@zdenop https://github.com/zdenop can you please create a PR and commit these patches?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/1192#issuecomment-422105446, or mute the thread https://github.com/notifications/unsubscribe-auth/AAjCzPwngGjZLEN1N_X7RG9ERBpfswDFks5ub94ogaJpZM4QKZi2 .

willaaam commented 6 years ago

Hi Everyone,

The code probably improved something, but it didn't resolve the issue I described unfortunately. I just git pulled and built the latest master. Used the ita model from tessdata repo.

console command: tesseract 000001_nonconf.page.png - -c tessedit_create_hocr=1 -l ita

git version: commit 5d22fdfeed901ec8b73c2f490ba84da9414f1b79 (HEAD -> master, origin/master, origin/HEAD) Author: Zdenko Podobný <zdenop@gmail.com> Date: Tue Sep 18 18:51:11 2018 +0200

Sample picture included below: 000001_nonconf page

And the output: output.txt

@jbreiden if you pull this picture through the google version of the repo, do you see the same page-sized bounding boxes?

amitdo commented 6 years ago

The fixes were in textord which handles the layout analysis.

@jbreiden, please search for fixes in https://github.com/tesseract-ocr/tesseract/blob/master/src/lstm/recodebeam.cpp https://github.com/tesseract-ocr/tesseract/blob/master/src/ccstruct/pageres.cpp

stweil commented 6 years ago

@zdenop, it looks like this issue needs to be reopened.

FrkBo commented 6 years ago

@Shreeshrii: I don't know whether the 'fix' only solves a symptom or also fixes the root cause.

The fixes provided by jbreiden didn't solve the issue of bounding boxes spanning the whole page in the example I provided earlier.

Update (Hopefully) got a little closer to a / the root cause of this issue. I noticed in pageres.cpp in the function ReplaceCurrentWord the line while (!src_b_it.empty() && src_b_it.data()->bounding_box().x_middle() < end_x) would not trigger for the lines where the bounding box spans the whole page. The attribute x_middle is calculated from the left and right value of the x-axis of the blob. Both values are equal to that of the line and therefore the if statement never triggers. Why does this happen?

It would seem that some lines are not being detected as being underlined. So the blob is not being detected as underlined and consequently the blob is the entire sentence. However the given words to ReplaceCurrentWord are split correctly. This would explain why the while statement is not triggered. All examples provided in this thread would seem to corroborate that the issue is related to underlined sentences.

If I force the underline test in makerow.cpp (.. if (test_underline() ...) to be always true the bounding boxes for the words in my example would seem to be calculated correctly. is this a step in the right direction?

jbreiden commented 6 years ago

@amitdo I'm doing some code comparisons. Looks like pageres.cpp is not the culprit. Still looking at recodebeam.cpp.

amitdo commented 6 years ago

OK, thanks.

jbreiden commented 6 years ago

I'm not seeing the problem in recodebeam.cpp either. I did find a tiny section of code that looks a little suspicious and definitely would benefit from another set of parenthesis. But that's not causing the problem. Did anyone get a chance to bisect the github code to see if the problem is present in older versions?

https://github.com/tesseract-ocr/tesseract/blob/master/src/lstm/recodebeam.cpp#L261

        if (best_glyphs.size() > 0 &&  i == best_glyphs.front().second-1
            || i == xcoords[word_end]-1)

Shreeshrii commented 6 years ago

If I force the underline test in makerow.cpp (.. if (test_underline() ...) to be always true the bounding boxes for the words in my example would seem to be calculated correctly. is this a step in the right direction?

@jbreiden Did you also review makerow.cpp?

amitdo commented 6 years ago

[also see @Shreeshrii's question above]

Did anyone get a chance to bisect the github code to see if the problem is present in older versions?

I did not.

1015 is a similar issue, reported in Jun 30, 2017.

amitdo commented 6 years ago

I confirmed the problem with current GitHub code on the "thousand Billion" example. However, the copy of Tesseract inside Google does not have the problem. I can think of two possibilities. One is the GitHub copy

@jbreiden, please share Google's Tesseract version output.

willaaam commented 6 years ago

I tried out the approach suggested by @FrkBo and while I think it doesnt fix the root cause, it definately prevents the full page bbox issue from happening in the examples I've tried.

I've forked the code to my own repo so you guys can easily verify the impact the suggested change makes.

Maybe it also helps someone who needs a workaround/patch yesterday

https://github.com/willaaam/tesseract?files=1

--edit -- Unfortunately while FrkBo's patch reduces the symptom, it's not gone... so this issue is still at large and really blocking. I'm still trying to find a public example that I can share.

Is there any way I can speed this up, because the root cause is deeper than realistic for myself to solve -- would this be able to be fixed faster is there is a bugfix bounty for example?

jbreiden commented 6 years ago

I'm doing a little bit of bisecting. GitHub Tesseract from 2017-08-04 has the same defect. GitHub Tesseract 2017-06-01 and earlier gives me an assertion. Current Google Tesseract does not have the problem, even if I splice in makerow.cpp recodebeam.cpp and pageres.cpp from GitHub. Last synchronization from GitHub Tesseract to Google Tesseract was on 2017-08-03. Google Tesseract reports 4.00.00alpha as version. So no answer yet, but at least we have more data.

$ export TESSDATA_PREFIX=$HOME/chroot2/usr/share/tesseract-ocr/4.00/tessdata
$ git checkout `git rev-list -n 1 --before="2017-08-04 13:37" master`
$ make clean; autogen.sh && ./configure && make -j 12
$ api/tesseract /tmp/numbers.png - pdf > /tmp/foo.pdf

$ git checkout `git rev-list -n 1 --before="2017-06-01 13:37" master`
$ make clean; autogen.sh && ./configure && make -j 12
$ api/tesseract /tmp/numbers.png - -
lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file tessedit.cpp, line 193

willaaam commented 6 years ago

I'm doing a little bit of bisecting. GitHub Tesseract from 2017-08-04 has the same defect. GitHub Tesseract 2017-06-01 and earlier gives me an assertion. Current Google Tesseract does not have the problem, even if I splice in makerow.cpp recodebeam.cpp and pageres.cpp from GitHub. Last synchronization from GitHub Tesseract to Google Tesseract was on 2017-08-03. Google Tesseract reports 4.00.00alpha as version. So no answer yet, but at least we have more data.
$ export TESSDATA_PREFIX=$HOME/chroot2/usr/share/tesseract-ocr/4.00/tessdata
$ git checkout `git rev-list -n 1 --before="2017-08-04 13:37" master`
$ make clean; autogen.sh && ./configure && make -j 12
$ api/tesseract /tmp/numbers.png - pdf > /tmp/foo.pdf
$ git checkout `git rev-list -n 1 --before="2017-06-01 13:37" master`
$ make clean; autogen.sh && ./configure && make -j 12
$ api/tesseract /tmp/numbers.png - -
lstm_recognizer_->DeSerialize(&fp):Error:Assert failed:in file tessedit.cpp, line 193

Would it be possible to share "google tesseract" in order to track down the root cause at the date of the last merge (2017-08-03) since there arguably the relevant diffs are most likely the smallest. I'm willing to sign an NDA if that's what it takes.

jbreiden commented 6 years ago

So far, I have no evidence to suggest GitHub Tesseract has ever worked correctly on this data. An NDA is too much paperwork. Let me instead suggest a "pair debugging" session where we can look together at intermediate data structures, and try to track down where things diverge.

willaaam commented 6 years ago

So far, I have no evidence to suggest GitHub Tesseract has ever worked correctly on this data. An NDA is too much paperwork. Let me instead suggest a "pair debugging" session where we can look together at intermediate data structures, and try to track down where things diverge.

Good stuff @jbreiden - let's set something up. I'm not too knowledgeable on C++ (unless you count borland stuff ages ago) but it might make sense to invite @Sintun as well, he's been working on this for quite a while. My buddy who helped me trace this bug in the first place would be happy to join as well, he's quite well versed in C++.

Sintun commented 6 years ago

Hi @jbreiden @willaaam , I traced the problem back into RecodeBeamSearch::Decode as can be seen in #1712 . I invested another day, but was not able to find faulty code. Doing this i got the insight, that this behavior is model dependent (last comment in #1712), so for now i think that the coordinates originating from the LSTM are not precise enough (maybe it's precise enough for the english model) to serve as the basis for character and word position determination. From my position this "bug" is more like a feature of the LSTM approach and my time would be better spent by creating a workaround that uses the layout analysis / segmenter information for position readjustments or by adjusting the model training to generate better positional information. Nonetheless i hope, that my conclusion is wrong and that you find the source of this bug. @jbreiden GitHub Tesseract on my "thousand Billion" image gives correct positional information while using the normal or best english or the fast german model and it fails on the normal, best german and the fast english model. Gives the Google version correct positional information on all models ?

PS: I'm only speaking about word positions, character positions are off for all models.

amitdo commented 6 years ago

It certainly a bug in the cases where you get complete garbage bboxes.

FrkBo commented 6 years ago

Or when the TSV output is really a mess... While investigating this issue I looked into an example with Japanese characters (from another issue on this forum) and adapted, sliced & diced the image to determine whether there was one character that might mess up the output (to a certain extent to test the same hypothesis as Sintun whether the model might be the issue). The results were surprising!

27729378-9d2f8196-5dc0-11e7-9c99-5090652734bb_7

TSV output 5 1 1 1 1 1 14 16 316 49 80 」んはんは 5 1 1 1 1 2 343 20 48 45 92 ん 5 1 1 1 1 3 401 20 47 45 92 ば 5 1 1 1 1 4 440 16 63 49 91 ん 5 1 1 1 1 5 513 20 47 45 91 は 2 1 2 0 0 0 5 91 553 194 -1
3 1 2 1 0 0 5 91 553 194 -1
4 1 2 1 1 0 241 91 317 51 -1
5 1 2 1 1 1 241 94 29 47 87 り 5 1 2 1 1 2 399 92 66 50 90 年 5 1 2 1 1 3 470 94 34 45 96 は 5 1 2 1 1 4 510 91 48 51 96 初 4 1 2 1 2 0 5 161 216 50 -1
5 1 2 1 2 1 5 161 216 50 83 年て人出

A bit odd to see how multiple characters can be recognized as one character. I compiled a fresh version of Github Tesseract to make sure I wasn't the culprit.

On another note, while looking through recodebeam.cpp the following caught my attention: for (int i = word_start; i < word_end; ++i) { int min_half_width = xcoords[i + 1] - xcoords[i]; if (i > 0 && xcoords[i] - xcoords[i - 1] < min_half_width) min_half_width = xcoords[i] - xcoords[i - 1]; if (min_half_width < 1) min_half_width = 1; // Make a fake blob. TBOX box(xcoords[i] - min_half_width, 0, xcoords[i] + min_half_width, line_box.height());

The variable min_half_width implies it should be a half width of the character but is never halved. And when the bounding box is determined the width ends up as 2 widths of the character. Might be intentional, but seems weird from a distance. Maybe the variable needs to be renamed. If you would divide min_half_width by 2 it would prevent some bounding boxes spanning the whole page but the coordinates are still off.

Sintun commented 6 years ago

Hello again :) since i have a workaround for #1712 , i started looking more closely (and independent of #1712) on this problem. And i found a few hints I want to share with you.

The bounding boxes spanning the whole page originate from

// Returns the bounding box of only the good blobs.
TBOX WERD::true_bounding_box() const {
  TBOX box;  // box being built
  // This is a read-only iteration of the good blobs.
  if( cblobs.length() == 0 )
    printf("(E) cblobs_size = %d, ", cblobs.length() );
  C_BLOB_IT it(const_cast<C_BLOB_LIST*>(&cblobs));
  for (it.mark_cycle_pt(); !it.cycled_list(); it.forward()) {
    box += it.data()->bounding_box();
  }
  return box;
}

The reason is, that cblobs.length() is zero for word boxes that span the entire page. My example output on the image provided by @willaaam

~ cblobs_size = 0, rej_cblobs_size = 0, BoundingBoxInternal [32767,-32767; -32767,32767] word: 'risolutiva'; conf: 96.56; BoundingBox: 0,0,2485,286; 1.000000

Now I looked at the cblobs generated in lstmrecognizer (they are not empty at that point) and followed them through the code to the point where we loose the cblobs. If i understand the data structures correctly this happens in ccstruct/pageres.cpp in the function ReplaceCurrentWord that is used in ccmain/control.cpp in function classify_word_and_language at the line pr_it->ReplaceCurrentWord(&best_words); (around line 1400)

at least the code (in ccmain/control.cpp line ~1400)

// Words came from LSTM, and must be moved to the PAGE_RES properly.
if( best_words.back()->word->cblob_list()->empty() )
  printf("(E) classify_word_and_language best_words broken\n" );
word_data->word = best_words.back();
if( word_data->word->word->cblob_list()->empty() )
  printf("(E) classify_word_and_language pointer copy broken\n" );
pr_it->ReplaceCurrentWord(&best_words);
if( word_data->word->word->cblob_list()->empty() )
  printf("(E) classify_word_and_language lost cblobs after "
    "pr_it->ReplaceCurrentWord\n" );
if( pr_it->word()->word->cblob_list()->empty() )
  printf("(E) classify_word_and_language blob list not transferred to"
  " pr_it !\n" );

gives the output

(E) classify_word_and_language lost cblobs after pr_it->ReplaceCurrentWord
(E) classify_word_and_language blob list not transferred to pr_it !

amitdo commented 6 years ago

Bounding boxes for words in ocropy: https://github.com/tmbdev/ocropy/pull/314

theraysmith commented 6 years ago

Jeff and I spent some time last week looking at this problem. Unfortunately, it is due to the network architecture. Very briefly, the forward-backward-forward network architecture (See tutorial slides https://github.com/tesseract-ocr/docs/blob/master/das_tutorial2016/6ModernizationEfforts.pdf for a picture) allows the network to output characters at a position unrelated to the image. See the attached image for an example that shows why the word boxes are wrong. The green lines show where the network thinks the characters are. The character bounding boxes therefore will always be inaccurate, unless we can retrain with a standard bidi LSTM on the top layer. (That should fix it for almost all cases. Whether the same character accuracy can be achieved remains to be seen.) As a work-around I have a code change that will correct the word bounding boxes most of the time. That will be uploaded to github soon...

amitdo commented 6 years ago

Ray and Jeff, Thanks for the investigation :-)

willaaam commented 6 years ago

I think we're still describing two cases here:

Full page bounding boxes
Incorrect word level bounding boxes

On the second: I get that the inaccurate character level bounding boxes are a consequence of the beam search implementation - that makes sense.

What I don't get is in FrkBo's example the "thousand" bbox goes well into "Billion". While the beam search is by definition not fully accurate, we know the position of the search and it did find Billion, so assuming that B got recognized properly, the bounding box for thousand should never be able to move in to the space of Billion.

On the first issue: What doesn't make sense to me intuitively is why there are full page bounding boxes. The page layout engine determines the lines and while the vertical separator might be a few pixels off due to beam searching, I dont see the impact on the horizontal separation here, as the entire line enters the beam search and should be able to get mapped back.

Regardless - I'm guessing the fix you're working on ray will help with word level accuracy but doesnt fix the whole page bounding boxes.

jbreiden commented 6 years ago

Ray and I have been focusing on the "thousand Billion" image because it is so small therefore easier to debug. Not much thought spent on the full page bounding box report. Please note that I managed to fool myself during some earlier investigation. Contrary to what I said earlier, the problem was present in Google's copy of Tesseract.

FrkBo commented 6 years ago

@jbreiden Did you have the opportunity yet to look at makerow.cpp? The full page bounding bounding boxes seem to be (partly) related to lines not being detected as underlined and the sentence is just one blob. Also Vidiecan really made a good effort in describing the resulting issue in #2024.

vidiecan commented 6 years ago

@frkbo This issue represents (at least) two unrelated one with the same result - wrong coordinates.

The first one described by @theraysmith / @jbreiden and in #2024 above is because of the LSTM models returning x-coordinates that are more or less offset.

The second one is simply because the cblobs representing individual letters (which are used to get the final bbox) are missing because of noise below like an underline.

jbreiden commented 6 years ago

From Ray. Feedback welcome.

Fix github issue 1192 (as best as it can be without retraining). Blob bounding boxes are clipped to the corresponding word in the page layout analysis, provided there is a clear set of 1 or more significantly overlapping source words. Where the LSTM has resegmented the words, there is no choice, but to use the bounding boxes provided by the LSTM model, which will be inaccurate.

--- tesseract/ccstruct/pageres.cpp  2016-12-13 16:56:14.000000000 -0800
+++ tesseract/ccstruct/pageres.cpp  2018-11-09 10:28:33.000000000 -0800
@@ -21,13 +21,14 @@
  ** limitations under the License.
  *
  **********************************************************************/
-#include          <stdlib.h>
+#include "pageres.h"
+#include <stdlib.h>
 #ifdef __UNIX__
-#include          <assert.h>
+#include <assert.h>
 #endif
-#include          "blamer.h"
-#include          "pageres.h"
-#include          "blobs.h"
+#include "blamer.h"
+#include "blobs.h"
+#include "helpers.h"

 ELISTIZE (BLOCK_RES)
 CLISTIZE (BLOCK_RES) ELISTIZE (ROW_RES) ELISTIZE (WERD_RES)
@@ -1293,7 +1294,8 @@
 // Helper computes the boundaries between blobs in the word. The blob bounds
 // are likely very poor, if they come from LSTM, where it only outputs the
 // character at one pixel within it, so we find the midpoints between them.
-static void ComputeBlobEnds(const WERD_RES& word, C_BLOB_LIST* next_word_blobs,
+static void ComputeBlobEnds(const WERD_RES& word, const TBOX& clip_box,
+                            C_BLOB_LIST* next_word_blobs,
                             GenericVector<int>* blob_ends) {
   C_BLOB_IT blob_it(word.word->cblob_list());
   for (int i = 0; i < word.best_state.size(); ++i) {
@@ -1313,8 +1315,74 @@
         blob_it.set_to_list(next_word_blobs);
       blob_end = (blob_box.right() + blob_it.data()->bounding_box().left()) / 2;
     }
+    blob_end = ClipToRange<int>(blob_end, clip_box.left(), clip_box.right());
     blob_ends->push_back(blob_end);
   }
+  blob_ends->back() = clip_box.right();
+}
+
+// Helper computes the bounds of a word by restricting it to existing words
+// that significantly overlap.
+static TBOX ComputeWordBounds(const tesseract::PointerVector<WERD_RES>& words,
+                              int w_index, TBOX prev_box, WERD_RES_IT w_it) {
+  constexpr int kSignificantOverlapFraction = 4;
+  TBOX clipped_box;
+  TBOX current_box = words[w_index]->word->bounding_box();
+  TBOX next_box;
+  if (w_index + 1 < words.size() && words[w_index + 1] != nullptr &&
+      words[w_index + 1]->word != nullptr)
+    next_box = words[w_index + 1]->word->bounding_box();
+  for (w_it.forward(); !w_it.at_first() && w_it.data()->part_of_combo;
+       w_it.forward()) {
+    if (w_it.data() == nullptr || w_it.data()->word == nullptr) continue;
+    TBOX w_box = w_it.data()->word->bounding_box();
+    int height_limit = std::min<int>(w_box.height(), w_box.width() / 2);
+    int width_limit = w_box.width() / kSignificantOverlapFraction;
+    int min_significant_overlap = std::max(height_limit, width_limit);
+    int overlap = w_box.intersection(current_box).width();
+    int prev_overlap = w_box.intersection(prev_box).width();
+    int next_overlap = w_box.intersection(next_box).width();
+    if (overlap > min_significant_overlap) {
+      if (prev_overlap > min_significant_overlap) {
+        // We have no choice but to use the LSTM word edge.
+        clipped_box.set_left(current_box.left());
+      } else if (next_overlap > min_significant_overlap) {
+        // We have no choice but to use the LSTM word edge.
+        clipped_box.set_right(current_box.right());
+      } else {
+        clipped_box += w_box;
+      }
+    }
+  }
+  if (clipped_box.height() <= 0) {
+    clipped_box.set_top(current_box.top());
+    clipped_box.set_bottom(current_box.bottom());
+  }
+  if (clipped_box.width() <= 0) clipped_box = current_box;
+  return clipped_box;
+}
+
+// Helper moves the blob from src to dest. If it isn't contained by clip_box,
+// the blob is replaced by a fake that is contained.
+static TBOX MoveAndClipBlob(C_BLOB_IT* src_it, C_BLOB_IT* dest_it,
+                            const TBOX& clip_box) {
+  C_BLOB* src_blob = src_it->extract();
+  TBOX box = src_blob->bounding_box();
+  if (!clip_box.contains(box)) {
+    int left =
+        ClipToRange<int>(box.left(), clip_box.left(), clip_box.right() - 1);
+    int right =
+        ClipToRange<int>(box.right(), clip_box.left() + 1, clip_box.right());
+    int top =
+        ClipToRange<int>(box.top(), clip_box.bottom() + 1, clip_box.top());
+    int bottom =
+        ClipToRange<int>(box.bottom(), clip_box.bottom(), clip_box.top() - 1);
+    box = TBOX(left, bottom, right, top);
+    delete src_blob;
+    src_blob = C_BLOB::FakeBlob(box);
+  }
+  dest_it->add_after_then_move(src_blob);
+  return box;
 }

 // Replaces the current WERD/WERD_RES with the given words. The given words
@@ -1365,66 +1433,45 @@
   src_b_it.sort(&C_BLOB::SortByXMiddle);
   C_BLOB_IT rej_b_it(input_word->word->rej_cblob_list());
   rej_b_it.sort(&C_BLOB::SortByXMiddle);
+  TBOX clip_box;
   for (int w = 0; w < words->size(); ++w) {
     WERD_RES* word_w = (*words)[w];
+    clip_box = ComputeWordBounds(*words, w, clip_box, wr_it_of_current_word);
     // Compute blob boundaries.
     GenericVector<int> blob_ends;
     C_BLOB_LIST* next_word_blobs =
         w + 1 < words->size() ? (*words)[w + 1]->word->cblob_list() : NULL;
-    ComputeBlobEnds(*word_w, next_word_blobs, &blob_ends);
-    // Delete the fake blobs on the current word.
+    ComputeBlobEnds(*word_w, clip_box, next_word_blobs, &blob_ends);
+    // Remove the fake blobs on the current word, but keep safe for back-up if
+    // no blob can be found.
+    C_BLOB_LIST fake_blobs;
+    C_BLOB_IT fake_b_it(&fake_blobs);
+    fake_b_it.add_list_after(word_w->word->cblob_list());
+    fake_b_it.move_to_first();
     word_w->word->cblob_list()->clear();
     C_BLOB_IT dest_it(word_w->word->cblob_list());
     // Build the box word as we move the blobs.
     tesseract::BoxWord* box_word = new tesseract::BoxWord;
-    for (int i = 0; i < blob_ends.size(); ++i) {
+    for (int i = 0; i < blob_ends.size(); ++i, fake_b_it.forward()) {
       int end_x = blob_ends[i];
       TBOX blob_box;
       // Add the blobs up to end_x.
       while (!src_b_it.empty() &&
              src_b_it.data()->bounding_box().x_middle() < end_x) {
-        blob_box += src_b_it.data()->bounding_box();
-        dest_it.add_after_then_move(src_b_it.extract());
+        blob_box += MoveAndClipBlob(&src_b_it, &dest_it, clip_box);
         src_b_it.forward();
       }
       while (!rej_b_it.empty() &&
              rej_b_it.data()->bounding_box().x_middle() < end_x) {
-        blob_box += rej_b_it.data()->bounding_box();
-        dest_it.add_after_then_move(rej_b_it.extract());
+        blob_box += MoveAndClipBlob(&rej_b_it, &dest_it, clip_box);
         rej_b_it.forward();
       }
-      // Clip to the previously computed bounds. Although imperfectly accurate,
-      // it is good enough, and much more complicated to determine where else
-      // to clip.
-      if (i > 0 && blob_box.left() < blob_ends[i - 1])
-        blob_box.set_left(blob_ends[i - 1]);
-      if (blob_box.right() > end_x)
-        blob_box.set_right(end_x);
+      if (blob_box.null_box()) {
+        // Use the original box as a back-up.
+        blob_box = MoveAndClipBlob(&fake_b_it, &dest_it, clip_box);
+      }
       box_word->InsertBox(i, blob_box);
     }
-    // Fix empty boxes. If a very joined blob sits over multiple characters,
-    // then we will have some empty boxes from using the middle, so look for
-    // overlaps.
-    for (int i = 0; i < box_word->length(); ++i) {
-      TBOX box = box_word->BlobBox(i);
-      if (box.null_box()) {
-        // Nothing has its middle in the bounds of this blob, so use anything
-        // that overlaps.
-        for (dest_it.mark_cycle_pt(); !dest_it.cycled_list();
-             dest_it.forward()) {
-          TBOX blob_box = dest_it.data()->bounding_box();
-          if (blob_box.left() < blob_ends[i] &&
-              (i == 0 || blob_box.right() >= blob_ends[i - 1])) {
-            if (i > 0 && blob_box.left() < blob_ends[i - 1])
-              blob_box.set_left(blob_ends[i - 1]);
-            if (blob_box.right() > blob_ends[i])
-              blob_box.set_right(blob_ends[i]);
-            box_word->ChangeBox(i, blob_box);
-            break;
-          }
-        }
-      }
-    }
     delete word_w->box_word;
     word_w->box_word = box_word;
     if (!input_word->combination) {
@@ -1545,6 +1592,7 @@
       }
     }
     ASSERT_HOST(!word_res_it.cycled_list());
+    wr_it_of_next_word = word_res_it;
     word_res_it.forward();
   } else {
     // word_res_it is OK, but reset word_res and prev_word_res if needed.
@@ -1582,6 +1630,7 @@
   block_res = next_block_res;
   row_res = next_row_res;
   word_res = next_word_res;
+  wr_it_of_current_word = wr_it_of_next_word;
   next_block_res = NULL;
   next_row_res = NULL;
   next_word_res = NULL;
@@ -1610,6 +1659,7 @@
         next_block_res = block_res_it.data();
         next_row_res = row_res_it.data();
         next_word_res = word_res_it.data();
+        wr_it_of_next_word = word_res_it;
         word_res_it.forward();
         goto foundword;
       }
--- tesseract/ccstruct/pageres.h    2016-11-07 07:44:03.000000000 -0800
+++ tesseract/ccstruct/pageres.h    2018-11-09 10:28:33.000000000 -0800
@@ -772,5 +772,9 @@
   BLOCK_RES_IT block_res_it;   // iterators
   ROW_RES_IT row_res_it;
   WERD_RES_IT word_res_it;
+  // Iterators used to get the state of word_res_it for the current word.
+  // Since word_res_it is 2 words further on, this is otherwise hard to do.
+  WERD_RES_IT wr_it_of_current_word;
+  WERD_RES_IT wr_it_of_next_word;
 };
 #endif

Shreeshrii commented 6 years ago

@zdenop please create a PR with the changes suggested by Ray.

stweil commented 6 years ago

@jbreiden, can you create a Git patch using git format-patch for Ray's commit? I could merge it then, so commit author and date would be preserved.

zdenop commented 5 years ago

commited. @stweil: you can set date and author directly in git... I did it with: git commit --amend -m "fix issue #1192" --no-edit --date "2018-11-09 10:28:33.000000000 -0800" --author="Ray Smith <theraysmith@gmail.com>" So it still transparent who made patch and who push it to github.

tesseract-ocr / tesseract

Noise characters recognized with bbox as the entire page #1192

Environment

Current Behavior:

Expected Behavior:

Suggested Fix:

1015 is a similar issue, reported in Jun 30, 2017.