tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0
62.26k stars 9.51k forks source link

Incorrect word coordinates #1712

Open troplin opened 6 years ago

troplin commented 6 years ago

Environment

We've integrated the engine in our (closed-source) application using the C API, so I cannot share the actual code.

What I do is basically just iterating over the result iterator using RIL_WORD, get the text, bounding box and baseline for each word and then creating a PDF out of it, with the recognized text as a red overlay and drawing the bounding boxes in green for better visibility.

But the official PDF output config has the same flaw, it's just more difficult to spot:

tesseract andromeda.png andromeda.tess4cli pdf

produces the following PDF: andromeda.tess4cli.pdf

Files:

Behavior can be reproduced using the following PNG (which is a part of a bigger file):

andromeda

Current Behavior:

The coordinates are correct for most words. But for some words, there seems to be an error in computing the boundaries between the words. Couriously, the wrong boundary is always before the last character of the previous word. I almost looks like some kind of off-by-one error.

grafik

grafik

Expected Behavior:

Result with tesseract 3:

grafik

grafik

willaaam commented 6 years ago

Seems like this could be related to the issue I'm facing.

https://github.com/tesseract-ocr/tesseract/issues/1192#issuecomment-418182462

amitdo commented 6 years ago

I don't think there is a bug in this case.

The lstm network only returns xcoords. AFAIK, each xcoord is just one spot (pixel) in each recognized glyph. The spot can be in the beginning, middle or end of each glyph. Because the lstm net does not produce bboxes, tesseract tries to estimate the bboxes from these xcoords. This process can't be done accurately.

In cases where some bboxes are just complete garbage, there is certainly a bug somewhere, probably in the conversion from xcoords to bboxes.

Don't mix between the 'no bug' cases and the buggy cases.

amitdo commented 6 years ago

Thinking about this again. It's possible that there is a small bug here.

As said, the cases with complete garbage boxes are certainly bugs.

Sintun commented 6 years ago

Hey there, I'm still working on this bug. By looking at the internal values of tesseract applied to test_3 I tracked the issue down to RecodeBeamSearch::Decode. After the application of this function beam_[beam_size_ - 1]->beams_[beam_index].get(h).data contains slightly off positions, duplicates, unichar_id s for all valid values of beam_index, h. For example, the positions of u and s are attributed to the left and right half of the letter u.

It is possible, that this function works correctly but gets the off-values from the lstmrecognizer. I will investigate this the next time i can take a look at this.

Reproducible by dropping the following code at the end of RecodeBeamSearch::Decode

{
    GenericVector<const RecodeNode*> _best_nodes;
    GenericVector<const RecodeNode*>* best_nodes = &_best_nodes;
    const RecodeBeam* last_beam = beam_[beam_size_ - 1];
    unsigned node_ct = 0;
    for (int c = 0; c < NC_COUNT; ++c)
    {
      NodeContinuation cont = static_cast<NodeContinuation>(c);
      for (int is_dawg = 0; is_dawg < 2; ++is_dawg)
      {
        int beam_index = BeamIndex(is_dawg, cont, 0);
        int heap_size = last_beam->beams_[beam_index].size();
        for (int h = 0; h < heap_size; ++h)
        {
          const RecodeNode* node = &last_beam->beams_[beam_index].get(h).data;
          ExtractPath(node, best_nodes);
          ++node_ct;

          int dbg_ct = 0;
          for( unsigned i = 0; i < best_nodes->size(); ++i )
          {
            if( (*best_nodes)[i]->unichar_id != INVALID_UNICHAR_ID )
              printf("~ %d %d x-break pos %d, "
                "conf:%f, duplicate:%d, u-id:%d\n", node_ct, dbg_ct++, int(std::round(i*2.75f)),
                (*best_nodes)[i]->certainty,int((*best_nodes)[i]->duplicate),
                (*best_nodes)[i]->unichar_id);
          }
        }
      }
    }
  }

Multiplication by 2.75 is done in order to obtain the original image coordinates. This factor probably only applies to exactly this image.

Example output (for one of the beam_index, h combinations) :

~ 2 13 x-break pos 61, conf:-0.085068, duplicate:1, u-id:101
~ 2 14 x-break pos 72, conf:-0.085045, duplicate:0, u-id:91
~ 2 15 x-break pos 74, conf:-0.085093, duplicate:1, u-id:91
~ 2 16 x-break pos 77, conf:-0.085023, duplicate:1, u-id:91
~ 2 17 x-break pos 80, conf:-0.089978, duplicate:1, u-id:91
~ 2 18 x-break pos 83, conf:-0.085023, duplicate:1, u-id:91
~ 2 19 x-break pos 91, conf:-0.085059, duplicate:0, u-id:94
~ 2 20 x-break pos 94, conf:-0.086384, duplicate:1, u-id:94
~ 2 21 x-break pos 96, conf:-0.085012, duplicate:1, u-id:94
~ 2 22 x-break pos 99, conf:-0.085054, duplicate:1, u-id:94
~ 2 23 x-break pos 102, conf:-0.085051, duplicate:1, u-id:94
~ 2 24 x-break pos 113, conf:-0.085026, duplicate:0, u-id:89
~ 2 25 x-break pos 116, conf:-0.085202, duplicate:1, u-id:89

u-id 91 and 94 are the letters u and s. the x-break positions show that they are attributed to the position of u in the original image.

Sintun commented 6 years ago

Hey there,

I invested another day, and got into the LSTM decoder. I found nothing wrong with it and so i tested the outcome of different lstm models for the same and different languages. It turns out that this behavior is strongly model dependent, for example the standard and best english model as well as the fast german give the correct bounding boxes for the "thousand Billion" example, while other models of these languages fail to do so. So i came to the conclusion, that the LSTM output is unreliable for exact character & word positions. Nonetheless the output of the segmenter ist still very good, it just gets discarded before the lines go into the LSTM.

I will write a little function, that first uses tesseracts layout analysis to obtain word (and probably character) positions and then overwrite the LSTM results with the results of the layout analysis when possible.

So the bug source is unfixable for me and a possible workaround can be done. I don't think workarounds should be part of tesseract, so i will do this outside of Tesseract after getting the necessary information via the api.

At the moment I'm planning to only create this workaround function for my own project. But I'm willing to integrate this workaround in the tesseract command line tool if someone is interested in using it.

Shreeshrii commented 6 years ago

I will write a little function, that first uses tesseracts layout analysis to obtain word (and probably character) positions and then overwrite the LSTM results with the results of the layout analysis when possible.

@Sintun I think this function will be useful for many Tesseract users, specially those who have been relying on the character coordinates in older versions. It would be great if you will integrate it in tesseract.

Shreeshrii commented 6 years ago

I will write a little function, that first uses tesseracts layout analysis to obtain word (and probably character) positions and then overwrite the LSTM results with the results of the layout analysis when possible.

@Sintun I think this function will be useful for many Tesseract users, specially those who have been relying on the character coordinates in older versions. It would be great if you will integrate it in tesseract.

FarhadKhalafi commented 6 years ago

I am wondering if what I see in this screenshot is related to the same problem. The text is correctly extracted but the bounding rectangles are wrong for a couple of words.

image

Sintun commented 6 years ago

Hi @FarhadKhalafi , yes that is this bug. I have a working workaround (for my own project) that reliably restores the word boxes and gives some reasonable results for character positions. It should also work for #1192. I'm planning on integrating it into tesseract within the next 2-3 weeks.

FarhadKhalafi commented 6 years ago

Hi @Sintun , Thanks for the update. Accurate location of, at least, words is often critical. I use the OCR text for table extraction and this kind of shift will break my algorithms. If you are willing to share your fix, I might be able to assist or at least test.

Sintun commented 6 years ago

Specifically testing would be greatly appreciated :) Sharing the workaround code would be of small value, because it solves the problem after the tesseract api calls (even after the tess object destruction) in my project. I will refactor this in order to do this within tesseract.

willaaam commented 6 years ago

Even sharing that might help someone Sintun, a person in my team is writing that after API fix as well for python (it corrects the HOCR). I'll be sharing that as well when its done.

Sintun commented 6 years ago

Hey there, i just started to write my workaround and looked at the state of affairs. I had to realize, that ray created a native workaround using the historically grown tesseract objects. From my perspective, this is the best case, because i wouldn't had the time to do this in the tess framework. A closer look reveals, that this is similiar to my external approach, so i'm stopping my work on this. As can be seen in #1192 it should find it's way into the git repo within the next days.

zdenop commented 6 years ago

Please test current code.

FarhadKhalafi commented 5 years ago

I tested the new code for accurate placement of bounding rectangles for recognized words. The code works much better in identifying the left and right edges of words. However, I did observe certain anomalies with the vertical placement of bounding rectangles. I have attached a ZIP of two TIFF files. The file named 85754_full.tif shows the vertical placement problem:

image

If I crop this image and zoom on the problem area, then the anomaly disappears (see 85754_crop.tif): image

My guess is that the surrounding boxes influence the vertical placement when the text lines are detected.

85754.zip