Open troplin opened 6 years ago
Seems like this could be related to the issue I'm facing.
https://github.com/tesseract-ocr/tesseract/issues/1192#issuecomment-418182462
I don't think there is a bug in this case.
The LSTM network only returns x coordinates (xcoords). AFAIK, each xcoord is just one spot (pixel) within a recognized glyph; the spot can be at the beginning, middle, or end of the glyph. Because the LSTM net does not produce bounding boxes, Tesseract tries to estimate the bboxes from these xcoords, and this process cannot be done accurately.
In cases where some bboxes are just complete garbage, there is certainly a bug somewhere, probably in the conversion from xcoords to bboxes.
Don't conflate the 'no bug' cases with the buggy cases.
Thinking about this again. It's possible that there is a small bug here.
As said, the cases with complete garbage boxes are certainly bugs.
Hey there, I'm still working on this bug.
By looking at Tesseract's internal values for the example image, I tracked the issue down to RecodeBeamSearch::Decode. After this function runs,
beam_[beam_size_ - 1]->beams_[beam_index].get(h).data
contains slightly off positions, duplicates, and unichar_ids for all valid values of beam_index and h.
For example, the positions of u and s are attributed to the left and right half of the letter u.
It is possible that this function works correctly but gets the off values from the LSTM recognizer. I will investigate this the next time I can take a look at it.
Reproducible by dropping the following code at the end of RecodeBeamSearch::Decode:
{
  GenericVector<const RecodeNode*> _best_nodes;
  GenericVector<const RecodeNode*>* best_nodes = &_best_nodes;
  const RecodeBeam* last_beam = beam_[beam_size_ - 1];
  unsigned node_ct = 0;
  // Walk every (continuation, dawg) beam of the last timestep and dump
  // the decoded path behind each heap entry.
  for (int c = 0; c < NC_COUNT; ++c) {
    NodeContinuation cont = static_cast<NodeContinuation>(c);
    for (int is_dawg = 0; is_dawg < 2; ++is_dawg) {
      int beam_index = BeamIndex(is_dawg, cont, 0);
      int heap_size = last_beam->beams_[beam_index].size();
      for (int h = 0; h < heap_size; ++h) {
        const RecodeNode* node = &last_beam->beams_[beam_index].get(h).data;
        ExtractPath(node, best_nodes);
        ++node_ct;
        int dbg_ct = 0;
        for (unsigned i = 0; i < best_nodes->size(); ++i) {
          if ((*best_nodes)[i]->unichar_id != INVALID_UNICHAR_ID) {
            printf("~ %d %d x-break pos %d, conf:%f, duplicate:%d, u-id:%d\n",
                   node_ct, dbg_ct++, int(std::round(i * 2.75f)),
                   (*best_nodes)[i]->certainty,
                   int((*best_nodes)[i]->duplicate),
                   (*best_nodes)[i]->unichar_id);
          }
        }
      }
    }
  }
}
The multiplication by 2.75 converts timestep indices back to original image coordinates. This factor probably applies only to exactly this image.
Example output (for one of the beam_index, h combinations):
~ 2 13 x-break pos 61, conf:-0.085068, duplicate:1, u-id:101
~ 2 14 x-break pos 72, conf:-0.085045, duplicate:0, u-id:91
~ 2 15 x-break pos 74, conf:-0.085093, duplicate:1, u-id:91
~ 2 16 x-break pos 77, conf:-0.085023, duplicate:1, u-id:91
~ 2 17 x-break pos 80, conf:-0.089978, duplicate:1, u-id:91
~ 2 18 x-break pos 83, conf:-0.085023, duplicate:1, u-id:91
~ 2 19 x-break pos 91, conf:-0.085059, duplicate:0, u-id:94
~ 2 20 x-break pos 94, conf:-0.086384, duplicate:1, u-id:94
~ 2 21 x-break pos 96, conf:-0.085012, duplicate:1, u-id:94
~ 2 22 x-break pos 99, conf:-0.085054, duplicate:1, u-id:94
~ 2 23 x-break pos 102, conf:-0.085051, duplicate:1, u-id:94
~ 2 24 x-break pos 113, conf:-0.085026, duplicate:0, u-id:89
~ 2 25 x-break pos 116, conf:-0.085202, duplicate:1, u-id:89
u-id 91 and 94 are the letters u and s. The x-break positions show that both are attributed to the position of u in the original image.
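For readers following along: in the output above, the duplicate flag marks continuation timesteps of the same glyph, so the first duplicate:0 entry of each run is where a glyph starts. A sketch of that post-processing (NodeOut and GlyphStarts are my own illustrative names; the real conversion lives in Tesseract's LSTM result handling):

```cpp
#include <cmath>
#include <utility>
#include <vector>

// One entry per network timestep, mirroring the debug output above.
struct NodeOut { int unichar_id; bool duplicate; };

// Collapse duplicate runs into one (unichar_id, image x) pair per glyph,
// scaling the timestep index by the image-specific factor (2.75 here).
std::vector<std::pair<int, int>> GlyphStarts(const std::vector<NodeOut>& nodes,
                                             float scale) {
  std::vector<std::pair<int, int>> starts;
  for (size_t t = 0; t < nodes.size(); ++t) {
    if (!nodes[t].duplicate)  // first timestep of a new glyph
      starts.emplace_back(nodes[t].unichar_id,
                          static_cast<int>(std::round(t * scale)));
  }
  return starts;
}
```

Applied to a run like {91, 91, 91, 94, 94} (with only the first of each u-id marked duplicate:0), this yields one start position per glyph, which is how the x-break columns above collapse to per-character positions.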
Hey there,
I invested another day and got into the LSTM decoder. I found nothing wrong with it, so I tested the output of different LSTM models for the same and for different languages. It turns out that this behavior is strongly model-dependent: for example, the standard and best English models as well as the fast German model give the correct bounding boxes for the "thousand Billion" example, while other models for these languages fail to do so. So I came to the conclusion that the LSTM output is unreliable for exact character and word positions. Nonetheless, the output of the segmenter is still very good; it just gets discarded before the lines go into the LSTM.
I will write a little function that first uses Tesseract's layout analysis to obtain word (and probably character) positions and then overwrites the LSTM results with the results of the layout analysis where possible.
So the bug source is unfixable for me, but a workaround is possible. I don't think workarounds should be part of Tesseract, so I will do this outside of Tesseract after getting the necessary information via the API.
At the moment I'm planning to create this workaround function only for my own project. But I'm willing to integrate it into the Tesseract command line tool if someone is interested in using it.
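The matching step I have in mind could look roughly like this: take each (possibly wrong) LSTM word box and replace it with the layout-analysis word box that overlaps it most horizontally. A minimal self-contained sketch of that idea (Box, Overlap, and SnapToLayout are illustrative names, not Tesseract API):

```cpp
#include <algorithm>
#include <vector>

// Horizontal extent of a word box.
struct Box { int left, right; };

// Width of the horizontal overlap between two boxes (0 if disjoint).
int Overlap(const Box& a, const Box& b) {
  int o = std::min(a.right, b.right) - std::max(a.left, b.left);
  return o > 0 ? o : 0;
}

// Replace an (unreliable) LSTM word box with the layout-analysis box
// that overlaps it most; keep the LSTM box if nothing overlaps.
Box SnapToLayout(const Box& lstm_box, const std::vector<Box>& layout_boxes) {
  int best = -1, best_overlap = 0;
  for (size_t i = 0; i < layout_boxes.size(); ++i) {
    int o = Overlap(lstm_box, layout_boxes[i]);
    if (o > best_overlap) {
      best_overlap = o;
      best = static_cast<int>(i);
    }
  }
  return best >= 0 ? layout_boxes[best] : lstm_box;
}
```

In a real implementation the layout boxes would come from Tesseract's layout analysis via the API, and one would also have to handle words that the layout analysis splits or merges differently than the LSTM output.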
I will write a little function, that first uses tesseracts layout analysis to obtain word (and probably character) positions and then overwrite the LSTM results with the results of the layout analysis when possible.
@Sintun I think this function will be useful for many Tesseract users, especially those who have been relying on the character coordinates in older versions. It would be great if you integrated it into Tesseract.
I am wondering if what I see in this screenshot is related to the same problem. The text is correctly extracted but the bounding rectangles are wrong for a couple of words.
Hi @FarhadKhalafi, yes, that is this bug. I have a working workaround (for my own project) that reliably restores the word boxes and gives some reasonable results for character positions. It should also work for #1192. I'm planning to integrate it into Tesseract within the next 2-3 weeks.
Hi @Sintun, thanks for the update. An accurate location of at least the words is often critical. I use the OCR text for table extraction, and this kind of shift will break my algorithms. If you are willing to share your fix, I might be able to assist, or at least test.
Testing specifically would be greatly appreciated :) Sharing the workaround code would be of little value, because in my project it solves the problem after the Tesseract API calls (even after the Tess object's destruction). I will refactor it so that it works within Tesseract.
Even sharing that might help someone, Sintun. A person on my team is writing such an after-API fix in Python as well (it corrects the hOCR). I'll share that too when it's done.
Hey there, I just started to write my workaround and looked at the state of affairs. I realized that Ray has created a native workaround using the historically grown Tesseract objects. From my perspective this is the best case, because I wouldn't have had the time to do this in the Tesseract framework. A closer look reveals that it is similar to my external approach, so I'm stopping my work on this. As can be seen in #1192, it should find its way into the git repo within the next few days.
Please test current code.
I tested the new code for accurate placement of bounding rectangles for recognized words. The code works much better in identifying the left and right edges of words. However, I did observe certain anomalies with the vertical placement of bounding rectangles. I have attached a ZIP of two TIFF files. The file named 85754_full.tif shows the vertical placement problem:
If I crop this image and zoom on the problem area, then the anomaly disappears (see 85754_crop.tif):
My guess is that the surrounding boxes influence the vertical placement when the text lines are detected.
Environment
Tesseract Version:
tesseract --version
tesseract 4.0.0-beta.1
leptonica-1.76.0 (Jun 26 2018, 18:21:40) [MSC v.1900 LIB Release x86]
libgif 5.1.4 : libjpeg 9b : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX
Found SSE
(says beta.1, but beta.3 seems to be correct)
We've integrated the engine in our (closed-source) application using the C API, so I cannot share the actual code.
What I do is basically just iterate over the result iterator using RIL_WORD, get the text, bounding box, and baseline for each word, and then create a PDF out of it, with the recognized text as a red overlay and the bounding boxes drawn in green for better visibility. But the official PDF output config has the same flaw; it's just more difficult to spot:
produces the following PDF: andromeda.tess4cli.pdf
Files:
Behavior can be reproduced using the following PNG (which is a part of a bigger file):
Current Behavior:
The coordinates are correct for most words. But for some words, there seems to be an error in computing the boundaries between them. Curiously, the wrong boundary is always before the last character of the previous word. It almost looks like some kind of off-by-one error.
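To illustrate the kind of off-by-one suspected here (purely hypothetical, not Tesseract's actual code): if word boxes are assembled from per-glyph boxes and the index of a word's last glyph is off by one, the word boundary lands just before the previous word's last character.

```cpp
#include <vector>

// Given the right edge of each glyph and each word's end index
// (exclusive), compute the right edge of each word's box. The buggy
// variant stops one glyph too early, so the boundary falls before the
// word's last character.
std::vector<int> WordRightEdges(const std::vector<int>& glyph_right,
                                const std::vector<int>& word_end,
                                bool off_by_one) {
  std::vector<int> rights;
  for (int end : word_end) {
    int idx = off_by_one ? end - 2 : end - 1;  // buggy variant drops a glyph
    rights.push_back(glyph_right[idx]);
  }
  return rights;
}
```

With glyph right edges {10, 20, 30, 40, 50} and words ending at indices {3, 5}, the correct word edges are {30, 50}, while the off-by-one variant produces {20, 40}, i.e. the boundary sits before the last character of each word, matching the behavior described above.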
Expected Behavior:
Result with tesseract 3: