Closed vidiecan closed 5 years ago
Do you have a fix that will work well, even in the presence of ligatures and noise? Also, it should work for all supported langs.
@amitdo I do not, I would have sent a PR if I had :)
Is this issue fixed in 4.1/current code?
@noahmetzger, do you get better character boxes for this example with your latest code?
I think so. Will test this later.
I would also be curious whether this is fixed. Working with v4.0, there is often trouble with wider characters, which then causes an offset in all following character bounding boxes. Any news on the above?
I tried 4.1.0-rc1 at the time, with slightly different but not fixed results. Ok so 4.1.0 is a definite release now, great thanks! I will give the Windows installer a try in combination with the tesserocr Python package if possible.
This should be fixed by pull request #2576.
tesseract wrong.png - -l eng --tessdata-dir ~/tessdata_fast --oem 1 --psm 6 makebox
1 32 15 40 50 0
2 113 16 136 51 0
9 139 15 162 51 0
3 167 15 186 51 0
. 193 16 200 21 0
0 206 15 227 51 0
0 232 15 254 51 0
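To check output like the above mechanically, here is a minimal sketch that parses the box lines and flags neighbouring characters that share a left edge or overlap horizontally. It assumes the usual box-file layout (`<char> <left> <bottom> <right> <top> <page>`) and a single line of left-to-right text; `parse_box_lines` and `suspicious_pairs` are hypothetical helper names, not part of Tesseract.

```python
from typing import List, NamedTuple, Tuple


class CharBox(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[CharBox]:
    """Parse makebox output: '<char> <left> <bottom> <right> <top> <page>'."""
    boxes = []
    for line in text.splitlines():
        # Split from the right so the glyph itself is never split.
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        char, *nums = parts
        boxes.append(CharBox(char, *map(int, nums)))
    return boxes


def suspicious_pairs(boxes: List[CharBox]) -> List[Tuple[CharBox, CharBox]]:
    """Return consecutive pairs that share a left edge or overlap horizontally.

    Assumption: one text line, left to right, so boxes should not overlap.
    """
    return [
        (a, b)
        for a, b in zip(boxes, boxes[1:])
        if b.left == a.left or b.left < a.right
    ]


BOX_OUTPUT = """\
1 32 15 40 50 0
2 113 16 136 51 0
9 139 15 162 51 0
3 167 15 186 51 0
. 193 16 200 21 0
0 206 15 227 51 0
0 232 15 254 51 0
"""

boxes = parse_box_lines(BOX_OUTPUT)
print(suspicious_pairs(boxes))  # → []
```

For the output posted here the list is empty, which matches the claim that the boxes are now sane for this image.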
tesseract -v
tesseract 5.0.0-alpha-322-g74ac
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
Issue is fixed in master branch.
@zdenop Please close.
I am not sure if the following observation is covered by this ticket, but maybe it needs to be reopened. We had several issues with bounding box coordinates and tried makebox and hocr with the executable, and even at the API level. We have now tested the often used "thousand Billion" image on the latest 5.0.0-alpha version on Windows to check whether it was fixed, but again got the same starting coordinates for some letters:
<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span>
<span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>
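A quick way to spot this pattern across many files is to extract the `x_bboxes` values and compare the left edges of neighbouring characters. The sketch below assumes the spans look exactly like the two above (single-quoted attributes, `x_bboxes` followed by `x_conf`) and that the four numbers are left, top, right, bottom; `parse_cinfo` and `shared_left_edges` are hypothetical names used only for illustration.

```python
import re
from typing import List, Tuple

# Matches the ocrx_cinfo spans in the form quoted in this comment.
CINFO_RE = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)

CharInfo = Tuple[str, Tuple[int, int, int, int], float]


def parse_cinfo(hocr: str) -> List[CharInfo]:
    """Return (char, (left, top, right, bottom), confidence) per span."""
    chars = []
    for m in CINFO_RE.finditer(hocr):
        left, top, right, bottom = (int(m.group(i)) for i in range(1, 5))
        chars.append((m.group(6), (left, top, right, bottom), float(m.group(5))))
    return chars


def shared_left_edges(chars: List[CharInfo]) -> List[Tuple[str, str, int]]:
    """Return (first_char, second_char, left) for neighbours with equal left edges."""
    return [
        (a[0], b[0], a[1][0])
        for a, b in zip(chars, chars[1:])
        if a[1][0] == b[1][0]
    ]


HOCR = """\
<span class='ocrx_cinfo' title='x_bboxes 210 22 218 52; x_conf 99.543304'>B</span>
<span class='ocrx_cinfo' title='x_bboxes 210 23 234 52; x_conf 99.536743'>i</span>
"""

print(shared_left_edges(parse_cinfo(HOCR)))  # → [('B', 'i', 210)]
```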
We have this issue on all of our example files and on all 4.* versions of tesseract, too.
Thanks for all your work and have a great Christmas time!
I am posting it here because it might be related to #1712 and #1192. It is debatable what the solution should be, because some might prefer the current behaviour. Can be closed any time.
Behaviour: doing OCR using LSTM with a specific model returns an invalid character bounding box when calling `PageIterator::BoundingBox`.
Reproducibility: very likely not reproducible, because the model is missing.
Input image (without the coloured lines):
Before LSTM, fake words are created that contain correct blobs based on outlines. After LSTM, in `RecodeBeamSearch::InitializeWord`, blobs are computed here https://github.com/tesseract-ocr/tesseract/blob/9c2d1aad966eca8af7b615ba181eb3ea50e20576/src/lstm/recodebeam.cpp#L432 using x positions based on the timestep. Now, the start x position where the model started to recognise `2` (plus an estimated window) is the start of the green line in the picture. It can be argued that it starts too early and stops too soon (very roughly, before seeing the end of `2`). Moreover, the computed width span is reduced here https://github.com/tesseract-ocr/tesseract/blob/9c2d1aad966eca8af7b615ba181eb3ea50e20576/src/lstm/recodebeam.cpp#L435
Then, when constructing the symbol bounding box in `PAGE_RES_IT::ReplaceCurrentWord`, this condition is not met https://github.com/tesseract-ocr/tesseract/blob/5fdaa479da2c52526dac1281871db5c4bdaff359/src/ccstruct/pageres.cpp#L1384 because the computed `end_x` is too far to the left. For the record, the ends are computed at https://github.com/tesseract-ocr/tesseract/blob/5fdaa479da2c52526dac1281871db5c4bdaff359/src/ccstruct/pageres.cpp#L1313 The result is that the bbox is "uninitialised" (max int values for left, -max int for right).
The code that should handle this situation is below at https://github.com/tesseract-ocr/tesseract/blob/5fdaa479da2c52526dac1281871db5c4bdaff359/src/ccstruct/pageres.cpp#L1407 However, it fails too because the cblobs are further to the right than the wrongly computed blob end.
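To make the failure mode concrete, here is a toy model in Python. It is not Tesseract's code. It assumes, for illustration only, that a blob is handed to a character when the blob's horizontal midpoint lies left of that character's computed `end_x`, and that an empty box is represented as left = max int, right = -max int. The name `assign_blobs` and the pixel values are made up.

```python
from typing import List, Tuple

INT32_MAX = 2**31 - 1
# Empty-box convention as described in this report: left = max int, right = -max int.
EMPTY_SPAN = (INT32_MAX, -INT32_MAX)

Span = Tuple[int, int]  # (left, right) in pixels


def assign_blobs(blobs: List[Span], char_ends: List[int]) -> List[Span]:
    """Union, per character, the blobs whose midpoint lies left of its end_x.

    Blobs are consumed left to right. A character that receives no blob
    keeps EMPTY_SPAN, i.e. its box stays "uninitialised".
    """
    boxes = []
    i = 0
    for end_x in char_ends:
        left, right = EMPTY_SPAN
        while i < len(blobs) and (blobs[i][0] + blobs[i][1]) / 2 < end_x:
            left = min(left, blobs[i][0])
            right = max(right, blobs[i][1])
            i += 1
        boxes.append((left, right))
    return boxes


# Two glyphs at 10..30 and 40..60. The second end_x (38) is too far left,
# so the second blob (midpoint 50) is never assigned to the second character.
print(assign_blobs([(10, 30), (40, 60)], [35, 38]))
# → [(10, 30), (2147483647, -2147483647)]

# With a plausible end_x for the second character, both boxes are filled.
print(assign_blobs([(10, 30), (40, 60)], [35, 65]))
# → [(10, 30), (40, 60)]
```

The toy model only shows why an `end_x` that lands before the glyph leaves the box empty; the real condition in `pageres.cpp` should be read from the linked lines.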
The uninitialised bounding box gets "fixed" when calling `BoundingBox` at https://github.com/tesseract-ocr/tesseract/blob/5fdaa479da2c52526dac1281871db5c4bdaff359/src/ccmain/pageiterator.cpp#L313 which sets the coordinates to 0, 0, image height/width.
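Until this is resolved, a caller can at least detect the symptom. The sketch below treats a symbol box that spans the whole image as suspect. It assumes boxes arrive as `(left, top, right, bottom)` tuples and that the fallback box is exactly `(0, 0, image_width, image_height)`; both the tuple order and the helper names are assumptions for illustration, not part of any Tesseract binding.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed order: (left, top, right, bottom)


def is_full_image_box(box: Box, image_width: int, image_height: int) -> bool:
    """True if the box covers the whole image, i.e. looks like the fallback."""
    return box == (0, 0, image_width, image_height)


def split_suspect_boxes(
    symbols: List[Tuple[str, Box]], image_width: int, image_height: int
) -> Tuple[List[Tuple[str, Box]], List[Tuple[str, Box]]]:
    """Split (char, box) pairs into (trusted, suspect) lists."""
    trusted, suspect = [], []
    for char, box in symbols:
        target = suspect if is_full_image_box(box, image_width, image_height) else trusted
        target.append((char, box))
    return trusted, suspect


# Made-up example: a 300x60 image where the box for '2' fell back to the full image.
symbols = [("1", (32, 10, 40, 45)), ("2", (0, 0, 300, 60))]
trusted, suspect = split_suspect_boxes(symbols, 300, 60)
print(suspect)  # → [('2', (0, 0, 300, 60))]
```

A single symbol that legitimately fills the entire image would also be flagged, so the check is only a heuristic.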