Overview
With certain page segmentation modes (PSMs) and certain inputs, the PageIterator::Baseline function returns incorrect results because of a bug in how it obtains line bounding boxes. I noticed this when using psm 8 (single word). This affects API users trying to get a line's baseline, and it also causes incorrect results in CLI output formats that report a baseline (.hocr).
Reproducible Example
While this is most noticeable using it->Baseline through the API, the phenomenon can be demonstrated using the CLI with the example image below.
The word in the image is recognized correctly--including having the same bounding box--whether psm is set to 6 (single block) or 8 (single word). However, the latter does not calculate the baseline correctly.
When setting psm to 6, the baseline attribute is set to -0.036 0, which is correct. However, when setting psm to 8, the baseline attribute is set to -0 -2.005, which is incorrect.
Cause
I investigated, and the root cause is that the PageIterator::Baseline function assumes the line's bounding box has already been calculated, which is not always the case. PageIterator::Baseline gets the line's bounding box using row->bounding_box(), which does not force these values to be calculated; it simply returns the default values (-32767 or 32767) if they were not calculated already.
https://github.com/tesseract-ocr/tesseract/blob/215b023c43f67a52fe4c9f783988503529f5c6dd/src/ccmain/pageiterator.cpp#L534-L542
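To make the failure mode concrete, here is a minimal standalone sketch of what happens when a cached box is read before it was ever computed. Box, Row, Segment, and BaselineEndpoints are hypothetical stand-ins written for this illustration, not Tesseract's TBOX, ROW, or PageIterator; the default values are the ones named above, and the flat baseline at y = 100 is an arbitrary assumption.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for a bounding box; this is NOT Tesseract's TBOX.
// The defaults mirror the values named in the issue: an "inside-out" box.
struct Box {
  int16_t left = std::numeric_limits<int16_t>::max();    //  32767
  int16_t bottom = std::numeric_limits<int16_t>::max();  //  32767
  int16_t right = -std::numeric_limits<int16_t>::max();  // -32767
  int16_t top = -std::numeric_limits<int16_t>::max();    // -32767

  // True while the box still holds its inside-out defaults.
  bool IsUnset() const { return left > right || bottom > top; }
};

// Hypothetical stand-in for a text row that caches its bounding box.
struct Row {
  Box cached_box;         // stays at the defaults until someone computes it
  double slope = 0.0;     // assumed straight baseline: y = slope * x + offset
  double offset = 100.0;  // arbitrary value for the illustration

  // Plain getter: returns whatever is cached and computes nothing.
  Box bounding_box() const { return cached_box; }
  double base_line(double x) const { return slope * x + offset; }
};

struct Segment {
  int x1, y1, x2, y2;
};

// Same shape as the endpoint computation in the issue, minus rotation and
// scaling: the baseline is sampled at the box's left and right edges.
Segment BaselineEndpoints(const Row &row) {
  Box box = row.bounding_box();
  int left = box.left;
  int right = box.right;
  return Segment{left, static_cast<int16_t>(row.base_line(left) + 0.5),
                 right, static_cast<int16_t>(row.base_line(right) + 0.5)};
}
```

With a default-constructed Row, the endpoints come out at x = 32767 and x = -32767, so the segment is reversed and lies far outside any real image. Under these assumptions, that is the kind of input that would distort a baseline derived from it.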
This can be confirmed by adding tprintf statements within the PageIterator::Baseline function:
bool PageIterator::Baseline(PageIteratorLevel level, int *x1, int *y1, int *x2,
                            int *y2) const {
  if (it_->word() == nullptr) {
    return false; // Already at the end!
  }
  ROW *row = it_->row()->row;
  WERD *word = it_->word()->word;
  TBOX box = (level == RIL_WORD || level == RIL_SYMBOL) ? word->bounding_box()
                                                        : row->bounding_box();
  tprintf("Box: %d,%d -> %d,%d\n", box.left(), box.bottom(), box.right(),
          box.top());
  int left = box.left();
  ICOORD startpt(left, static_cast<int16_t>(row->base_line(left) + 0.5));
  int right = box.right();
  ICOORD endpt(right, static_cast<int16_t>(row->base_line(right) + 0.5));
  // Rotate to image coordinates and convert to global image coords.
  startpt.rotate(it_->block()->block->re_rotation());
  endpt.rotate(it_->block()->block->re_rotation());
  *x1 = startpt.x() / scale_ + rect_left_;
  *y1 = (rect_height_ - startpt.y()) / scale_ + rect_top_;
  *x2 = endpt.x() / scale_ + rect_left_;
  *y2 = (rect_height_ - endpt.y()) / scale_ + rect_top_;
  tprintf("Baseline: (%d,%d)->(%d,%d)\n", *x1, *y1, *x2, *y2);
  return true;
}
When run with psm set to 8 this produces the following:
Potential Fixes
I think there are 3 potential approaches for fixing:
1. Add a check in PageIterator::Baseline for whether the default value is being returned, and if it is, calculate the actual bounding box.
2. Edit the row->bounding_box() function to calculate the bounding box if it has never been calculated before.
3. Investigate why the bounding boxes are not being calculated for certain psm settings, and edit so they are being calculated upon creation.
Environment
Ubuntu 22.04 Jammy
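As a rough illustration of the first approach under Potential Fixes (detect the default value and compute the box on demand), here is a standalone sketch. Box, Row, word_boxes, and RowBoxOrRecalc are hypothetical names invented for this example and do not correspond to Tesseract's classes. The sketch also assumes a row's box is simply the union of its word boxes, which may not match how Tesseract actually derives it.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a bounding box; this is NOT Tesseract's TBOX.
struct Box {
  int16_t left = std::numeric_limits<int16_t>::max();
  int16_t bottom = std::numeric_limits<int16_t>::max();
  int16_t right = -std::numeric_limits<int16_t>::max();
  int16_t top = -std::numeric_limits<int16_t>::max();

  Box() = default;
  Box(int16_t l, int16_t b, int16_t r, int16_t t)
      : left(l), bottom(b), right(r), top(t) {}

  // True while the box still holds its inside-out defaults.
  bool IsUnset() const { return left > right || bottom > top; }

  // Grow this box so that it also covers `other`.
  void Include(const Box &other) {
    left = std::min(left, other.left);
    bottom = std::min(bottom, other.bottom);
    right = std::max(right, other.right);
    top = std::max(top, other.top);
  }
};

// Hypothetical stand-in for a row: its words plus a cached row-level box.
struct Row {
  std::vector<Box> word_boxes;
  Box cached_box;  // may never have been computed
  Box bounding_box() const { return cached_box; }
};

// Approach 1: the caller checks for the default value and, if it finds it,
// builds the box from the words instead of trusting the cache.
Box RowBoxOrRecalc(const Row &row) {
  Box box = row.bounding_box();
  if (!box.IsUnset()) {
    return box;  // cache was valid, use it as-is
  }
  Box merged;  // starts inside-out, so the first Include() adopts that word
  for (const Box &word : row.word_boxes) {
    merged.Include(word);
  }
  return merged;  // still unset if the row has no words
}
```

The second approach would move the same check inside the getter, so every caller benefits. The third would remove the need for the check by making sure the box is computed when the row is created.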