tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Baselines Incorrect Using Certain PSMs (`PageIterator::Baseline`) #4304

Closed Balearica closed 1 month ago

Balearica commented 1 month ago

Overview

With certain page segmentation modes (PSMs) and certain inputs, the PageIterator::Baseline function returns incorrect results because of a bug in how it obtains the line's bounding box. I noticed this when using psm 8 (single word). This affects API users who request a line's baseline, and it also causes incorrect results in CLI output formats that report the baseline (.hocr).

Reproducible Example

While this is most noticeable using it->Baseline through the API, the phenomenon can be demonstrated using the CLI with the example image below.

[Image: simple_c2]

The word in the image is recognized correctly, with the same bounding box, whether psm is set to 6 (single block) or 8 (single word). However, with psm 8 the baseline is not calculated correctly.

When setting psm to 6, the baseline attribute is set to -0.036 0, which is correct.

[Image: simple_c2_line]

tesseract simple_c2.png stdout --oem 0 --psm 6 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.1.0-471-gbc490' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "simple_c2.png"; bbox 0 0 328 194; ppageno 0; scan_res 96 96'>
   <div class='ocr_carea' id='block_1_1' title="bbox 85 84 195 104">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 85 84 195 104">
     <span class='ocr_line' id='line_1_1' title="bbox 85 84 195 104; baseline -0.036 0; x_size 24.310345; x_descenders 5.3103447; x_ascenders 5">
      <span class='ocrx_word' id='word_1_1' title='bbox 85 84 195 104; x_wconf 83'>Tesseract</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

However, when psm is set to 8, the baseline attribute is set to -0 -2.005, which is incorrect.

[Image: simple_c2_line]

tesseract simple_c2.png stdout --oem 0 --psm 8 hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.1.0-471-gbc490' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "simple_c2.png"; bbox 0 0 328 194; ppageno 0; scan_res 96 96'>
   <div class='ocr_carea' id='block_1_1' title="bbox 85 84 195 104">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 85 84 195 104">
     <span class='ocr_line' id='line_1_1' title="bbox 85 84 195 104; baseline -0 -2.005; x_size 24.310345; x_descenders 5.3103447; x_ascenders 5">
      <span class='ocrx_word' id='word_1_1' title='bbox 85 84 195 104; x_wconf 83'>Tesseract</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

Cause

I investigated, and the root cause is that the PageIterator::Baseline function assumes that the line's bounding box has already been calculated. This is not always the case. PageIterator::Baseline gets the line's bounding box using row->bounding_box(), which does not force these values to be calculated; if they were never calculated, it simply returns the default values (32767 for left/bottom and -32767 for right/top).
https://github.com/tesseract-ocr/tesseract/blob/215b023c43f67a52fe4c9f783988503529f5c6dd/src/ccmain/pageiterator.cpp#L534-L542
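
The default-value behavior described above can be illustrated with a small stand-in type. This is a sketch only: SketchBox and its members are hypothetical and are not Tesseract's TBOX. It assumes a box that starts "inverted" (minimum corner at the largest 16-bit value, maximum corner at the smallest) so that the first point added always wins the min/max comparisons, which is why a box that was never filled in reports 32767 and -32767.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for a bounding box with sentinel defaults.
struct SketchBox {
  int16_t left = INT16_MAX, bottom = INT16_MAX;
  int16_t right = -INT16_MAX, top = -INT16_MAX;

  // True while the box has no area, which includes the untouched default.
  bool IsNull() const { return left >= right || top <= bottom; }

  // Grows the box so that it contains the point (x, y).
  void Include(int16_t x, int16_t y) {
    left = std::min(left, x);
    right = std::max(right, x);
    bottom = std::min(bottom, y);
    top = std::max(top, y);
  }
};
```

A caller that reads left and right from such a box without first checking IsNull() gets the sentinels back unchanged, which matches the symptom reported in this issue.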

This can be confirmed by adding tprintf statements within the PageIterator::Baseline function:

bool PageIterator::Baseline(PageIteratorLevel level, int *x1, int *y1, int *x2,
                            int *y2) const {
  if (it_->word() == nullptr) {
    return false; // Already at the end!
  }
  ROW *row = it_->row()->row;
  WERD *word = it_->word()->word;
  TBOX box = (level == RIL_WORD || level == RIL_SYMBOL) ? word->bounding_box()
                                                        : row->bounding_box();
  tprintf("Box: %d,%d -> %d,%d\n", box.left(), box.bottom(), box.right(),
          box.top());
  int left = box.left();
  ICOORD startpt(left, static_cast<int16_t>(row->base_line(left) + 0.5));
  int right = box.right();
  ICOORD endpt(right, static_cast<int16_t>(row->base_line(right) + 0.5));
  // Rotate to image coordinates and convert to global image coords.
  startpt.rotate(it_->block()->block->re_rotation());
  endpt.rotate(it_->block()->block->re_rotation());
  *x1 = startpt.x() / scale_ + rect_left_;
  *y1 = (rect_height_ - startpt.y()) / scale_ + rect_top_;
  *x2 = endpt.x() / scale_ + rect_left_;
  *y2 = (rect_height_ - endpt.y()) / scale_ + rect_top_;
  tprintf("Baseline: (%d,%d)->(%d,%d)\n", *x1, *y1, *x2, *y2);
  return true;
}

When run with psm set to 8, this produces the following:

Box: 32767,32767 -> -32767,-32767
Baseline: (32767,-990)->(-32767,1342)
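
As a quick sanity check on these numbers, the sketch below computes the slope of the line through the two printed endpoints. SlopeThrough and Round3 are hypothetical helper names. This is plain arithmetic on the printed values, not a description of how the hOCR renderer derives its output.

```cpp
#include <cassert>
#include <cmath>

// Slope of the line through (x1, y1) and (x2, y2); requires x1 != x2.
double SlopeThrough(int x1, int y1, int x2, int y2) {
  assert(x1 != x2);
  return static_cast<double>(y2 - y1) / (x2 - x1);
}

// Rounds to three decimals, the precision used in the hOCR output above.
double Round3(double v) { return std::round(v * 1000.0) / 1000.0; }
```

SlopeThrough(32767, -990, -32767, 1342) is 2332 / -65534, roughly -0.0356, which Round3 turns into -0.036. That is the slope reported for psm 6, which would be consistent with the fitted baseline itself being intact and only the box-derived endpoints being wrong. The issue does not confirm this, so treat it as an observation.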

Potential Fixes

I see three potential approaches to fixing this:

  1. Add an ad-hoc check within PageIterator::Baseline for whether the default value is being returned, and if it is, calculate the actual bounding box.
    • This would fix the issue, but it may leave other bugs outstanding that are caused by the bounding box never being calculated.
  2. Modify the row->bounding_box() function to calculate the bounding box if it has never been calculated before.
  3. Figure out why row bounding boxes are not calculated with specific psm settings, and change the code so they are calculated when the row is created.
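
The first approach can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: Box, Row and RowBoxForBaseline are hypothetical names, and the sketch assumes the row's box can be rebuilt as the union of its word boxes.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical box type; the default state is the "never calculated" sentinel.
struct Box {
  int16_t left = INT16_MAX, bottom = INT16_MAX;
  int16_t right = -INT16_MAX, top = -INT16_MAX;

  Box() = default;
  Box(int16_t l, int16_t b, int16_t r, int16_t t)
      : left(l), bottom(b), right(r), top(t) {}

  bool IsNull() const { return left >= right || top <= bottom; }

  // Smallest box that contains both this box and other.
  Box Union(const Box &other) const {
    return Box(std::min(left, other.left), std::min(bottom, other.bottom),
               std::max(right, other.right), std::max(top, other.top));
  }
};

// Hypothetical row: a cached box that may never have been filled in,
// plus the boxes of the words on the row.
struct Row {
  Box cached_box;
  std::vector<Box> word_boxes;
};

// Returns the cached row box if it was calculated; otherwise rebuilds it
// from the word boxes (the ad-hoc check from approach 1).
Box RowBoxForBaseline(const Row &row) {
  if (!row.cached_box.IsNull()) {
    return row.cached_box;
  }
  Box merged;
  for (const Box &word : row.word_boxes) {
    merged = merged.Union(word);
  }
  return merged;
}
```

The second approach would move the same check into the box accessor itself, so every caller benefits rather than only the baseline code.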

Environment

Ubuntu 22.04 Jammy

tesseract 5.1.0-471-gbc490
 leptonica-1.82.0
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.1) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1

Balearica commented 1 month ago

I investigated further and actually believe this has a simple and obvious fix. Will open a PR shortly.