tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Integer division by zero in TabVector::Evaluate #3760

Open nullpointersetc opened 2 years ago

nullpointersetc commented 2 years ago

Environment

Current Behavior:

On a certain image, an integer division-by-zero exception occurs and the OCR program using Tesseract as a library is terminated.

We have determined that the problem is in method TabVector::Evaluate in src/textord/tabvector.cpp, and specifically in this section of the code:

  // If there has been a good box, adjust the end.
  if (prev_good_box != nullptr) {
    SetYEnd(prev_good_box->top());
    // Compute the percentage of the vector that is occupied by good boxes.
    int length = endpt_.y() - startpt_.y();
    percent_score_ = 100 * good_length / length;
    if (num_deleted_boxes > 0) {
      needs_refit_ = true;
      FitAndEvaluateIfNeeded(vertical, finder);
      if (boxes_.empty())
        return;
    }
    ...
}

Nothing checks that length is non-zero (i.e., that endpt_.y() does not equal startpt_.y()) before the assignment to percent_score_.

Expected Behavior:

The integer division is not attempted and the process does not abort.

Suggested Fix:

    percent_score_ = length == 0 ? 0 : 100 * good_length / length;
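
The effect of this guard can be checked outside Tesseract. Below is a minimal sketch, assuming a hypothetical free function `PercentScore` that mirrors only the arithmetic in the snippet above; it is not a Tesseract API and does not show why length could be zero.

```cpp
#include <cassert>

// Hypothetical stand-alone helper (not part of Tesseract).
// Mirrors the guarded expression from the suggested fix:
//   percent_score_ = length == 0 ? 0 : 100 * good_length / length;
int PercentScore(int good_length, int length) {
  assert(good_length >= 0);  // assumption of this sketch only
  if (length == 0) {
    return 0;  // avoid the integer division by zero
  }
  return 100 * good_length / length;  // integer division truncates
}
```

With this helper, `PercentScore(10, 0)` returns 0 instead of dividing, and `PercentScore(50, 100)` returns 50.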

stweil commented 2 years ago

Could you please provide an image which triggers this division by zero?

It does not make sense to simply add a check for the division. First we have to analyse why this function is called with endpt_.y() == startpt_.y() (so it is a point, not a vector).

wollmers commented 2 years ago

> Could you please provide an image which triggers this division by zero?
>
> It does not make sense to simply add a check for the division. First we have to analyse why this function is called with endpt_.y() == startpt_.y() (so it is a point, not a vector).

In case of a point the length should be 1.
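
This alternative (treat a point as having length 1) can be sketched as follows. The function name `PercentScoreClamped` and its parameters are hypothetical and only illustrate the arithmetic; nothing here is taken from the Tesseract sources.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not part of Tesseract): a degenerate vector whose
// start and end y-coordinates coincide is treated as having length 1.
int PercentScoreClamped(int good_length, int start_y, int end_y) {
  assert(good_length >= 0);  // assumption of this sketch only
  const int length = std::max(1, end_y - start_y);
  return 100 * good_length / length;
}
```

One consequence of clamping: if good_length is larger than the clamped length, the result exceeds 100 (for example, good_length 5 with a point gives 500). Whether that combination can occur in TabVector::Evaluate is not established in this thread.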

nullpointersetc commented 2 years ago

I currently don't have an image that I can give you.

stweil commented 2 years ago

@nullpointersetc, maybe some part of an image which can be published is sufficient to trigger the issue, or you can send me a confidential image by e-mail. I am afraid that we will have to close the issue without a fix if there is no test case.

nullpointersetc commented 2 years ago

I don't know how to construct such an image.

For example, if I try to construct such an image with this text: Fury_Road_2

I get back that the image is 2548 x 3298 (the original image was 1019 x 1319 at 120 DPI, which may explain the difference).

TabVector::Evaluate is called only four times for this image. At the if statement I indicated, the values are:

  1. startpt={xcoord=234, ycoord=971}, endpt={xcoord=234, ycoord=3026}, prev_good_box={bot_left={xcoord=237, ycoord=2994}, top_right={xcoord=272, ycoord=3026}}

  2. startpt={xcoord=2270, ycoord=961}, endpt={xcoord=2270, ycoord=3016}, prev_good_box={bot_left={xcoord=2147, ycoord=2994}, top_right={xcoord=2167, ycoord=3016}}

  3. startpt={xcoord=234, ycoord=971}, endpt={xcoord=234, ycoord=3026}, prev_good_box={bot_left={xcoord=237, ycoord=2994}, top_right={xcoord=272, ycoord=3026}}

  4. startpt={xcoord=2270, ycoord=961}, endpt={xcoord=2270, ycoord=3016}, prev_good_box={bot_left={xcoord=2147, ycoord=2994}, top_right={xcoord=2167, ycoord=3016}}

I DO NOT know how to interpret these numbers. I would have assumed that these are numbers of pixels from the top-left of the image, but the startpt and endpt values all seem to refer to a vertical region of the screen that is one pixel wide and consists only of white pixels, while the good boxes also appear to be all white pixels. Am I on the right path in trying to come up with an image?
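
For reference, the y-coordinates listed above can be plugged into the length computation from the quoted snippet. The following is a minimal sketch with hypothetical names (`Sample`, `kReported`, `VectorLength`, `AnyZeroLength`); only the numbers come from the four calls reported above.

```cpp
#include <array>
#include <cassert>

// Hypothetical record of one reported call: y of startpt, y of endpt,
// and the top of prev_good_box.
struct Sample {
  int start_y;
  int end_y;
  int box_top;
};

// Values copied from the four reported calls.
constexpr std::array<Sample, 4> kReported = {{
    {971, 3026, 3026},
    {961, 3016, 3016},
    {971, 3026, 3026},
    {961, 3016, 3016},
}};

// Same expression as in the snippet: endpt_.y() - startpt_.y().
int VectorLength(const Sample& s) {
  assert(s.end_y == s.box_top);  // holds for all four reported calls
  return s.end_y - s.start_y;
}

// True if any reported call would divide by zero.
bool AnyZeroLength() {
  for (const Sample& s : kReported) {
    if (VectorLength(s) == 0) return true;
  }
  return false;
}
```

Each of the four calls gives a length of 2055 (3026 - 971 and 3016 - 961), so this particular image does not exercise the zero-length case. In every call the endpt y-coordinate equals the top of prev_good_box, which matches the SetYEnd(prev_good_box->top()) call in the snippet.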

amitdo commented 2 years ago

Did you try to use version 5.1.0 with the same image?

amitdo commented 2 years ago

In the beginning of this method length == 0 is checked as part of a condition.

https://github.com/tesseract-ocr/tesseract/blob/76faf1600643f45f22555dcbc5d39e93f96825d6/src/textord/tabvector.cpp#L580-L589
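
One possible reading, not confirmed in this thread: a check at the start of the method covers only the endpoints as they are on entry, while the snippet quoted in the report calls SetYEnd(prev_good_box->top()) before computing length again. The toy model below illustrates that general pattern only; `ToyVector` and `LengthBecomesZero` are hypothetical and do not reproduce Tesseract's code, and whether a good box can actually end at the start y-coordinate is an open question.

```cpp
#include <cassert>

// Toy stand-in for a vertical vector; not Tesseract's TabVector.
struct ToyVector {
  int start_y;
  int end_y;
};

// Returns true when the entry check passes but the length recomputed
// after moving the end point is zero.
bool LengthBecomesZero(ToyVector v, int last_good_box_top) {
  assert(v.start_y <= v.end_y);  // assumption of this toy model only
  if (v.end_y - v.start_y == 0) {
    return false;  // the entry check would already have caught this
  }
  v.end_y = last_good_box_top;  // stands in for SetYEnd(prev_good_box->top())
  return v.end_y - v.start_y == 0;
}
```

In this model, `LengthBecomesZero({971, 3026}, 971)` is true, while the values reported earlier in the thread (box top 3026) give false.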

stweil commented 2 years ago

I don't expect that 5.1.0 or our latest code fixed this issue. @nullpointersetc, it would really help if you could provide an image which triggers the bug. You can send it to my personal e-mail address, and I will keep it private.

stweil commented 2 years ago

@nullpointersetc, it would also be interesting to know whether the same bug occurs on Linux or macOS. Could you please test it (that's also possible on Windows with WSL)?