tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Iterator and basic example results are very different. #2881

Open korhun opened 4 years ago

korhun commented 4 years ago

I get the following results with the latest code. For example, GetText1 finds "EVENING NEWS", which GetText2 misses, and GetText2 finds "NORAH O’DONNELL", which GetText1 misses.

Is this normal? Is there a way to get all of the words that were found?

GetText1 result: pr vw MD e ir, “a... a > | al \ (spray ga ğa - ef 2 ie ma 04 - ml is ve q . > e > > = | 2 “ 3 YE e eae | 8 o r ii b Ns a Ş = İğ a ay to a . | ’ E He Pe / ene a 200 MILLION AMERICANS IN PATH OF POWERFUL WINTER STORM (EN ©CBS EVENING NEWS A

GetText2 result: e ga)» “©. SATELLİTE- RADAR LOOP & a. Mim e | + a, a a, 200 MILLION AMERICANS IN PATH OF POWERFUL WINTER STORM | ~~ > NIN wt NORAH O’DONNELL
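To see exactly which words each approach finds that the other misses, the two result strings can be compared token by token. The following is a minimal sketch, not part of the original report: `SplitWords` and `WordsOnlyIn` are hypothetical helper names, and the comparison assumes that plain whitespace splitting and exact string matching are good enough for this purpose.

```cpp
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split text on whitespace into tokens.
static std::vector<std::string> SplitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w; in >> w;) {
        words.push_back(w);
    }
    return words;
}

// Return the words of `a` that never occur in `b`, in order of first
// appearance in `a`, each reported once. (Hypothetical helper.)
std::vector<std::string> WordsOnlyIn(const std::string& a, const std::string& b) {
    const std::vector<std::string> b_words = SplitWords(b);
    const std::set<std::string> seen_in_b(b_words.begin(), b_words.end());
    std::set<std::string> emitted;
    std::vector<std::string> out;
    for (const std::string& w : SplitWords(a)) {
        if (seen_in_b.count(w) == 0 && emitted.insert(w).second) {
            out.push_back(w);
        }
    }
    return out;
}
```

Calling `WordsOnlyIn(result1, result2)` and then `WordsOnlyIn(result2, result1)` lists the extras on each side; appending the second list to the first result would give a rough union of all the words found.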

(The attached file is a TIFF. I had to change its extension to .jpeg in order to upload it.) sample

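  #include <string>
  #include <leptonica/allheaders.h>
  #include <tesseract/baseapi.h>

  // `api` is assumed to be an already initialized tesseract::TessBaseAPI*.

  // Whole-page recognition. The caller owns the returned buffer and must delete[] it.
  const char* GetText1(const char* input) {
      Pix* image = pixRead(input);
      if (image == NULL) return NULL;
      api->SetImage(image);
      char* outText = api->GetUTF8Text();
      pixDestroy(&image);
      return outText;
  }

  // Line-by-line recognition: finds the text-line boxes, then recognizes each box on its own.
  // Returns std::string by value; returning outTextStr.c_str() would leave a dangling pointer.
  std::string GetText2(const char* input) {
      std::string outTextStr;
      Pix* image = pixRead(input);
      if (image == NULL) return outTextStr;
      api->SetImage(image);
      Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
      if (boxes != NULL) {
          const int n = boxaGetCount(boxes);
          for (int i = 0; i < n; i++) {
              BOX* box = boxaGetBox(boxes, i, L_CLONE);
              l_int32 x, y, w, h;
              boxGetGeometry(box, &x, &y, &w, &h);
              api->SetRectangle(x, y, w, h);

              const char* c = api->GetUTF8Text();
              if ((c != NULL) && (c[0] != '\0')) {
                  outTextStr += c;
                  outTextStr += " ";
              }
              delete[] c;

              boxDestroy(&box);
          }
          boxaDestroy(&boxes);  // release the box array itself
      }
      pixDestroy(&image);
      return outTextStr;
  }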
korhun commented 4 years ago

sample.zip If you have trouble opening the sample TIFF file, here is a zip file. (It is actually a RAR archive; I changed the extension in order to upload it. Sorry, I don't have WinZip.)

korhun commented 4 years ago

Hey, sorry, the title is wrong. It should say "GetComponentImages Example", not "Iterator Example".

The following function gives exactly the same result (words) as GetText1:

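  // The signature was not shown in the original post; GetText3 is a placeholder name.
  // `api` is assumed to be an already initialized tesseract::TessBaseAPI*.
  std::string GetText3(const char* input) {
      std::string outTextStr;
      Pix* image = pixRead(input);
      if (image == NULL) return outTextStr;
      api->SetImage(image);
      api->Recognize(0);
      tesseract::ResultIterator* ri = api->GetIterator();
      tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
      if (ri != 0) {
          do {
              const char* c = ri->GetUTF8Text(level);
              if ((c != NULL) && (c[0] != '\0')) {
                  outTextStr += c;
                  outTextStr += " ";
              }
              delete[] c;
          } while (ri->Next(level));
          delete ri;  // the caller owns the iterator
      }
      pixDestroy(&image);
      // Return by value; outTextStr.c_str() would dangle after the function returns.
      return outTextStr;
  }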
Shreeshrii commented 4 years ago

This is the image

sample