tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

GetComponentImages cannot get all rectangles of texts #1231

Open Juddd opened 6 years ago

Juddd commented 6 years ago

Environment

Current Behavior:

I know this question belongs on the forum: http://groups.google.com/group/tesseract-ocr. But I think this may be a bug in GetComponentImages, so I am posting it here. I want to find the bounding rectangles of all symbols with GetComponentImages. This is my code (taken from the official documentation):

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
//#include <opencv.hpp>
//using namespace cv;

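int main()
{
    const char *str = "targetimage.png";
    Pix *image = pixRead(str);
    if (image == NULL) {
        fprintf(stderr, "Could not read %s.\n", str);
        return 1;
    }
    //Mat mat_image = imread(str);
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Init returns 0 on success.
    if (api->Init(NULL, "eng") != 0) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        pixDestroy(&image);
        delete api;
        return 1;
    }
    api->SetImage(image);
    api->SetSourceResolution(300);
    // Request one box per symbol (character).
    Boxa* boxes = api->GetComponentImages(tesseract::RIL_SYMBOL, false, NULL, NULL);
    int count = boxes ? boxaGetCount(boxes) : 0;
    printf("Found %d symbol image components.\n", count);
    for (int i = 0; i < count; i++) {
        BOX* box = boxaGetBox(boxes, i, L_CLONE);
        l_int32 x, y, w, h;
        boxGetGeometry(box, &x, &y, &w, &h);
        // Restrict recognition to this box, then read its text and confidence.
        api->SetRectangle(x, y, w, h);
        //rectangle(mat_image, Rect(x, y, w, h), Scalar(0,0,255), -1);
        char* ocrResult = api->GetUTF8Text();
        int conf = api->MeanTextConf();
        fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
            i, x, y, w, h, conf, ocrResult);
        delete[] ocrResult;  // GetUTF8Text allocates with new[]
        boxDestroy(&box);    // release the clone from boxaGetBox
    }

    // Destroy used objects and release memory
    boxaDestroy(&boxes);
    api->End();
    delete api;
    pixDestroy(&image);

    return 0;
}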

If "targetimage.png" is this image:

then I get these rectangles:

Note that I used OpenCV to draw the rectangles here.

If "targetimage.png" is this image:

I get these rectangles:

As you can see, Tesseract omits some rectangles every time. Is this a bug in GetComponentImages? If it is, how can I fix it to reach my goal? I know there are methods based on other libraries that can do this, but in my case I have to use Tesseract. This question has confused me for a long time. Please give me some information.

Expected Behavior:

I expect GetComponentImages to return all of the rectangles.

Juddd commented 6 years ago

@egorpugin I'm sure this is a bug in GetComponentImages, please fix it. Maybe GetConnectedComponents can help me, but I don't know how to use it.
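
For readers unfamiliar with the term, the idea behind connected-component boxes can be sketched without any OCR library. The snippet below is only an illustration of the general technique (flood fill over a binary image, one bounding box per group of touching foreground pixels). It is not Tesseract's implementation and does not call the Tesseract API; `Rect` and `findComponentBoxes` are hypothetical names, and the image is assumed to be given as text rows where '#' marks a foreground pixel.

```cpp
#include <algorithm>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: an axis-aligned bounding box in pixel coordinates.
struct Rect {
    int x, y, w, h;
};

// Hypothetical helper: returns one bounding box per 8-connected group of
// foreground pixels ('#'). Assumes all rows have the same length.
// Boxes are returned in scan order of each component's first pixel.
std::vector<Rect> findComponentBoxes(const std::vector<std::string>& img) {
    std::vector<Rect> boxes;
    const int rows = static_cast<int>(img.size());
    if (rows == 0) return boxes;
    const int cols = static_cast<int>(img[0].size());
    std::vector<std::vector<bool> > seen(rows, std::vector<bool>(cols, false));

    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            if (img[r][c] != '#' || seen[r][c]) continue;

            // Breadth-first flood fill from this unvisited foreground pixel,
            // tracking the extent of the component as it grows.
            int minX = c, maxX = c, minY = r, maxY = r;
            std::queue<std::pair<int, int> > pending;
            pending.push(std::make_pair(r, c));
            seen[r][c] = true;

            while (!pending.empty()) {
                const std::pair<int, int> cur = pending.front();
                pending.pop();
                minY = std::min(minY, cur.first);
                maxY = std::max(maxY, cur.first);
                minX = std::min(minX, cur.second);
                maxX = std::max(maxX, cur.second);

                // Visit all 8 neighbours that are inside the image.
                for (int dr = -1; dr <= 1; ++dr) {
                    for (int dc = -1; dc <= 1; ++dc) {
                        const int nr = cur.first + dr;
                        const int nc = cur.second + dc;
                        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
                        if (img[nr][nc] != '#' || seen[nr][nc]) continue;
                        seen[nr][nc] = true;
                        pending.push(std::make_pair(nr, nc));
                    }
                }
            }
            Rect box = {minX, minY, maxX - minX + 1, maxY - minY + 1};
            boxes.push_back(box);
        }
    }
    return boxes;
}
```

Boxes found this way follow pixel connectivity only, so a character made of several separate strokes (such as "i") would yield more than one box, and touching characters would share one.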