tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Wrong coordinates in LSTM OCR mode with Japanese #1015

Open hoangtocdo90 opened 7 years ago

hoangtocdo90 commented 7 years ago

Hi all, I'm using Tesseract to get each character with its coordinates in an image. I'm using ResultIterator with OCR mode = 2 (LSTM) and language = jpn.

tesseract::ResultIterator* ri = api->GetIterator();
int index_char = 0;
int left, top, right, bottom;
// vector char_iterators;  (element type not shown in the post)
if (ri != nullptr) {
    do {
        char* value = ri->GetUTF8Text(tesseract::RIL_SYMBOL);
        // Treat an unknown (null or empty) value as a space.
        std::string text = (value == nullptr || *value == '\0') ? " " : value;
        float conf = ri->Confidence(tesseract::RIL_SYMBOL);
        ri->BoundingBox(tesseract::RIL_SYMBOL, &left, &top, &right, &bottom);
        index_char++;
        delete[] value;  // GetUTF8Text() results must be freed with delete[]
    } while (ri->Next(tesseract::RIL_SYMBOL));
}
api->ClearAdaptiveClassifier();

Here are my program log and input image. As you can see, I got wrong coordinates for the り character. I also tested the TSV and hOCR outputs, but they give the same wrong coordinates.

Char value = こ left= 15 top = 14 right = 51 bottom = 51 conf = 99
Char value = ん left= 64 top = 9 right = 112 bottom = 54 conf = 99
Char value = ば left= 122 top = 5 right = 171 bottom = 54 conf = 99
Char value = ん left= 176 top = 9 right = 224 bottom = 54 conf = 99
Char value = は left= 234 top = 9 right = 281 bottom = 54 conf = 99
Char value = こ left= 295 top = 14 right = 331 bottom = 51 conf = 99
Char value = ん left= 344 top = 9 right = 392 bottom = 54 conf = 99
Char value = ば left= 402 top = 5 right = 445 bottom = 54 conf = 99
Char value = ん left= 456 top = 9 right = 497 bottom = 54 conf = 99
Char value = は left= 514 top = 9 right = 561 bottom = 54 conf = 99
Char value = ご left= 15 top = 79 right = 58 bottom = 126 conf = 99
Char value = 飯 left= 62 top = 80 right = 113 bottom = 130 conf = 99
Char value = 大 left= 120 top = 80 right = 225 bottom = 130 conf = 99
Char value = 盛 left= 242 top = 83 right = 260 bottom = 130 conf = 99
Char value = り left= 2328 top = 1616 right = 2328 bottom = 1616 conf = 99
Char value = 。 left= 289 top = 116 right = 305 bottom = 131 conf = 99

jpn msgothic exp0

And one more question. I'm trying to add fonts to the jpn data, but maybe I have to retrain from scratch. I don't actually know how the jpn Tesseract data (which I downloaded from the tessdata repository) was made. I tried downloading the data from the langdata repository, making images from jpn.traintext, and training with tesstrain.sh and jTessBoxEditor, but I got lower accuracy than with the data downloaded from the repository. Can somebody tell me exactly how to make it? Sorry for my bad English.

kandaman commented 7 years ago

I have the same problem. I am using jpn.traineddata. I tried RIL_SYMBOL and RIL_WORD; RIL_SYMBOL is better. The critical problem is that the recognized characters are very good but their positions are very bad. I need pairs of image and character, don't you? If you have new information, please tell me.

thanks

hoangtocdo90 commented 7 years ago

I'm temporarily working around this as follows. I'm using RIL_SYMBOL. In my case, the wrong coordinates usually appear at the end of a line or the end of a block:

res_it->IsAtFinalElement(RIL_TEXTLINE, RIL_WORD)
res_it->IsAtFinalElement(RIL_PARA, RIL_WORD)
res_it->IsAtFinalElement(RIL_BLOCK, RIL_WORD)

When you get a wrong coordinate, you can predict a new one from the neighboring coordinates returned by the ResultIterator.
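The workaround above could be sketched roughly like this. It is only an illustration, not Tesseract code: the Box struct, isSuspect() and estimateBetween() are hypothetical helpers, the "suspect" rule is a heuristic assumption, and the estimate assumes horizontal text with a valid neighbour on each side.

```cpp
#include <algorithm>
#include <cassert>

// Axis-aligned box in image pixels, as filled in by BoundingBox().
// (Hypothetical helper type, not part of the Tesseract API.)
struct Box {
    int left, top, right, bottom;
};

// Heuristic (an assumption): a box is suspect if it is empty, lies outside
// the image, or spans the entire image.
bool isSuspect(const Box& b, int imgW, int imgH) {
    assert(imgW > 0 && imgH > 0);
    bool empty = b.right <= b.left || b.bottom <= b.top;
    bool outside = b.left < 0 || b.top < 0 || b.right > imgW || b.bottom > imgH;
    bool wholeImage = b.left == 0 && b.top == 0 &&
                      b.right == imgW && b.bottom == imgH;
    return empty || outside || wholeImage;
}

// Estimate a box for a symbol that sits between two neighbours on the same
// line of horizontal text: fill the horizontal gap between them and take the
// vertical extent from both neighbours.
Box estimateBetween(const Box& prev, const Box& next) {
    Box b;
    b.left = prev.right;
    b.right = std::max(next.left, prev.right);
    b.top = std::min(prev.top, next.top);
    b.bottom = std::max(prev.bottom, next.bottom);
    return b;
}
```

When the bad symbol is the last one on a line there is no next neighbour, so only the previous box is available and the width has to be guessed some other way (for example from the typical symbol width on that line).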

amitdo commented 7 years ago

Strange. It looks like a bug.

Shreeshrii commented 7 years ago

Please check if this is fixed by the latest set of commits by Ray.

jpn-1.txt jpn-1.tsv.txt

level   page_num    block_num   par_num line_num    word_num    left    top width   height  conf    text
1   1   0   0   0   0   0   0   2550    470 -1  
2   1   1   0   0   0   104 96  2286    348 -1  
3   1   1   1   0   0   111 96  546 49  -1  
4   1   1   1   1   0   111 96  546 49  -1  
5   1   1   1   1   1   111 105 36  37  96  こ
5   1   1   1   1   2   160 100 48  45  96  ん
5   1   1   1   1   3   218 96  49  49  96  ば
5   1   1   1   1   4   272 100 48  45  96  ん
5   1   1   1   1   5   330 100 47  45  96  は
5   1   1   1   1   6   391 105 36  37  95  こ
5   1   1   1   1   7   440 100 48  45  96  ん
5   1   1   1   1   8   498 96  49  49  96  ば
5   1   1   1   1   9   552 100 48  45  96  ん
5   1   1   1   1   10  610 100 47  45  95  は
3   1   1   2   0   0   111 170 962 52  -1  
4   1   1   2   1   0   111 170 962 52  -1  
5   1   1   2   1   1   111 170 43  47  96  ご
5   1   1   2   1   2   158 171 107 50  95  飯
5   1   1   2   1   3   271 171 50  50  96  大
5   1   1   2   1   4   338 174 29  47  96  盛
5   1   1   2   1   5   0   0   2550    470 96  り
5   1   1   2   1   6   385 207 16  15  96  。
5   1   1   2   1   7   439 172 123 50  93  今
5   1   1   2   1   8   0   0   2550    470 95  年
5   1   1   2   1   9   567 171 65  51  96  は
5   1   1   2   1   10  624 173 87  48  96  初
5   1   1   2   1   11  0   0   2550    470 95  め
5   1   1   2   1   12  722 178 43  41  96  て
5   1   1   2   1   13  776 171 48  50  94  恋
5   1   1   2   1   14  832 173 101 48  96  人
5   1   1   2   1   15  944 171 48  50  96  出
5   1   1   2   1   16  1001    173 26  46  96  来
5   1   1   2   1   17  1021    191 25  28  96  た
5   1   1   2   1   18  1057    207 16  15  96  。
3   1   1   3   0   0   106 245 2284    126 -1  
4   1   1   3   1   0   106 245 2284    51  -1  

jpn

hoangtocdo90 commented 7 years ago

Thank you, sir.

hoangtocdo90 commented 7 years ago

5 1 1 2 1 1 111 170 43 47 96 ご
5 1 1 2 1 2 158 171 107 50 95 飯
5 1 1 2 1 3 271 171 50 50 96 大
5 1 1 2 1 4 338 174 29 47 96 盛
5 1 1 2 1 5 0 0 2550 470 96 り
5 1 1 2 1 10 624 173 87 48 96 初
5 1 1 2 1 11 0 0 2550 470 95 め
5 1 1 2 1 12 722 178 43 41 96 て

Please check this. The coordinates are still wrong for the り and め characters.

GHamrouni commented 7 years ago

We are still able to reproduce it in the Arabic language in LSTM mode. Most BBoxes are correct but there are some boxes that contain valid text and wrong coordinates (the region contained in the bbox is empty).

SimonTheBaptist commented 7 years ago

I'm getting the same behavior for Thai language in LSTM - BoundingBox() often returns the whole image size. The image size was 400, 266. Here is a small portion of some results [X1, Y1; X2, Y2]. (As a side note, I'm using RIL_WORD, but it seems to behave like RIL_SYMBOL, I'm not sure why).

'ร' - Confidence: 94.3645 [0, 0; 400, 266]
'ม' - Confidence: 95.7061 [19, 68; 33, 77]
'า' - Confidence: 96.9703 [0, 0; 400, 266]
'ส' - Confidence: 96.976 [35, 67; 50, 77]

wanghaisheng commented 7 years ago

@amitdo, could you point me to more information about how Tesseract analyzes the input image to get the coordinates of words/characters, then recognizes them through LSTM or the old method, and finally combines the OCR result with the coordinates?

amitdo commented 6 years ago

@wanghaisheng

See here: https://github.com/tesseract-ocr/tesseract/blob/master/lstm/recodebeam.cpp Search for 'box', 'xcoords', 'blob'

aniseddali commented 6 years ago

This also happened to me with Arabic. Here is an example that reproduces the problem.


#include <iostream>
#include <string>
#include <vector>

#include <leptonica/allheaders.h>
#include <opencv2/opencv.hpp>
#include <tesseract/baseapi.h>

struct OcrResult
{
    std::string text;
    cv::Rect box;
};

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;
    tesseract::TessBaseAPI tesseract;
    // Init() returns 0 on success.
    if (tesseract.Init("./data/tessdata/", "ara", tesseract::OcrEngineMode::OEM_LSTM_ONLY) != 0)
        return 1;
    tesseract.SetPageSegMode(tesseract::PageSegMode::PSM_SINGLE_WORD);
    PIX *patch_pix = pixRead(argv[1]);
    if (patch_pix == nullptr)
        return 1;
    tesseract.SetImage(patch_pix);
    tesseract.Recognize(0);

    // Collect the text and bounding box of every recognized word.
    std::vector<OcrResult> ocrResults;
    tesseract::ResultIterator *ri = tesseract.GetIterator();
    tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
    if (ri != nullptr)
    {
        do
        {
            char *word = ri->GetUTF8Text(level);
            int left, top, right, bottom;
            if (word != nullptr && ri->BoundingBox(level, &left, &top, &right, &bottom))
            {
                OcrResult res;
                res.box = cv::Rect(left, top, right - left, bottom - top);
                res.text = std::string(word);
                ocrResults.push_back(res);
            }
            delete[] word;  // GetUTF8Text() results must be freed with delete[]
        } while (ri->Next(level));
        delete ri;
    }
    pixDestroy(&patch_pix);

    // Draw each box on a copy of the input image, one key press per box.
    cv::Mat image = cv::imread(argv[1]);
    cv::Mat DrawingImg = image.clone();
    for (size_t i = 0; i < ocrResults.size(); i++)
    {
        cv::Rect rect = ocrResults[i].box;
        cv::rectangle(DrawingImg, rect, cv::Scalar(255, 0, 0), 1);
        std::cout << ocrResults[i].text << std::endl;
        cv::imshow("DrawingImg", DrawingImg);
        cv::waitKey();
    }
    return 0;
}

I used OpenCV to draw the boxes. This image contains two Arabic words. The recognition is correct for both words, but the box position of the first word (the one on the right) is wrong: the box matches some noise at the top of the image.

boxes

And this is the original image.

box

lpatruno commented 6 years ago

I'm also getting this bug for English text, though I can't provide the data files as they contain PII.

atuyosi commented 6 years ago

I got the same issue on beta.4 with jpn.traineddata.

In my case, the invalid coordinate values match the image size (width, height). Even for the same letter, the result is incorrect only at some positions in the image.

$ tesseract -l jpn  'images/sample/test-jpn_01.jpg' stdout tsv | grep 596
Warning. Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 167
1   1   0   0   0   0   0   0   596 118 -1
5   1   1   1   1   10  0   0   596 118 92  、
5   1   1   1   2   4   0   0   596 118 92  字
5   1   1   1   2   10  0   0   596 118 97  て
5   1   1   1   2   15  0   0   596 118 93  す

My test image size is 596x118. The same letter appears multiple times (e.g. '字', 'て'), but its bounding box is wrong only once.

test-jpn_01

FYI, in the above image, the character '日' is recognized incorrectly by jpn.traineddata (traineddata_fast).
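The grep check above (finding rows whose box equals the image size) could also be done in code. Below is a minimal C++ sketch, assuming the TSV column order shown in the output above; TsvRow, parseRow() and pageSizedWords() are hypothetical names for illustration and are not part of Tesseract.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One row of the TSV output: level, page_num, block_num, par_num, line_num,
// word_num, left, top, width, height, conf, text.
struct TsvRow {
    int level, page, block, par, line, word;
    int left, top, width, height;
    float conf;
    std::string text;
};

// Parse one TSV row. Returns false for the header row or any line that does
// not start with the eleven numeric columns.
bool parseRow(const std::string& line, TsvRow& row) {
    std::istringstream in(line);
    if (!(in >> row.level >> row.page >> row.block >> row.par >> row.line
             >> row.word >> row.left >> row.top >> row.width >> row.height
             >> row.conf)) {
        return false;
    }
    row.text.clear();
    in >> row.text;  // stays empty on rows without text
    return true;
}

// Collect the level-5 rows (words) whose box is identical to the page box
// reported by the level-1 row.
std::vector<TsvRow> pageSizedWords(const std::string& tsv) {
    std::vector<TsvRow> suspects;
    TsvRow page{};
    bool havePage = false;
    std::istringstream in(tsv);
    std::string line;
    while (std::getline(in, line)) {
        TsvRow row;
        if (!parseRow(line, row)) continue;
        if (row.level == 1) {
            page = row;
            havePage = true;
        } else if (row.level == 5 && havePage &&
                   row.left == page.left && row.top == page.top &&
                   row.width == page.width && row.height == page.height) {
            suspects.push_back(row);
        }
    }
    return suspects;
}
```

Passing the whole TSV text to pageSizedWords() returns the rows that would need their boxes re-estimated; rows with normal boxes are left out.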

amitdo commented 6 years ago

Same issue as #1192