tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Wrong coordinates on character level #2521

Closed kamrapooja closed 5 years ago

kamrapooja commented 5 years ago

Hi, I am creating a project to extract coordinates on character level. Sample code:

    do
    {
        OCR.OCRCHAR OCRChar = new OCR.OCRCHAR();
        string symbol = iter.GetText(PageIteratorLevel.Symbol);
        OCRChar.sValue = symbol[0];
        OCRChar.fConfidence = iter.GetConfidence(PageIteratorLevel.Symbol);
        Rect sym_bound = new Rect();
        iter.TryGetBoundingBox(PageIteratorLevel.Symbol, out sym_bound);
    } while (iter.Next(PageIteratorLevel.Word, PageIteratorLevel.Symbol));

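I passed an image containing digits only, e.g. 6835. The bounding boxes of adjacent digits overlap: the left coordinate of a digit is less than the right coordinate of the digit before it (for example, 3 starts before 8 ends, and 5 starts before 3 ends).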

Please find the attached file for reference: [image: final_0]

Box file output:
6 42 10 55 28 0
8 56 10 69 29 0
3 57 10 81 30 0
5 70 11 97 30 0
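The overlap described above can be checked mechanically. The following is a minimal sketch, assuming the usual box-file column order (symbol, left, bottom, right, top, page, with the origin at the bottom-left of the image). The helper names `parse_boxes` and `horizontal_overlaps` are made up for this illustration and are not part of Tesseract.

```python
# Box-file columns (assumed): symbol, left, bottom, right, top, page.
OLD_BOXES = """\
6 42 10 55 28 0
8 56 10 69 29 0
3 57 10 81 30 0
5 70 11 97 30 0
"""


def parse_boxes(text):
    """Parse box-file text into (symbol, left, bottom, right, top) tuples."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        boxes.append((parts[0], left, bottom, right, top))
    return boxes


def horizontal_overlaps(boxes):
    """Return (previous, current, overlap_px) for adjacent boxes that overlap in x."""
    found = []
    for prev, cur in zip(boxes, boxes[1:]):
        overlap = prev[3] - cur[1]  # previous right edge minus current left edge
        if overlap > 0:
            found.append((prev[0], cur[0], overlap))
    return found


for prev_sym, cur_sym, px in horizontal_overlaps(parse_boxes(OLD_BOXES)):
    print(f"{cur_sym!r} starts {px} px before {prev_sym!r} ends")
```

With the reported output, only the last two boxes are flagged; the boxes for 6 and 8 do not overlap.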

Shreeshrii commented 5 years ago

The neural network (LSTM) based Tesseract engine is trained on line images and will not provide accurate character-level boxes.

kamrapooja commented 5 years ago

What should I do to get accurate character coordinates?

stweil commented 5 years ago

@noahmetzger, will the character coordinates be better with your pull request #2554, or was that different code?

noahmetzger commented 5 years ago

I don't think so, since those coordinates only relate to the line images. We will have better coordinates for the line images, but still the same problem for the full image.

noahmetzger commented 5 years ago

Actually, the algorithms from my choice_mode approach can be used to improve the bounding boxes. I will prepare a pull request tomorrow. Here is your picture with the old bounding box algorithm: [image: rbbOld]

Here it is with the new one: [image: rbbNew]

Shreeshrii commented 5 years ago

@noahmetzger This is great.

Please also check whether it will fix these other related issues.

https://github.com/tesseract-ocr/tesseract/issues/2024
https://github.com/tesseract-ocr/tesseract/issues/1276
https://github.com/tesseract-ocr/tesseract/issues/1883

kamrapooja commented 5 years ago

That's great. Please tell me what changes I need to make for the same.

stweil commented 5 years ago

This should be fixed by pull request #2576.

Shreeshrii commented 5 years ago

> Please tell me what changes I need to make for the same.

Build Tesseract from the latest commit on the master branch.

Here are the results with your image. Note the change in the 3rd and 4th lines.

 tesseract 6835.jpg  - -l eng  --tessdata-dir ~/tessdata_best --dpi 300 --oem 1 makebox

6 42 10 55 28 0
8 56 10 69 29 0
3 70 11 82 29 0
5 84 12 97 30 0

Here is the old output as reported by you:

Box file output :
6 42 10 55 28 0
8 56 10 69 29 0
3 57 10 81 30 0
5 70 11 97 30 0
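
To make the difference between the two outputs explicit, here is a small sketch that compares them line by line and computes the horizontal gap between neighbouring boxes. It assumes the box-file column order symbol, left, bottom, right, top, page; the function names are hypothetical helpers written for this comparison.

```python
OLD = """\
6 42 10 55 28 0
8 56 10 69 29 0
3 57 10 81 30 0
5 70 11 97 30 0
"""

NEW = """\
6 42 10 55 28 0
8 56 10 69 29 0
3 70 11 82 29 0
5 84 12 97 30 0
"""


def parse(text):
    """Parse box-file text into (symbol, left, bottom, right, top) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6:
            rows.append((parts[0], *map(int, parts[1:5])))
    return rows


def changed_lines(old, new):
    """Return 1-based line numbers whose boxes differ between the two outputs."""
    pairs = zip(parse(old), parse(new))
    return [i for i, (a, b) in enumerate(pairs, start=1) if a != b]


def gaps(rows):
    """Horizontal gap in pixels to the next box; a negative value means overlap."""
    return [cur[1] - prev[3] for prev, cur in zip(rows, rows[1:])]


print("changed lines:", changed_lines(OLD, NEW))
print("old gaps:", gaps(parse(OLD)))
print("new gaps:", gaps(parse(NEW)))
```

All gaps in the new output are positive, so the boxes no longer overlap horizontally.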

Shreeshrii commented 5 years ago

tesseract 6835.jpg  - -l eng  --tessdata-dir ~/tessdata_best --dpi 300 --oem 1 -c hocr_char_boxes=1     hocr
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.0-alpha-322-g74ac' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "6835.jpg"; bbox 0 0 103 50; ppageno 0'>
   <div class='ocr_carea' id='block_1_1' title="bbox 42 20 97 40">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 42 20 97 40">
     <span class='ocr_header' id='line_1_1' title="bbox 42 20 97 40; baseline -0.036 0; x_size 24.888889; x_descenders 6.2222223; x_ascenders 6.2222223">
      <span class='ocrx_word' id='word_1_1' title='bbox 42 20 97 40; x_wconf 96'>
       <span class='ocrx_cinfo' title='x_bboxes 42 22 55 40; x_conf 99.55127'>6</span>
       <span class='ocrx_cinfo' title='x_bboxes 56 21 69 40; x_conf 99.571632'>8</span>
       <span class='ocrx_cinfo' title='x_bboxes 70 21 82 39; x_conf 99.567482'>3</span>
       <span class='ocrx_cinfo' title='x_bboxes 84 20 97 38; x_conf 99.495033'>5</span>
      </span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>
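
The hOCR character boxes and the makebox output describe the same rectangles in different coordinate systems: hOCR measures y from the top of the page, while the box file measures it from the bottom. The sketch below extracts the `ocrx_cinfo` boxes and converts them to box-file lines. It assumes the page height of 50 taken from the `ocr_page` bbox above, and it uses a regular expression tied to the exact attribute layout shown here; a real HTML/XML parser would be more robust. The names `hocr_chars` and `to_box_line` are invented for this example.

```python
import re

# Character spans copied from the hOCR output above.
HOCR_SNIPPET = """\
<span class='ocrx_cinfo' title='x_bboxes 42 22 55 40; x_conf 99.55127'>6</span>
<span class='ocrx_cinfo' title='x_bboxes 56 21 69 40; x_conf 99.571632'>8</span>
<span class='ocrx_cinfo' title='x_bboxes 70 21 82 39; x_conf 99.567482'>3</span>
<span class='ocrx_cinfo' title='x_bboxes 84 20 97 38; x_conf 99.495033'>5</span>
"""

PAGE_HEIGHT = 50  # from the ocr_page title: bbox 0 0 103 50

# Assumption: attributes appear exactly in this order and quoting style.
CINFO = re.compile(
    r"<span class='ocrx_cinfo' title='x_bboxes (\d+) (\d+) (\d+) (\d+); "
    r"x_conf ([\d.]+)'>(.*?)</span>"
)


def hocr_chars(hocr):
    """Return (symbol, left, top, right, bottom, confidence) per character span."""
    chars = []
    for m in CINFO.finditer(hocr):
        left, top, right, bottom = map(int, m.group(1, 2, 3, 4))
        chars.append((m.group(6), left, top, right, bottom, float(m.group(5))))
    return chars


def to_box_line(symbol, left, top, right, bottom, page_height, page=0):
    """Convert a top-left-origin hOCR box to a bottom-left-origin box-file line."""
    return f"{symbol} {left} {page_height - bottom} {right} {page_height - top} {page}"


for sym, left, top, right, bottom, _conf in hocr_chars(HOCR_SNIPPET):
    print(to_box_line(sym, left, top, right, bottom, PAGE_HEIGHT))
```

For this image the converted lines match the new makebox output line for line.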

Shreeshrii commented 5 years ago

@noahmetzger Thank you for fixing this.

@zdenop The issue can be closed.