tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

segments extra, overlapping line from upper part of a line #3611

Open wollmers opened 3 years ago

wollmers commented 3 years ago

Environment

Current Behavior:

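Tesseract segments an extra line out of the upper part of the first text line. The extra line overlaps the real one and produces noise in the output.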

Image sample:

Breitkopf-Fraktur_18g lines_1-3

hOCR result with options --psm 6 --oem 1:

     <span class='ocr_line' id='line_1_1' title="bbox 240 37 855 54; baseline 0.007 -8; x_size 19.661276; x_descenders 4; x_ascenders 5.6612754">
      <span class='ocrx_word' id='word_1_1' title='bbox 240 40 243 46; x_wconf 13'>
       <span class='ocrx_cinfo' title='x_bboxes 240 40 243 46; x_conf 87.589653'>„</span>
      </span>
      <span class='ocrx_word' id='word_1_2' title='bbox 296 40 299 46; x_wconf 0'>
       <span class='ocrx_cinfo' title='x_bboxes 296 40 299 46; x_conf 81.0252'>„</span>
      </span>
      <span class='ocrx_word' id='word_1_3' title='bbox 551 43 554 49; x_wconf 0'>
       <span class='ocrx_cinfo' title='x_bboxes 551 43 554 49; x_conf 80.892883'>‚</span>
      </span>
      <span class='ocrx_word' id='word_1_4' title='bbox 693 37 709 45; x_wconf 34'>
       <span class='ocrx_cinfo' title='x_bboxes 693 37 709 45; x_conf 90.693733'>„</span>
      </span>
      <span class='ocrx_word' id='word_1_5' title='bbox 846 43 855 54; x_wconf 0'>
       <span class='ocrx_cinfo' title='x_bboxes 846 43 855 54; x_conf 78.389648'>8</span>
      </span>
     </span>
     <span class='ocr_line' id='line_1_2' title="bbox 141 39 943 115; baseline -0.001 -17; x_size 77; x_descenders 18; x_ascenders 14">
      <span class='ocrx_word' id='word_1_6' title='bbox 141 39 943 115; x_wconf 89'>
       <span class='ocrx_cinfo' title='x_bboxes 141 40 199 98; x_conf 99.027771'>O</span>
       <span class='ocrx_cinfo' title='x_bboxes 179 39 208 115; x_conf 99.034729'>r</span>
       <span class='ocrx_cinfo' title='x_bboxes 206 51 248 96; x_conf 99.041565'>i</span>
       <span class='ocrx_cinfo' title='x_bboxes 251 50 282 109; x_conf 99.029755'>g</span>
       <span class='ocrx_cinfo' title='x_bboxes 290 52 305 96; x_conf 99.040573'>i</span>
       <span class='ocrx_cinfo' title='x_bboxes 310 51 339 96; x_conf 99.027817'>n</span>

[...]
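The overlap between the two line boxes can be checked mechanically. The following is a minimal sketch, not part of Tesseract: it reads the `bbox` values of the `ocr_line` spans with a regular expression and reports pairs of lines whose vertical extents overlap. The function names, the regex-based parsing, and the 0.5 threshold are assumptions chosen for illustration; only the two `title` attributes in `SAMPLE` come from the hOCR above.

```python
import re

# Matches the opening tag of an hOCR line span as printed above
# and captures the line id and the four bbox coordinates.
LINE_RE = re.compile(
    r"<span class='ocr_line' id='([^']+)' title=\"bbox (\d+) (\d+) (\d+) (\d+)"
)


def line_boxes(hocr):
    """Return [(line_id, (x0, y0, x1, y1)), ...] for every ocr_line span."""
    return [
        (m.group(1), tuple(int(v) for v in m.group(2, 3, 4, 5)))
        for m in LINE_RE.finditer(hocr)
    ]


def vertical_overlap(a, b):
    """Fraction of the shorter box's height that is shared with the other box."""
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def overlapping_lines(hocr, threshold=0.5):
    """Return (id_a, id_b, fraction) for line pairs overlapping vertically.

    The threshold is an arbitrary illustrative value, not a Tesseract setting.
    Horizontal positions are ignored.
    """
    boxes = line_boxes(hocr)
    hits = []
    for i, (id_a, box_a) in enumerate(boxes):
        for id_b, box_b in boxes[i + 1:]:
            frac = vertical_overlap(box_a, box_b)
            if frac >= threshold:
                hits.append((id_a, id_b, round(frac, 2)))
    return hits


# The two ocr_line opening tags from the hOCR result above.
SAMPLE = """
<span class='ocr_line' id='line_1_1' title="bbox 240 37 855 54; baseline 0.007 -8; x_size 19.661276; x_descenders 4; x_ascenders 5.6612754">
<span class='ocr_line' id='line_1_2' title="bbox 141 39 943 115; baseline -0.001 -17; x_size 77; x_descenders 18; x_ascenders 14">
"""

print(overlapping_lines(SAMPLE))
```

For the two lines shown, `line_1_1` spans y = 37 to 54 and `line_1_2` spans y = 39 to 115, so 15 of the 17 pixel rows of the extra line lie inside the real line.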

Image with baselines (blue) and line bounding boxes (green):

Breitkopf-Fraktur_18g lines_1-3 oem1 psm6 box

The first line in a single image

Breitkopf-Fraktur_18g line_1

segments perfectly with the same --psm 6 --oem 1 options:

Breitkopf-Fraktur_18g line_1 oem1 psm6 box

Expected Behavior:

Segmentation should work in simple and obvious cases. The example has no overlapping, skewed, warped or connected lines.

Suggested Fix:

The whole segmentation process needs a deep review and tests for the most common cases.

amitdo commented 3 years ago

Yes, Tesseract sometimes fails even with easy cases.

Apart from the layout analysis issue, x_wconf also looks strange.

wollmers commented 3 years ago

@amitdo

> Apart from the layout analysis issue, x_wconf also looks strange.

Sure. The original page has 19 lines and it's even more crazy.

With --oem 1 I get

1711 P F.KD kt ‚
Original-Breitkopf-Fraktur

and hOCR for the words F.KD kt in the first line:

      <span class='ocrx_word' id='word_1_3' title='bbox 830 41 935 71; x_wconf 0; x_fsize 30'>
       <span class='ocrx_cinfo' title='x_bboxes 830 41 846 49; x_conf 93.069252'>F</span>
       <span class='ocrx_cinfo' title='x_bboxes 848 41 870 71; x_conf 85.260864'>.</span>
       <span class='ocrx_cinfo' title='x_bboxes 869 44 906 71; x_conf 86.037674'>K</span>
       <span class='ocrx_cinfo' title='x_bboxes 927 57 935 69; x_conf 80.748856'>D</span>
      </span>
      <span class='ocrx_word' id='word_1_4' title='bbox 972 44 1016 102; x_wconf 92; x_fsize 30'>
       <span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 98.871193'>k</span>
       <span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 99.022659'>t</span>
      </span>

x_wconf is not reliable. It's not obvious how it is calculated.

With --oem 0 I get

I « o i« « « o O kt i
Original-Buitkops-z31a m

and hOCR for the word kt in the first line:

      <span class='ocrx_word' id='word_1_9' title='bbox 972 44 1016 102; x_wconf 76; x_font swe.fontfile_7; x_fsize 7'>
       <span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 76.119217'>k</span>
       <span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 78.345688'>t</span>
      </span>

With --oem 0, x_wconf is always the minimum of the x_conf values of the characters in the word.
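The comparison between x_wconf and the character confidences can be repeated on any hOCR file. The following is a minimal sketch under stated assumptions: it parses the hOCR line by line with regular expressions (adequate for the layout printed above, not for arbitrary hOCR), and it treats x_wconf as "equal to the minimum" when the two differ by less than 1, because the exact rounding of x_wconf is not stated in this thread. All function and variable names are hypothetical; the sample data is copied from the snippets above.

```python
import re

# Word spans carry x_wconf; character spans carry x_conf.
WCONF_RE = re.compile(r"class='ocrx_word' id='([^']+)' title='[^']*x_wconf (\d+)")
CCONF_RE = re.compile(r"class='ocrx_cinfo' title='[^']*x_conf ([\d.]+)")


def word_confidences(hocr):
    """Collect x_wconf and the list of character x_conf values per word."""
    words = []
    for line in hocr.splitlines():
        word = WCONF_RE.search(line)
        if word:
            words.append(
                {"id": word.group(1), "x_wconf": int(word.group(2)), "x_conf": []}
            )
            continue
        char = CCONF_RE.search(line)
        if char and words:
            words[-1]["x_conf"].append(float(char.group(1)))
    return words


def matches_minimum(word):
    """True if x_wconf is within 1 of the smallest character x_conf.

    The tolerance of 1 is an assumption that covers both truncation and rounding.
    """
    return abs(word["x_wconf"] - min(word["x_conf"])) < 1


# Words "F.KD" and "kt" from the --oem 1 output above.
OEM1 = """
<span class='ocrx_word' id='word_1_3' title='bbox 830 41 935 71; x_wconf 0; x_fsize 30'>
 <span class='ocrx_cinfo' title='x_bboxes 830 41 846 49; x_conf 93.069252'>F</span>
 <span class='ocrx_cinfo' title='x_bboxes 848 41 870 71; x_conf 85.260864'>.</span>
 <span class='ocrx_cinfo' title='x_bboxes 869 44 906 71; x_conf 86.037674'>K</span>
 <span class='ocrx_cinfo' title='x_bboxes 927 57 935 69; x_conf 80.748856'>D</span>
</span>
<span class='ocrx_word' id='word_1_4' title='bbox 972 44 1016 102; x_wconf 92; x_fsize 30'>
 <span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 98.871193'>k</span>
 <span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 99.022659'>t</span>
</span>
"""

# Word "kt" from the --oem 0 output above.
OEM0 = """
<span class='ocrx_word' id='word_1_9' title='bbox 972 44 1016 102; x_wconf 76; x_font swe.fontfile_7; x_fsize 7'>
 <span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 76.119217'>k</span>
 <span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 78.345688'>t</span>
</span>
"""

for label, hocr in (("oem 1", OEM1), ("oem 0", OEM0)):
    for word in word_confidences(hocr):
        print(label, word["id"], word["x_wconf"], min(word["x_conf"]),
              matches_minimum(word))
```

On these samples the --oem 0 word satisfies the minimum rule, while both --oem 1 words do not: word_1_3 reports x_wconf 0 although its lowest character confidence is above 80.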

amitdo commented 3 years ago

https://github.com/tesseract-ocr/tesseract/blob/6998c0ed71802b8eaf1318d9374ad30fe94eae91/src/api/hocrrenderer.cpp#L249-L250

https://github.com/tesseract-ocr/tesseract/blob/255d7c967516ec25628e672375e1d851df0bd82e/src/ccmain/ltrresultiterator.cpp#L95-L107

https://github.com/tesseract-ocr/tesseract/blob/255d7c967516ec25628e672375e1d851df0bd82e/src/ccmain/ltrresultiterator.cpp#L135-L142

Shreeshrii commented 2 years ago

$ tesseract -v
tesseract 5.0.1
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found NEON
 Found OpenMP 201511
 Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
 Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
$ tesseract extraline.png -
' ' ' 4 al (y
Original-Breitfopf-Fraftur
nach dev von Gottlob Inumanuel Breitfopf gegen 1750 gefdhnittenen Vreitbopf = Fraftur
Hevgeftellt von der Schriftgicfierei HH. Berthold AG in Leipyig.