tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Text fragments of neighboring lines repeated #3489

Open wollmers opened 3 years ago

wollmers commented 3 years ago

Environment

Current Behavior:

Tesseract repeats fragments of the preceding or following line in the text it outputs for a line.

It looks as if some in-memory structures are not cleared between lines.

Expected Behavior:

The text of each line should be read straight through from that line only.

Example:

Image https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.jp2

Processed with --psm 4 and different values of thresholding_method.
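For anyone trying to reproduce this, here is a minimal sketch of how the three runs could be scripted. It only builds and prints the command lines. The output base names are my guess from the file names of the linked diffs, and the issue does not say which language model was used, so no `-l` option is included; add the one you need. Reading `.jp2` directly also assumes a Leptonica build with JPEG 2000 support.

```python
import shlex


def tesseract_command(image, out_base, thresholding_method, psm=4):
    """Build a tesseract CLI call that writes hOCR for one thresholding method."""
    return [
        "tesseract", image, out_base,
        "--psm", str(psm),
        "-c", f"thresholding_method={thresholding_method}",
        "hocr",  # config file name: produce out_base.hocr
    ]


# Hypothetical file names, derived from the names of the linked diff files.
IMAGE = "isisvonoken1826oken_0137.jp2"

for method in (0, 1, 2):
    out_base = f"isisvonoken1826oken_0137.psm4.thresh{method}"
    print(shlex.join(tesseract_command(IMAGE, out_base, method)))
```

To actually run a command, the list can be passed to `subprocess.run(cmd, check=True)`.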

Diff of ground truth (GRT) versus OCR: https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh0.diff.txt

# see HERE ------------------------------------------------------------------vvvvvvvvvvvvvvvvvv
☛ Damit ſich Niemand vergeblich bemuͤhe, ſo wird hiemit angezeigt, daß in die Jſis keine politi⸗¶
--|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||-|-~--|~---------||-| 0.812
__Damit ſich Niemand vergeblich bemuͤhe, ſo wird hiemit angezeigt, daß in di_ _j__ j_________ti_¶

ſchen Aufſaͤtze aufgenommen werden.__________________¶
-||||||||~~|||||||||||||||||||||||++++++++++++++++++| 0.604
_chen Auffätze aufgenommen werden. Iſs keine voliti⸗¶
# and HERE ------------------------AAAAAAAAAAAAAAAAA
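
A rough way to flag this symptom automatically is to look for a long substring that the OCR output of one line shares with the ground truth of a neighbouring line. This is only a sketch: the function name and the length threshold are my own choices for illustration, and the strings are copied from the diff above with the alignment padding and the line marks removed.

```python
from difflib import SequenceMatcher


def longest_shared_fragment(ocr_line, neighbour_line):
    """Return the longest substring that occurs in both strings."""
    matcher = SequenceMatcher(None, ocr_line, neighbour_line, autojunk=False)
    m = matcher.find_longest_match(0, len(ocr_line), 0, len(neighbour_line))
    return ocr_line[m.a:m.a + m.size]


# OCR output for the second line and ground truth for the first line.
ocr_line_2 = "chen Auffätze aufgenommen werden. Iſs keine voliti⸗"
gt_line_1 = ("Damit ſich Niemand vergeblich bemuͤhe, ſo wird hiemit angezeigt, "
             "daß in die Jſis keine politi⸗")

MIN_LEN = 6  # arbitrary threshold, an assumption for illustration

fragment = longest_shared_fragment(ocr_line_2, gt_line_1)
if len(fragment) >= MIN_LEN:
    print(f"possible repetition from neighbouring line: {fragment!r}")
```

Short shared fragments are normal in running text, so the threshold would need tuning on real pages.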

https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh1.diff.txt

# see HERE ---------------------------------------------------------------------------vvvvvvvvv
☛_ Damit ſich Niemand vergeblich bemuͤhe, ſo wird hiemit angezeigt, daß in die Jſis keine politi⸗_¶
~+||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||-~--------|---|||~+| 0.827
E. Damit ſich Niemand vergeblich bemuͤhe, ſo wird hiemit angezeigt, daß in die _j________ ___iti/,¶

ſchen Aufſaͤtze aufgenommen werden._______¶
|||||||~|~~|||||||||||||||||||||||+++++++| 0.762
ſchen Aüffäͤtze aufgenommen werden. ö titi¶
# and HERE ------------------------AAAAAA

https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh2.diff.txt

_____________________________________________________________
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 0.000
D, Damit ſich Niemand vergeblich bemuͤ ird hiemi i in di f ki¶

☛___ Damit ſich Ni__e___man___d verg___eblich bemuͤhe, ſo wird hiem_it angezeigt, daß in die Jſis keine politi⸗¶
~+++|~|-~|--~~-|~~++|+++|~|+++|--|-~+++||||||||||~||||||||||||~||~+|||||||~||~|-||||||||||||~||||||||||~|||||~| 0.640
ſwen Aa_ſt__te_ aufgenommen Wed__e_n⸗ aeblich bemühe, ſo wird bieniit angeleist_ daß in die Iſis keine voliti/¶

ſchen Aufſaͤtze__ aufgenommen werden.______________________________________¶
~----|~~~~~~~~++|~~~--------|~~----|++++++++++++++++++++++++++++++++++++++| 0.067
—____ —!-T-T—ͤ—ê ͤ b——________ ö'____.'ꝛ —¼—¼—¼T¶¶¶¶¶¶P¶ ‚»ÜX(X—j‚88———3————¶

As another symptom, the bounding boxes of the lines overlap vertically in the hOCR output, i.e. Tesseract calculates them incorrectly.

https://github.com/wollmers/ocr-tess-issues/blob/main/issues/issue_3083_binarisation/isisvonoken1826oken_0137.psm4.thresh2.hocr

     <span class='ocr_line' id='line_1_21' title="bbox 150 2356 1827 2392; baseline 0.002 -7; x_size 37.795956; x_descenders 5.7959566; x_ascenders 12">
      <span class='ocrx_word' id='word_1_163' title='bbox 150 2356 217 2392; x_wconf 5'>D,</span>
      <span class='ocrx_word' id='word_1_164' title='bbox 236 2352 358 2399; x_wconf 64'>Damit</span>
      <span class='ocrx_word' id='word_1_165' title='bbox 373 2352 444 2399; x_wconf 92'>ſich</span>
      <span class='ocrx_word' id='word_1_166' title='bbox 463 2352 585 2399; x_wconf 91'>Niemand</span>
      <span class='ocrx_word' id='word_1_167' title='bbox 585 2359 736 2390; x_wconf 91'>vergeblich</span>
      <span class='ocrx_word' id='word_1_168' title='bbox 794 2358 863 2390; x_wconf 93'>bemuͤ</span>
      <span class='ocrx_word' id='word_1_169' title='bbox 1025 2359 1030 2364; x_wconf 93'>ird</span>
      <span class='ocrx_word' id='word_1_170' title='bbox 1103 2358 1162 2364; x_wconf 92'>hiemi</span>
      <span class='ocrx_word' id='word_1_171' title='bbox 1287 2361 1294 2367; x_wconf 55'>i</span>
      <span class='ocrx_word' id='word_1_172' title='bbox 1447 2363 1454 2369; x_wconf 91'>in</span>
      <span class='ocrx_word' id='word_1_173' title='bbox 1511 2363 1518 2369; x_wconf 91'>di</span>
      <span class='ocrx_word' id='word_1_174' title='bbox 1656 2366 1663 2373; x_wconf 35'>f</span>
      <span class='ocrx_word' id='word_1_175' title='bbox 1784 2370 1827 2376; x_wconf 0'>ki</span>
     </span>
     <span class='ocr_line' id='line_1_22' title="bbox 30 2349 1846 2436; baseline -0.01 -8; x_size 43.665352; x_descenders 6.6653504; x_ascenders 14">
      <span class='ocrx_word' id='word_1_176' title='bbox 30 2392 117 2430; x_wconf 16'>ſwen</span>
      <span class='ocrx_word' id='word_1_177' title='bbox 142 2349 258 2436; x_wconf 15'>Aaſtte</span>
      <span class='ocrx_word' id='word_1_178' title='bbox 277 2363 487 2433; x_wconf 76'>aufgenommen</span>
      <span class='ocrx_word' id='word_1_179' title='bbox 497 2366 624 2423; x_wconf 0'>Weden⸗</span>
      <span class='ocrx_word' id='word_1_180' title='bbox 644 2358 765 2395; x_wconf 52'>aeblich</span>
      <span class='ocrx_word' id='word_1_181' title='bbox 811 2360 915 2395; x_wconf 77'>bemühe,</span>
      <span class='ocrx_word' id='word_1_182' title='bbox 943 2359 970 2394; x_wconf 88'>ſo</span>
      <span class='ocrx_word' id='word_1_183' title='bbox 997 2360 1062 2391; x_wconf 86'>wird</span>
      <span class='ocrx_word' id='word_1_184' title='bbox 1082 2360 1175 2394; x_wconf 48'>bieniit</span>
      <span class='ocrx_word' id='word_1_185' title='bbox 1192 2364 1340 2398; x_wconf 19'>angeleist</span>
      <span class='ocrx_word' id='word_1_186' title='bbox 1368 2363 1418 2399; x_wconf 88'>daß</span>
      <span class='ocrx_word' id='word_1_187' title='bbox 1444 2371 1473 2396; x_wconf 77'>in</span>
      <span class='ocrx_word' id='word_1_188' title='bbox 1491 2365 1532 2397; x_wconf 89'>die</span>
      <span class='ocrx_word' id='word_1_189' title='bbox 1548 2364 1608 2413; x_wconf 82'>Iſis</span>
      <span class='ocrx_word' id='word_1_190' title='bbox 1625 2366 1697 2402; x_wconf 92'>keine</span>
      <span class='ocrx_word' id='word_1_191' title='bbox 1713 2370 1846 2405; x_wconf 50'>voliti/</span>
     </span>
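
The overlap can be checked mechanically. Below is a minimal sketch that pulls the `ocr_line` bounding boxes out of hOCR text with a regular expression and measures how many pixels consecutive lines share vertically. The regular expression is tied to the attribute order and quoting shown in the excerpt above, so treat it as an assumption; a real HTML parser would be more robust.

```python
import re

# Matches the opening tag of an ocr_line span as it appears in the excerpt.
LINE_RE = re.compile(
    r"""class='ocr_line'\s+id='([^']+)'\s+title="bbox (\d+) (\d+) (\d+) (\d+)"""
)


def line_boxes(hocr):
    """Return (line_id, x0, y0, x1, y1) for every ocr_line in the hOCR text."""
    boxes = []
    for m in LINE_RE.finditer(hocr):
        x0, y0, x1, y1 = (int(v) for v in m.groups()[1:])
        boxes.append((m.group(1), x0, y0, x1, y1))
    return boxes


def vertical_overlap(a, b):
    """Number of pixels shared by the y-ranges of two boxes (0 if disjoint)."""
    return max(0, min(a[4], b[4]) - max(a[2], b[2]))


# The two line spans from the excerpt, with the word spans left out.
HOCR = """
<span class='ocr_line' id='line_1_21' title="bbox 150 2356 1827 2392; baseline 0.002 -7">
</span>
<span class='ocr_line' id='line_1_22' title="bbox 30 2349 1846 2436; baseline -0.01 -8">
</span>
"""

boxes = line_boxes(HOCR)
for prev, cur in zip(boxes, boxes[1:]):
    print(prev[0], cur[0], "overlap:", vertical_overlap(prev, cur), "px")
```

For the two lines above, the shared range is 2356 to 2392, i.e. 36 px, which is the full height of line_1_21.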
wollmers commented 3 years ago

It seems more likely that Tesseract cannot handle warped lines that overlap vertically.

It wrongly estimates the baseline as a nearly horizontal straight line rather than a polyline (or curve), then scans along that wrong baseline and loses the full height of the characters at the end of line_1_21.

In the next line, the characters of the previous line then get scanned.

It's a problem of segmentation into lines, and also of deskewing and dewarping.

If I segment the lines into image files with another tool, Tesseract gives good results (CER ~4%) with --psm 7 on the warped line. If the line is also dewarped, the result is nearly perfect (one noise character from a speckle, one I/J mismatch coming from training).
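
For reference, a minimal sketch of how a character error rate like the one quoted above can be computed: Levenshtein distance divided by the length of the ground truth. This is an assumption about the metric, not necessarily the exact tool behind the numbers in this report. It counts Unicode code points, so combining marks such as the one in "aͤ" count as separate characters.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def cer(ground_truth, ocr):
    """Character error rate relative to the ground-truth length."""
    if not ground_truth:
        return 0.0 if not ocr else 1.0
    return levenshtein(ground_truth, ocr) / len(ground_truth)


# Strings taken from the thresh1 diff, padding and line marks removed.
gt = "ſchen Aufſaͤtze aufgenommen werden."
ocr = "ſchen Aüffäͤtze aufgenommen werden. ö titi"
print(f"CER: {cer(gt, ocr):.3f}")
```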

amitdo commented 2 years ago

Indeed, the textline finding algorithm in Tesseract can't cope with overlapping lines.

stweil commented 2 years ago

Is this a regression, or is it a bug that has existed for a long time?

amitdo commented 2 years ago

I don't think this is a regression, but without testing and comparing to previous versions, I can't say with total confidence it's not a regression.