Open. wollmers opened this issue 3 years ago
Yes, Tesseract sometimes fails even with easy cases.
Apart from the layout analysis issue, x_wconf also looks strange.
@amitdo: "Apart from the layout analysis issue, x_wconf also looks strange."
Sure. The original page has 19 lines, and the result there is even crazier.
With --oem 1 I get:
1711 P F.KD kt ‚
Original-Breitkopf-Fraktur
and this hOCR for the words F.KD kt in the first line:
<span class='ocrx_word' id='word_1_3' title='bbox 830 41 935 71; x_wconf 0; x_fsize 30'>
<span class='ocrx_cinfo' title='x_bboxes 830 41 846 49; x_conf 93.069252'>F</span>
<span class='ocrx_cinfo' title='x_bboxes 848 41 870 71; x_conf 85.260864'>.</span>
<span class='ocrx_cinfo' title='x_bboxes 869 44 906 71; x_conf 86.037674'>K</span>
<span class='ocrx_cinfo' title='x_bboxes 927 57 935 69; x_conf 80.748856'>D</span>
</span>
<span class='ocrx_word' id='word_1_4' title='bbox 972 44 1016 102; x_wconf 92; x_fsize 30'>
<span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 98.871193'>k</span>
<span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 99.022659'>t</span>
</span>
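With --oem 1, x_wconf is not reliable: word_1_3 gets x_wconf 0 although its character x_conf values range from about 81 to 93, and word_1_4 gets 92 although both of its characters are near 99. It's not obvious how it is calculated.
With --oem 0 I get: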
I « o i« « « o O kt i
Original-Buitkops-z31a m
and this hOCR for the word kt in the first line:
<span class='ocrx_word' id='word_1_9' title='bbox 972 44 1016 102; x_wconf 76; x_font swe.fontfile_7; x_fsize 7'>
<span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 76.119217'>k</span>
<span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 78.345688'>t</span>
</span>
With --oem 0, x_wconf is always the minimum of the x_conf values of the included characters. Here the minimum of 76.119217 and 78.345688 is reported as 76.
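The comparison above can be checked mechanically. The following is a minimal sketch, not part of Tesseract: it assumes hOCR shaped like the excerpts in this thread (single-quoted attributes, `ocrx_cinfo` spans nested in `ocrx_word` spans), and it assumes that the integer x_wconf should equal the truncated minimum of the character x_conf values. The helper names `word_confidences` and `wconf_is_min` are made up for this illustration.

```python
import re

# Matches the opening tag of a word or character span and captures its title.
TOKEN_RE = re.compile(r"<span class='(ocrx_word|ocrx_cinfo)'[^>]*?title='([^']*)'")
WCONF_RE = re.compile(r"x_wconf (\d+)")
CONF_RE = re.compile(r"\bx_conf ([\d.]+)")


def word_confidences(hocr):
    """Return one (x_wconf, [x_conf, ...]) pair per ocrx_word, in document order."""
    words = []
    for kind, title in TOKEN_RE.findall(hocr):
        if kind == "ocrx_word":
            words.append((int(WCONF_RE.search(title).group(1)), []))
        elif words:
            # A character span belongs to the most recently opened word.
            words[-1][1].append(float(CONF_RE.search(title).group(1)))
    return words


def wconf_is_min(wconf, confs):
    """Assumption: x_wconf is the minimum character x_conf, truncated to an int."""
    return bool(confs) and wconf == int(min(confs))


# Words copied from the hOCR excerpts above:
# word_1_3 and word_1_4 are from --oem 1, word_1_9 is from --oem 0.
HOCR_SAMPLE = """
<span class='ocrx_word' id='word_1_3' title='bbox 830 41 935 71; x_wconf 0; x_fsize 30'>
<span class='ocrx_cinfo' title='x_bboxes 830 41 846 49; x_conf 93.069252'>F</span>
<span class='ocrx_cinfo' title='x_bboxes 848 41 870 71; x_conf 85.260864'>.</span>
<span class='ocrx_cinfo' title='x_bboxes 869 44 906 71; x_conf 86.037674'>K</span>
<span class='ocrx_cinfo' title='x_bboxes 927 57 935 69; x_conf 80.748856'>D</span>
</span>
<span class='ocrx_word' id='word_1_4' title='bbox 972 44 1016 102; x_wconf 92; x_fsize 30'>
<span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 98.871193'>k</span>
<span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 99.022659'>t</span>
</span>
<span class='ocrx_word' id='word_1_9' title='bbox 972 44 1016 102; x_wconf 76; x_font swe.fontfile_7; x_fsize 7'>
<span class='ocrx_cinfo' title='x_bboxes 972 44 992 102; x_conf 76.119217'>k</span>
<span class='ocrx_cinfo' title='x_bboxes 997 51 1016 102; x_conf 78.345688'>t</span>
</span>
"""

for wconf, confs in word_confidences(HOCR_SAMPLE):
    print(wconf, int(min(confs)), wconf_is_min(wconf, confs))
```

On these three words, only the --oem 0 word (word_1_9) satisfies the minimum rule; both --oem 1 words report an x_wconf that differs from the minimum of their character confidences.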
Environment:
$ tesseract -v
tesseract 5.0.1
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found NEON
Found OpenMP 201511
Found libarchive 3.2.2 zlib/1.2.11 liblzma/5.2.2 bz2lib/1.0.6 liblz4/1.7.1
Found libcurl/7.58.0 NSS/3.35 zlib/1.2.11 libidn2/2.0.4 libpsl/0.19.1 (+libidn2/2.0.4) nghttp2/1.30.0 librtmp/2.3
Current Behavior:
$ tesseract extraline.png -
' ' ' 4 al (y
Original-Breitfopf-Fraftur
nach dev von Gottlob Inumanuel Breitfopf gegen 1750 gefdhnittenen Vreitbopf = Fraftur
Hevgeftellt von der Schriftgicfierei HH. Berthold AG in Leipyig.
Segments an extra line which results in noise.
Image sample:
hOCR result with options --psm 6 --oem 1:
Image with baselines (blue) and line bounding boxes (green):
The first line, as a single image, segments perfectly with the same --psm 6 --oem 1 options:
Expected Behavior:
Segmentation should work in simple and obvious cases. The example has no overlapping, skewed, warped or connected lines.
Suggested Fix:
The whole segmentation process needs a deep review and tests for the most common cases.
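One simple kind of test for such cases is a line-count check on the plain-text output. The sketch below is only an illustration of that idea, not an existing Tesseract test: the helper names are hypothetical, and the expected count of 3 for extraline.png is an assumption inferred from the output quoted in this issue (one noise line followed by three real text lines).

```python
def count_text_lines(ocr_text):
    """Count the non-blank lines in plain-text OCR output."""
    return sum(1 for line in ocr_text.splitlines() if line.strip())


def check_line_count(ocr_text, expected_lines):
    """Return (ok, found): whether the number of recognized lines matches."""
    found = count_text_lines(ocr_text)
    return found == expected_lines, found


# Plain-text output quoted in this issue for `tesseract extraline.png -`.
OCR_OUTPUT = """\
' ' ' 4 al (y
Original-Breitfopf-Fraftur
nach dev von Gottlob Inumanuel Breitfopf gegen 1750 gefdhnittenen Vreitbopf = Fraftur
Hevgeftellt von der Schriftgicfierei HH. Berthold AG in Leipyig.
"""

# Assumption: the sample image contains 3 text lines.
ok, found = check_line_count(OCR_OUTPUT, expected_lines=3)
print(ok, found)
```

A check like this only detects extra or missing lines. It says nothing about recognition quality, so it would complement rather than replace character-level accuracy tests.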