Open jbarth-ubhd opened 1 year ago
Thank you for that test. Maybe that issue is related to #3021. Could you please try running tesseract with -c textord_debug_bugs=1? If that prints error messages, then it is.
https://digi.ub.uni-heidelberg.de/diglitData/v/ocr-orientation-test--logs.zip . A lot of the log files are 0 bytes?!
Added log file size to table. Does not correlate.
The hOCR output contains the skew angle of the text lines. You can try using this information to deskew the image manually and then rerun Tesseract.
It has been my experience that Abbyy blows Tesseract out of the water in real-world usage; however, this is 90% attributable to the fact that Abbyy automatically corrects skew but Tesseract does not. If you rotate each image by the skew angle calculated by Tesseract before running Tesseract recognition, Tesseract performs (almost) comparably to Abbyy on high-quality documents.
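As a rough illustration of the "read the skew from hOCR, then rotate" idea, here is a minimal Python sketch. It assumes the hOCR text lines carry a `baseline <slope> <offset>` property in their `title` attribute and takes the median of the per-line slopes as the page skew. The function name `estimate_skew_degrees` is hypothetical, and the sign convention (which direction to rotate to undo the skew) is an assumption that should be checked on a sample page.

```python
import math
import re
import statistics

# Matches the "baseline <slope> <offset>" property inside an hOCR title attribute.
BASELINE_RE = re.compile(r"baseline\s+(-?\d*\.?\d+)\s+(-?\d*\.?\d+)")


def estimate_skew_degrees(hocr_text):
    """Return the median text-line baseline angle in degrees.

    Returns None when no baseline properties are found.
    Assumption: the first number after "baseline" is the line's slope.
    """
    angles = [
        math.degrees(math.atan(float(match.group(1))))
        for match in BASELINE_RE.finditer(hocr_text)
    ]
    if not angles:
        return None
    # The median is less sensitive to a few badly segmented lines than the mean.
    return statistics.median(angles)
```

The returned angle could then be passed to an image tool to rotate the page before a second Tesseract run. Whether that actually closes the gap described above would need to be measured.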
Image preprocessing (including deskewing) has been a technique suggested by the Tesseract docs for a year...
Perhaps the asymmetry in recognition quality between positive and negative angles simply has to do with the traineddata model?
Did you try both the fast and best models?
I've used only deu.traineddata md5sum f5488b7c3186e822e0e6c5c05c1aaf1f size 15437534
I tend to close this issue, and I think it is important to remind users that Tesseract performs no deskewing.
Error count for tesseract 5.3.3 (-l deu) with angles from -5 to +5 degrees (positive=clockwise) on the first page of this https://digi.ub.uni-heidelberg.de/diglitData/v/layout-fouche.pdf (400 dpi rendered b/w)
Seems that primary segmentation has problems with rotated images.
zoom to -1.5 to +1.5 degrees:
Current Behavior
I ran a two-column German text (portrait + landscape) at (ImageMagick) angles 0°, 90°, 180°, 270°, each ±3°, partially with ±0.1° jitter.
PDF files (converted to .tif (400 dpi, group4, using ImageMagick with options -flatten +repage)) (Text from Wikipedia CC BY-SA 4.0): https://digi.ub.uni-heidelberg.de/diglitData/v/gt-portrait.pdf https://digi.ub.uni-heidelberg.de/diglitData/v/gt-landscape.pdf
OCR'd .tifs (tesseract: --psm 1): https://digi.ub.uni-heidelberg.de/diglitData/v/ocr-orientation-test.zip
The following table contains the number of errors (according to sdiff() of perl module Algorithm::Diff):
Expected Behavior
I expected that an 87° rotation would produce nearly the same number of errors as 93°, but 93° produces far more errors than 87°. The same holds around 0°, 180°, and 270°.
(Abbyy is much better at this, btw)
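The error counts in this report come from sdiff() of Perl's Algorithm::Diff. For anyone who wants to reproduce a comparable number without Perl, the following is a rough Python analogue based on `difflib.SequenceMatcher`. It is not the script used for this report. The function name `count_errors` is hypothetical, and comparing character by character is an assumption, since the unit of comparison is not stated here.

```python
import difflib


def count_errors(ground_truth, ocr_text):
    """Count differing characters between ground truth and OCR output.

    Each non-matching region contributes the length of its longer side,
    so a replaced character counts once rather than as delete plus insert.
    """
    matcher = difflib.SequenceMatcher(None, ground_truth, ocr_text, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors += max(i2 - i1, j2 - j1)
    return errors
```

Running this on the OCR output for 87° and 93° against the same ground-truth text would give two numbers that can be compared directly. The absolute values may differ from the Perl-based counts.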
Suggested Fix
none
tesseract -v
tesseract 5.3.1
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 NSS/3.49.1 zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Operating System
Ubuntu 20.04 Focal
Other Operating System
No response
uname -a
Linux XXXXXXX 5.4.0-155-generic #172-Ubuntu SMP Fri Jul 7 16:10:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Compiler
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
CPU
Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Virtualization / Containers
no
Other Information
No response