tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Orientation detection "asymmetrical" #4116

Open jbarth-ubhd opened 1 year ago

jbarth-ubhd commented 1 year ago

Current Behavior

I ran a two-column German text (portrait and landscape) through OCR at ImageMagick rotation angles of 0°, 90°, 180°, and 270°, each ± 3°, partially with an additional ±0.1° jitter.

PDF files (text from Wikipedia, CC BY-SA 4.0), converted to .tif (400 dpi, Group 4) using ImageMagick with the options -flatten and +repage: https://digi.ub.uni-heidelberg.de/diglitData/v/gt-portrait.pdf https://digi.ub.uni-heidelberg.de/diglitData/v/gt-landscape.pdf

OCR'd .tifs (tesseract: --psm 1): https://digi.ub.uni-heidelberg.de/diglitData/v/ocr-orientation-test.zip
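A rotation sweep like the one described above could be generated roughly as follows. This is only a sketch, not the script actually used for this test: the function names, the seeded random jitter, and the integer offsets are assumptions. Only the file name pattern (for example `tesseract-psm1-portrait-087.067.txt`) is taken from the table in this issue.

```python
import random


def test_angles(base_angles=(0, 90, 180, 270), max_offset=3, jitter=0.1, seed=0):
    """Return sorted rotation angles in [0, 360).

    Each base angle is combined with every integer offset in
    [-max_offset, max_offset] plus a small random jitter (assumed scheme).
    """
    rng = random.Random(seed)  # seeded so the sweep is reproducible
    angles = set()
    for base in base_angles:
        for offset in range(-max_offset, max_offset + 1):
            angle = base + offset + rng.uniform(-jitter, jitter)
            # Round first, then wrap, so 359.9996 becomes 0.0 rather than 360.0.
            angles.add(round(angle, 3) % 360)
    return sorted(angles)


def output_name(engine, layout, angle):
    """Build a file name such as 'tesseract-psm1-portrait-087.067.txt'."""
    return f"{engine}-{layout}-{angle:07.3f}.txt"


if __name__ == "__main__":
    for angle in test_angles():
        print(output_name("tesseract-psm1", "portrait", angle))
```

Each generated angle would then be passed to the image rotation step before OCR.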

The following table contains the number of errors (according to sdiff() of perl module Algorithm::Diff):

Columns: file name, OCR errors, occurrences of the word 'Murray', log size with textord_debug_bugs = 1
gt-landscape.txt 0 14 -
gt-portrait.txt 0 14 -
tesseract-default-landscape-000.000.txt 7 14 -
tesseract-default-landscape-090.000.txt 7 14 -
tesseract-default-landscape-180.000.txt 4321 0 -
tesseract-default-landscape-270.000.txt 4321 0 -
tesseract-default-portrait-000.000.txt 5 14 -
tesseract-default-portrait-090.000.txt 5 14 -
tesseract-default-portrait-180.000.txt 4330 0 -
tesseract-default-portrait-270.000.txt 4330 0 -
tesseract-psm1-landscape-000.000.txt 7 14 -
tesseract-psm1-landscape-000.087.txt 8 14 0
tesseract-psm1-landscape-000.914.txt 7 14 0
tesseract-psm1-landscape-002.000.txt 626 12 -
tesseract-psm1-landscape-002.083.txt 351 12 0
tesseract-psm1-landscape-003.024.txt 3530 0 1752
tesseract-psm1-landscape-087.097.txt 21 14 3108
tesseract-psm1-landscape-087.948.txt 10 14 413
tesseract-psm1-landscape-088.000.txt 9 14 -
tesseract-psm1-landscape-089.010.txt 8 14 0
tesseract-psm1-landscape-090.000.txt 7 14 -
tesseract-psm1-landscape-090.079.txt 8 14 0
tesseract-psm1-landscape-091.033.txt 7 14 0
tesseract-psm1-landscape-092.000.txt 558 12 -
tesseract-psm1-landscape-092.078.txt 840 12 1075
tesseract-psm1-landscape-093.028.txt 3043 11 2072
tesseract-psm1-landscape-176.932.txt 21 14 3739
tesseract-psm1-landscape-178.000.txt 9 14 -
tesseract-psm1-landscape-178.006.txt 10 14 0
tesseract-psm1-landscape-179.085.txt 9 14 0
tesseract-psm1-landscape-179.992.txt 7 14 0
tesseract-psm1-landscape-180.000.txt 8 14 -
tesseract-psm1-landscape-180.940.txt 7 14 0
tesseract-psm1-landscape-182.000.txt 499 13 -
tesseract-psm1-landscape-182.093.txt 1356 9 365
tesseract-psm1-landscape-182.961.txt 3054 3 981
tesseract-psm1-landscape-266.930.txt 18 14 3135
tesseract-psm1-landscape-268.000.txt 9 14 -
tesseract-psm1-landscape-268.031.txt 10 14 604
tesseract-psm1-landscape-268.968.txt 7 14 0
tesseract-psm1-landscape-270.000.txt 8 14 -
tesseract-psm1-landscape-270.019.txt 8 14 0
tesseract-psm1-landscape-270.949.txt 7 14 0
tesseract-psm1-landscape-271.924.txt 8 14 0
tesseract-psm1-landscape-272.000.txt 367 14 -
tesseract-psm1-landscape-273.021.txt 3203 1 933
tesseract-psm1-landscape-357.043.txt 16 14 2169
tesseract-psm1-landscape-358.000.txt 9 14 -
tesseract-psm1-landscape-358.071.txt 10 14 0
tesseract-psm1-landscape-359.049.txt 9 14 0
tesseract-psm1-portrait-000.000.txt 5 14 -
tesseract-psm1-portrait-000.925.txt 6 14 0
tesseract-psm1-portrait-001.970.txt 7 14 0
tesseract-psm1-portrait-002.000.txt 8 14 -
tesseract-psm1-portrait-003.025.txt 368 13 0
tesseract-psm1-portrait-087.067.txt 22 14 1448
tesseract-psm1-portrait-088.000.txt 9 14 -
tesseract-psm1-portrait-088.039.txt 10 14 0
tesseract-psm1-portrait-088.930.txt 5 14 0
tesseract-psm1-portrait-090.000.txt 5 14 -
tesseract-psm1-portrait-090.056.txt 5 14 0
tesseract-psm1-portrait-090.973.txt 5 14 0
tesseract-psm1-portrait-092.000.txt 8 14 -
tesseract-psm1-portrait-092.087.txt 10 14 0
tesseract-psm1-portrait-092.929.txt 713 11 0
tesseract-psm1-portrait-177.070.txt 23 14 638
tesseract-psm1-portrait-177.924.txt 8 14 0
tesseract-psm1-portrait-178.000.txt 8 14 -
tesseract-psm1-portrait-179.064.txt 5 14 0
tesseract-psm1-portrait-179.979.txt 5 14 0
tesseract-psm1-portrait-180.000.txt 5 14 -
tesseract-psm1-portrait-181.081.txt 8 14 0
tesseract-psm1-portrait-181.930.txt 15 14 0
tesseract-psm1-portrait-182.000.txt 8 14 -
tesseract-psm1-portrait-183.081.txt 894 11 572
tesseract-psm1-portrait-267.005.txt 28 14 793
tesseract-psm1-portrait-267.958.txt 10 14 0
tesseract-psm1-portrait-268.000.txt 8 14 -
tesseract-psm1-portrait-269.070.txt 6 14 0
tesseract-psm1-portrait-269.970.txt 5 14 0
tesseract-psm1-portrait-270.000.txt 5 14 -
tesseract-psm1-portrait-271.071.txt 5 14 0
tesseract-psm1-portrait-272.000.txt 8 14 -
tesseract-psm1-portrait-272.048.txt 8 14 0
tesseract-psm1-portrait-272.927.txt 324 13 0
tesseract-psm1-portrait-356.956.txt 30 14 979
tesseract-psm1-portrait-357.983.txt 9 14 0
tesseract-psm1-portrait-358.000.txt 8 14 -
tesseract-psm1-portrait-359.085.txt 6 14 0
tesseract-psm1-portrait-359.926.txt 5 14 0
abbyy-default-landscape-000.000.txt 2 14 -
abbyy-default-landscape-090.000.txt 3262 1 -
abbyy-default-landscape-180.000.txt 4028 0 -
abbyy-default-landscape-270.000.txt 4444 1 -
abbyy-default-portrait-000.000.txt 1 14 -
abbyy-default-portrait-090.000.txt 3485 0 -
abbyy-default-portrait-180.000.txt 3960 0 -
abbyy-default-portrait-270.000.txt 3479 0 -
abbyy-detectImageOrientation-landscape-000.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-000.087.txt 0 14 -
abbyy-detectImageOrientation-landscape-000.914.txt 0 14 -
abbyy-detectImageOrientation-landscape-002.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-002.083.txt 0 14 -
abbyy-detectImageOrientation-landscape-003.024.txt 0 14 -
abbyy-detectImageOrientation-landscape-087.097.txt 0 14 -
abbyy-detectImageOrientation-landscape-087.948.txt 0 14 -
abbyy-detectImageOrientation-landscape-088.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-089.010.txt 1 14 -
abbyy-detectImageOrientation-landscape-090.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-090.079.txt 0 14 -
abbyy-detectImageOrientation-landscape-091.033.txt 0 14 -
abbyy-detectImageOrientation-landscape-092.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-092.078.txt 1 14 -
abbyy-detectImageOrientation-landscape-093.028.txt 1 14 -
abbyy-detectImageOrientation-landscape-176.932.txt 0 14 -
abbyy-detectImageOrientation-landscape-178.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-178.006.txt 2 14 -
abbyy-detectImageOrientation-landscape-179.085.txt 1 14 -
abbyy-detectImageOrientation-landscape-179.992.txt 0 14 -
abbyy-detectImageOrientation-landscape-180.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-180.940.txt 0 14 -
abbyy-detectImageOrientation-landscape-182.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-182.093.txt 0 14 -
abbyy-detectImageOrientation-landscape-182.961.txt 0 14 -
abbyy-detectImageOrientation-landscape-266.930.txt 0 14 -
abbyy-detectImageOrientation-landscape-268.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-268.031.txt 0 14 -
abbyy-detectImageOrientation-landscape-268.968.txt 0 14 -
abbyy-detectImageOrientation-landscape-270.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-270.019.txt 0 14 -
abbyy-detectImageOrientation-landscape-270.949.txt 1 14 -
abbyy-detectImageOrientation-landscape-271.924.txt 1 14 -
abbyy-detectImageOrientation-landscape-272.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-273.021.txt 1 14 -
abbyy-detectImageOrientation-landscape-357.043.txt 0 14 -
abbyy-detectImageOrientation-landscape-358.000.txt 1 14 -
abbyy-detectImageOrientation-landscape-358.071.txt 1 14 -
abbyy-detectImageOrientation-landscape-359.049.txt 1 14 -
abbyy-detectImageOrientation-portrait-000.000.txt 1 14 -
abbyy-detectImageOrientation-portrait-000.925.txt 0 14 -
abbyy-detectImageOrientation-portrait-001.970.txt 1 14 -
abbyy-detectImageOrientation-portrait-002.000.txt 0 14 -
abbyy-detectImageOrientation-portrait-003.025.txt 0 14 -
abbyy-detectImageOrientation-portrait-087.067.txt 3 14 -
abbyy-detectImageOrientation-portrait-088.000.txt 0 14 -
abbyy-detectImageOrientation-portrait-088.039.txt 0 14 -
abbyy-detectImageOrientation-portrait-088.930.txt 0 14 -
abbyy-detectImageOrientation-portrait-090.000.txt 1 14 -
abbyy-detectImageOrientation-portrait-090.056.txt 0 14 -
abbyy-detectImageOrientation-portrait-090.973.txt 0 14 -
abbyy-detectImageOrientation-portrait-092.000.txt 0 14 -
abbyy-detectImageOrientation-portrait-092.087.txt 0 14 -
abbyy-detectImageOrientation-portrait-092.929.txt 1 14 -
abbyy-detectImageOrientation-portrait-177.070.txt 0 14 -
abbyy-detectImageOrientation-portrait-177.924.txt 2 14 -
abbyy-detectImageOrientation-portrait-178.000.txt 0 14 -
abbyy-detectImageOrientation-portrait-179.064.txt 0 14 -
abbyy-detectImageOrientation-portrait-179.979.txt 0 14 -
abbyy-detectImageOrientation-portrait-180.000.txt 1 14 -
abbyy-detectImageOrientation-portrait-181.081.txt 0 14 -
abbyy-detectImageOrientation-portrait-181.930.txt 0 14 -
abbyy-detectImageOrientation-portrait-182.000.txt 2 14 -
abbyy-detectImageOrientation-portrait-183.081.txt 2 14 -
abbyy-detectImageOrientation-portrait-267.005.txt 2 14 -
abbyy-detectImageOrientation-portrait-267.958.txt 0 14 -
abbyy-detectImageOrientation-portrait-268.000.txt 0 14 -
abbyy-detectImageOrientation-portrait-269.070.txt 0 14 -
abbyy-detectImageOrientation-portrait-269.970.txt 1 14 -
abbyy-detectImageOrientation-portrait-270.000.txt 1 14 -
abbyy-detectImageOrientation-portrait-271.071.txt 0 14 -
abbyy-detectImageOrientation-portrait-272.000.txt 2 14 -
abbyy-detectImageOrientation-portrait-272.048.txt 0 14 -
abbyy-detectImageOrientation-portrait-272.927.txt 1 14 -
abbyy-detectImageOrientation-portrait-356.956.txt 2 14 -
abbyy-detectImageOrientation-portrait-357.983.txt 0 14 -
abbyy-detectImageOrientation-portrait-358.000.txt 0 14 -
abbyy-detectImageOrientation-portrait-359.085.txt 3 14 -
abbyy-detectImageOrientation-portrait-359.926.txt 0 14 -
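The error counts in the table come from sdiff() of the Perl module Algorithm::Diff. For readers who want to reproduce a similar metric without Perl, here is a rough Python analogue based on `difflib`. It is a sketch under assumptions: it compares whitespace-separated words and counts each differing word position as one error, which may not match the exact counting used for the table, and the helper names are hypothetical.

```python
from difflib import SequenceMatcher


def count_word_errors(ground_truth: str, ocr_text: str) -> int:
    """Count word-level differences between ground truth and OCR output."""
    gt_words = ground_truth.split()
    ocr_words = ocr_text.split()
    # autojunk=False keeps frequent words from being ignored on long texts.
    matcher = SequenceMatcher(a=gt_words, b=ocr_words, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced, inserted, or deleted run counts as its longer side.
            errors += max(i2 - i1, j2 - j1)
    return errors


def count_occurrences(text: str, word: str = "Murray") -> int:
    """Count how often a marker word appears (plain substring count)."""
    return text.count(word)


if __name__ == "__main__":
    gt = "Murray wrote the first draft"
    ocr = "Murray wrote tbe first"
    print(count_word_errors(gt, ocr), count_occurrences(ocr))
```

The marker-word count is a quick sanity check: a page that was read upside down typically loses every occurrence of the word.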

Expected Behavior

I expected an 87° rotation to produce nearly the same number of errors as a 93° rotation, but 93° produces far more errors than 87°. The same holds around 0°, 180°, and 270°.
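To make the asymmetry easier to see, each result can be keyed by its signed offset from the nearest multiple of 90°. The sketch below does this for four rows copied from the table above; the function name and the file name parsing are assumptions for illustration.

```python
import re


def signed_offset(filename: str) -> float:
    """Return the rotation offset in degrees from the nearest multiple of 90."""
    match = re.search(r"-(\d{3}\.\d{3})\.txt$", filename)
    if match is None:
        raise ValueError(f"no angle found in {filename!r}")
    angle = float(match.group(1))
    # Shift by 45 so that e.g. 357.043 maps to about -2.957 and 93.028 to +3.028.
    return (angle + 45.0) % 90.0 - 45.0


# Error counts copied from the table in this issue.
rows = {
    "tesseract-psm1-landscape-087.097.txt": 21,
    "tesseract-psm1-landscape-093.028.txt": 3043,
    "tesseract-psm1-landscape-357.043.txt": 16,
    "tesseract-psm1-landscape-003.024.txt": 3530,
}

if __name__ == "__main__":
    for name, errors in sorted(rows.items(), key=lambda item: signed_offset(item[0])):
        print(f"{signed_offset(name):+7.3f}  {errors:5d}  {name}")
```

In these four rows, offsets near -3° stay at a few dozen errors while offsets near +3° reach several thousand.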

(Abbyy is much better at this, btw)

Suggested Fix

none

tesseract -v

tesseract 5.3.1 leptonica-1.79.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1 Found AVX512BW Found AVX512F Found AVX2 Found AVX Found FMA Found SSE4.1 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4 Found libcurl/7.68.0 NSS/3.49.1 zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

Operating System

Ubuntu 20.04 Focal

Other Operating System

No response

uname -a

Linux XXXXXXX 5.4.0-155-generic #172-Ubuntu SMP Fri Jul 7 16:10:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Compiler

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

CPU

Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Virtualization / Containers

no

Other Information

No response

stweil commented 1 year ago

Thank you for that test. Maybe that issue is related to #3021. Could you please try running tesseract with -c textord_debug_bugs=1? If that prints error messages, then it is.
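A minimal sketch of how such a run could be scripted and its log size recorded. The `-c textord_debug_bugs=1`, `--psm` and `-l` options are the ones mentioned in this thread; the helper names, the file paths, and the assumption that the debug messages arrive on stderr are mine.

```python
import subprocess


def build_tesseract_command(image_path, output_base, lang="deu", psm=1, debug_bugs=True):
    """Assemble the tesseract command line as a list of arguments."""
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]
    if debug_bugs:
        cmd += ["-c", "textord_debug_bugs=1"]
    return cmd


def run_and_capture_log(image_path, output_base, log_path):
    """Run tesseract, save its stderr to log_path, and return the log size in bytes."""
    cmd = build_tesseract_command(image_path, output_base)
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(log_path, "w", encoding="utf-8") as log:
        log.write(result.stderr)  # assumed: debug messages go to stderr
    return len(result.stderr.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    print(" ".join(build_tesseract_command("page-093.028.tif", "page-093.028")))
```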

jbarth-ubhd commented 1 year ago

https://digi.ub.uni-heidelberg.de/diglitData/v/ocr-orientation-test--logs.zip . Many of them are 0 bytes?!

jbarth-ubhd commented 1 year ago

I added the log file size to the table. It does not correlate with the error count.

amitdo commented 1 year ago

The hocr output contains the skew angle of the text lines. You can try to use this info and manually reskew the image and then rerun Tesseract.
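One way to read a skew estimate out of hOCR, sketched under assumptions: Tesseract's hOCR line elements carry a `baseline <slope> <offset>` property in their `title` attribute, and the median slope over all lines is taken here as the page skew. The function name, the use of the median, and the sign convention are assumptions; check the sign against your rotation tool before deskewing.

```python
import math
import re
from statistics import median

# Matches e.g. "baseline 0.015 -18" inside an hOCR title attribute.
BASELINE_RE = re.compile(r"baseline\s+(-?\d*\.?\d+)\s+-?\d*\.?\d+")


def estimate_skew_degrees(hocr_text):
    """Return the median text line angle in degrees, or None if no baseline is found."""
    slopes = [float(m.group(1)) for m in BASELINE_RE.finditer(hocr_text)]
    if not slopes:
        return None
    return math.degrees(math.atan(median(slopes)))


if __name__ == "__main__":
    sample = (
        "<span class='ocr_line' title='bbox 36 92 580 122; baseline 0.01 -18'></span>"
        "<span class='ocr_line' title='bbox 36 130 580 160; baseline 0.02 -17'></span>"
        "<span class='ocr_line' title='bbox 36 170 580 200; baseline 0.03 -19'></span>"
    )
    print(estimate_skew_degrees(sample))
```

The estimated angle can then be used to rotate the image in the opposite direction before rerunning Tesseract.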

Balearica commented 10 months ago

#4070 allows for retrieving the skew calculated by Tesseract without running recognition. If you use this information to rotate the page, you will find that this closes most of the accuracy gap between Tesseract and Abbyy.

It has been my experience that Abbyy blows Tesseract out of the water in real-world usage; however, this is 90% attributable to the fact that Abbyy automatically corrects skew but Tesseract does not. If you rotate each image by the skew angle calculated by Tesseract prior to running Tesseract recognition, Tesseract performs (almost) comparably to Abbyy on high-quality documents.

zdenop commented 10 months ago

Image preprocessing (including deskewing) has been a suggested technique in the Tesseract docs for a year...

jbarth-ubhd commented 10 months ago

Perhaps the asymmetry in recognition quality between positive and negative angles simply has to do with the traineddata model?

amitdo commented 10 months ago

Did you try both the fast and best models?

jbarth-ubhd commented 10 months ago

I've only used deu.traineddata (md5sum f5488b7c3186e822e0e6c5c05c1aaf1f, size 15437534).

jbarth-ubhd commented 9 months ago

I tend to close this issue, and I think it is important to remind users that Tesseract performs no deskewing.

jbarth-ubhd commented 9 months ago

Error count for tesseract 5.3.3 (-l deu) with angles from -5 to +5 degrees (positive = clockwise) on the first page of this PDF, rendered at 400 dpi in black and white: https://digi.ub.uni-heidelberg.de/diglitData/v/layout-fouche.pdf

It seems that the primary segmentation has problems with rotated images.

[Figure "angle1": error count by rotation angle]

Zoom to -1.5 to +1.5 degrees: [Figure "angle2"]