jbarth-ubhd commented 1 year ago

Current Behavior

I ran a two-column German text (portrait and landscape) at (ImageMagick) angles of 0°, 90°, 180°, and 270°, each ± 3°, partly with ±0.1° jitter.
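For anyone who wants to reproduce a similar set of angles, here is a minimal sketch. It is not the script used for this report: the function names, the fixed seed, and the choice to apply jitter to every angle (the report only jittered some of them) are assumptions.

```python
import random


def make_test_angles(bases=(0, 90, 180, 270), max_offset=3, jitter=0.1, seed=0):
    """Return rotation angles in [0, 360).

    Each base angle is combined with every integer offset in
    [-max_offset, +max_offset] plus a uniform jitter of +/- `jitter` degrees.
    """
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    angles = []
    for base in bases:
        for offset in range(-max_offset, max_offset + 1):
            angle = base + offset + rng.uniform(-jitter, jitter)
            # Round first, then wrap, so negative angles land just below 360.
            angles.append(round(angle, 3) % 360)
    return sorted(angles)


def angle_tag(angle):
    """Format an angle like the file names in the table, e.g. 87.097 -> '087.097'."""
    return f"{angle:07.3f}"
```

Each generated angle can then be passed to whatever rotation tool is used to prepare the test images.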

PDF files (converted to .tif at 400 dpi, Group 4 compression, using ImageMagick with the options -flatten and +repage; text from Wikipedia, CC BY-SA 4.0): https://digi.ub.uni-heidelberg.de/diglitData/v/gt-portrait.pdf https://digi.ub.uni-heidelberg.de/diglitData/v/gt-landscape.pdf

OCR'd .tifs (tesseract: --psm 1): https://digi.ub.uni-heidelberg.de/diglitData/v/ocr-orientation-test.zip

The following table contains the number of errors (according to sdiff() from the Perl module Algorithm::Diff):

OCR output file	errors	occurrences of the word 'Murray'	log size with textord_debug_bugs=1
gt-landscape.txt	0	14	-
gt-portrait.txt	0	14	-
tesseract-default-landscape-000.000.txt	7	14	-
tesseract-default-landscape-090.000.txt	7	14	-
tesseract-default-landscape-180.000.txt	4321	0	-
tesseract-default-landscape-270.000.txt	4321	0	-
tesseract-default-portrait-000.000.txt	5	14	-
tesseract-default-portrait-090.000.txt	5	14	-
tesseract-default-portrait-180.000.txt	4330	0	-
tesseract-default-portrait-270.000.txt	4330	0	-
tesseract-psm1-landscape-000.000.txt	7	14	-
tesseract-psm1-landscape-000.087.txt	8	14	0
tesseract-psm1-landscape-000.914.txt	7	14	0
tesseract-psm1-landscape-002.000.txt	626	12	-
tesseract-psm1-landscape-002.083.txt	351	12	0
tesseract-psm1-landscape-003.024.txt	3530	0	1752
tesseract-psm1-landscape-087.097.txt	21	14	3108
tesseract-psm1-landscape-087.948.txt	10	14	413
tesseract-psm1-landscape-088.000.txt	9	14	-
tesseract-psm1-landscape-089.010.txt	8	14	0
tesseract-psm1-landscape-090.000.txt	7	14	-
tesseract-psm1-landscape-090.079.txt	8	14	0
tesseract-psm1-landscape-091.033.txt	7	14	0
tesseract-psm1-landscape-092.000.txt	558	12	-
tesseract-psm1-landscape-092.078.txt	840	12	1075
tesseract-psm1-landscape-093.028.txt	3043	11	2072
tesseract-psm1-landscape-176.932.txt	21	14	3739
tesseract-psm1-landscape-178.000.txt	9	14	-
tesseract-psm1-landscape-178.006.txt	10	14	0
tesseract-psm1-landscape-179.085.txt	9	14	0
tesseract-psm1-landscape-179.992.txt	7	14	0
tesseract-psm1-landscape-180.000.txt	8	14	-
tesseract-psm1-landscape-180.940.txt	7	14	0
tesseract-psm1-landscape-182.000.txt	499	13	-
tesseract-psm1-landscape-182.093.txt	1356	9	365
tesseract-psm1-landscape-182.961.txt	3054	3	981
tesseract-psm1-landscape-266.930.txt	18	14	3135
tesseract-psm1-landscape-268.000.txt	9	14	-
tesseract-psm1-landscape-268.031.txt	10	14	604
tesseract-psm1-landscape-268.968.txt	7	14	0
tesseract-psm1-landscape-270.000.txt	8	14	-
tesseract-psm1-landscape-270.019.txt	8	14	0
tesseract-psm1-landscape-270.949.txt	7	14	0
tesseract-psm1-landscape-271.924.txt	8	14	0
tesseract-psm1-landscape-272.000.txt	367	14	-
tesseract-psm1-landscape-273.021.txt	3203	1	933
tesseract-psm1-landscape-357.043.txt	16	14	2169
tesseract-psm1-landscape-358.000.txt	9	14	-
tesseract-psm1-landscape-358.071.txt	10	14	0
tesseract-psm1-landscape-359.049.txt	9	14	0
tesseract-psm1-portrait-000.000.txt	5	14	-
tesseract-psm1-portrait-000.925.txt	6	14	0
tesseract-psm1-portrait-001.970.txt	7	14	0
tesseract-psm1-portrait-002.000.txt	8	14	-
tesseract-psm1-portrait-003.025.txt	368	13	0
tesseract-psm1-portrait-087.067.txt	22	14	1448
tesseract-psm1-portrait-088.000.txt	9	14	-
tesseract-psm1-portrait-088.039.txt	10	14	0
tesseract-psm1-portrait-088.930.txt	5	14	0
tesseract-psm1-portrait-090.000.txt	5	14	-
tesseract-psm1-portrait-090.056.txt	5	14	0
tesseract-psm1-portrait-090.973.txt	5	14	0
tesseract-psm1-portrait-092.000.txt	8	14	-
tesseract-psm1-portrait-092.087.txt	10	14	0
tesseract-psm1-portrait-092.929.txt	713	11	0
tesseract-psm1-portrait-177.070.txt	23	14	638
tesseract-psm1-portrait-177.924.txt	8	14	0
tesseract-psm1-portrait-178.000.txt	8	14	-
tesseract-psm1-portrait-179.064.txt	5	14	0
tesseract-psm1-portrait-179.979.txt	5	14	0
tesseract-psm1-portrait-180.000.txt	5	14	-
tesseract-psm1-portrait-181.081.txt	8	14	0
tesseract-psm1-portrait-181.930.txt	15	14	0
tesseract-psm1-portrait-182.000.txt	8	14	-
tesseract-psm1-portrait-183.081.txt	894	11	572
tesseract-psm1-portrait-267.005.txt	28	14	793
tesseract-psm1-portrait-267.958.txt	10	14	0
tesseract-psm1-portrait-268.000.txt	8	14	-
tesseract-psm1-portrait-269.070.txt	6	14	0
tesseract-psm1-portrait-269.970.txt	5	14	0
tesseract-psm1-portrait-270.000.txt	5	14	-
tesseract-psm1-portrait-271.071.txt	5	14	0
tesseract-psm1-portrait-272.000.txt	8	14	-
tesseract-psm1-portrait-272.048.txt	8	14	0
tesseract-psm1-portrait-272.927.txt	324	13	0
tesseract-psm1-portrait-356.956.txt	30	14	979
tesseract-psm1-portrait-357.983.txt	9	14	0
tesseract-psm1-portrait-358.000.txt	8	14	-
tesseract-psm1-portrait-359.085.txt	6	14	0
tesseract-psm1-portrait-359.926.txt	5	14	0
abbyy-default-landscape-000.000.txt	2	14	-
abbyy-default-landscape-090.000.txt	3262	1	-
abbyy-default-landscape-180.000.txt	4028	0	-
abbyy-default-landscape-270.000.txt	4444	1	-
abbyy-default-portrait-000.000.txt	1	14	-
abbyy-default-portrait-090.000.txt	3485	0	-
abbyy-default-portrait-180.000.txt	3960	0	-
abbyy-default-portrait-270.000.txt	3479	0	-
abbyy-detectImageOrientation-landscape-000.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-000.087.txt	0	14	-
abbyy-detectImageOrientation-landscape-000.914.txt	0	14	-
abbyy-detectImageOrientation-landscape-002.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-002.083.txt	0	14	-
abbyy-detectImageOrientation-landscape-003.024.txt	0	14	-
abbyy-detectImageOrientation-landscape-087.097.txt	0	14	-
abbyy-detectImageOrientation-landscape-087.948.txt	0	14	-
abbyy-detectImageOrientation-landscape-088.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-089.010.txt	1	14	-
abbyy-detectImageOrientation-landscape-090.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-090.079.txt	0	14	-
abbyy-detectImageOrientation-landscape-091.033.txt	0	14	-
abbyy-detectImageOrientation-landscape-092.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-092.078.txt	1	14	-
abbyy-detectImageOrientation-landscape-093.028.txt	1	14	-
abbyy-detectImageOrientation-landscape-176.932.txt	0	14	-
abbyy-detectImageOrientation-landscape-178.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-178.006.txt	2	14	-
abbyy-detectImageOrientation-landscape-179.085.txt	1	14	-
abbyy-detectImageOrientation-landscape-179.992.txt	0	14	-
abbyy-detectImageOrientation-landscape-180.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-180.940.txt	0	14	-
abbyy-detectImageOrientation-landscape-182.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-182.093.txt	0	14	-
abbyy-detectImageOrientation-landscape-182.961.txt	0	14	-
abbyy-detectImageOrientation-landscape-266.930.txt	0	14	-
abbyy-detectImageOrientation-landscape-268.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-268.031.txt	0	14	-
abbyy-detectImageOrientation-landscape-268.968.txt	0	14	-
abbyy-detectImageOrientation-landscape-270.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-270.019.txt	0	14	-
abbyy-detectImageOrientation-landscape-270.949.txt	1	14	-
abbyy-detectImageOrientation-landscape-271.924.txt	1	14	-
abbyy-detectImageOrientation-landscape-272.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-273.021.txt	1	14	-
abbyy-detectImageOrientation-landscape-357.043.txt	0	14	-
abbyy-detectImageOrientation-landscape-358.000.txt	1	14	-
abbyy-detectImageOrientation-landscape-358.071.txt	1	14	-
abbyy-detectImageOrientation-landscape-359.049.txt	1	14	-
abbyy-detectImageOrientation-portrait-000.000.txt	1	14	-
abbyy-detectImageOrientation-portrait-000.925.txt	0	14	-
abbyy-detectImageOrientation-portrait-001.970.txt	1	14	-
abbyy-detectImageOrientation-portrait-002.000.txt	0	14	-
abbyy-detectImageOrientation-portrait-003.025.txt	0	14	-
abbyy-detectImageOrientation-portrait-087.067.txt	3	14	-
abbyy-detectImageOrientation-portrait-088.000.txt	0	14	-
abbyy-detectImageOrientation-portrait-088.039.txt	0	14	-
abbyy-detectImageOrientation-portrait-088.930.txt	0	14	-
abbyy-detectImageOrientation-portrait-090.000.txt	1	14	-
abbyy-detectImageOrientation-portrait-090.056.txt	0	14	-
abbyy-detectImageOrientation-portrait-090.973.txt	0	14	-
abbyy-detectImageOrientation-portrait-092.000.txt	0	14	-
abbyy-detectImageOrientation-portrait-092.087.txt	0	14	-
abbyy-detectImageOrientation-portrait-092.929.txt	1	14	-
abbyy-detectImageOrientation-portrait-177.070.txt	0	14	-
abbyy-detectImageOrientation-portrait-177.924.txt	2	14	-
abbyy-detectImageOrientation-portrait-178.000.txt	0	14	-
abbyy-detectImageOrientation-portrait-179.064.txt	0	14	-
abbyy-detectImageOrientation-portrait-179.979.txt	0	14	-
abbyy-detectImageOrientation-portrait-180.000.txt	1	14	-
abbyy-detectImageOrientation-portrait-181.081.txt	0	14	-
abbyy-detectImageOrientation-portrait-181.930.txt	0	14	-
abbyy-detectImageOrientation-portrait-182.000.txt	2	14	-
abbyy-detectImageOrientation-portrait-183.081.txt	2	14	-
abbyy-detectImageOrientation-portrait-267.005.txt	2	14	-
abbyy-detectImageOrientation-portrait-267.958.txt	0	14	-
abbyy-detectImageOrientation-portrait-268.000.txt	0	14	-
abbyy-detectImageOrientation-portrait-269.070.txt	0	14	-
abbyy-detectImageOrientation-portrait-269.970.txt	1	14	-
abbyy-detectImageOrientation-portrait-270.000.txt	1	14	-
abbyy-detectImageOrientation-portrait-271.071.txt	0	14	-
abbyy-detectImageOrientation-portrait-272.000.txt	2	14	-
abbyy-detectImageOrientation-portrait-272.048.txt	0	14	-
abbyy-detectImageOrientation-portrait-272.927.txt	1	14	-
abbyy-detectImageOrientation-portrait-356.956.txt	2	14	-
abbyy-detectImageOrientation-portrait-357.983.txt	0	14	-
abbyy-detectImageOrientation-portrait-358.000.txt	0	14	-
abbyy-detectImageOrientation-portrait-359.085.txt	3	14	-
abbyy-detectImageOrientation-portrait-359.926.txt	0	14	-
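
The counts above came from Perl's Algorithm::Diff. As a rough Python equivalent, the sketch below uses `difflib` instead, which is a different matching algorithm, so its numbers may not match the table exactly. Tokenising by whitespace and the whole-word regex for 'Murray' are assumptions about how the counting was done.

```python
import re
from difflib import SequenceMatcher


def count_errors(reference, hypothesis):
    """Count differing items between two token sequences.

    Each changed, added, or removed item counts as one error, similar to
    counting the non-matching rows of an sdiff-style listing.
    """
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    errors = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replaced span of unequal length pairs items up as far as
            # possible; the remainder are pure insertions or deletions.
            errors += max(i2 - i1, j2 - j1)
    return errors


def count_word(text, word="Murray"):
    """Count whole-word occurrences of `word` in `text`."""
    return len(re.findall(rf"\b{re.escape(word)}\b", text))
```

Usage would be along the lines of `count_errors(gt_text.split(), ocr_text.split())`, with both texts read from the ground-truth and OCR output files.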

Expected Behavior

I expected an 87° rotation to produce nearly the same number of errors as a 93° rotation, but 93° has far more errors than 87°. The same holds for deviations around 0°, 180°, and 270°.
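
To make the asymmetry easier to see, the angles can be reduced to a signed deviation from the nearest multiple of 90°. The sketch below does that for a few tesseract-psm1-landscape rows copied from the table above; the helper name `signed_offset` is made up for illustration.

```python
def signed_offset(angle):
    """Deviation in degrees from the nearest multiple of 90, in [-45, 45)."""
    return (angle + 45.0) % 90.0 - 45.0


# (angle, errors) pairs copied from the tesseract-psm1-landscape rows above
rows = [
    (87.097, 21), (93.028, 3043),
    (176.932, 21), (182.961, 3054),
    (266.930, 18), (273.021, 3203),
    (357.043, 16), (3.024, 3530),
]

for angle, errors in sorted(rows, key=lambda r: signed_offset(r[0])):
    print(f"{angle:7.3f}  offset {signed_offset(angle):+6.3f}  errors {errors}")
```

For these rows, every negative offset of about 3° stays below 100 errors, while every positive offset of about 3° exceeds 3000.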

(Abbyy is much better at this, btw)

Suggested Fix

none

tesseract -v

tesseract 5.3.1 leptonica-1.79.0 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1 Found AVX512BW Found AVX512F Found AVX2 Found AVX Found FMA Found SSE4.1 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4 Found libcurl/7.68.0 NSS/3.49.1 zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

Operating System

Ubuntu 20.04 Focal

Other Operating System

No response

uname -a

Linux XXXXXXX 5.4.0-155-generic #172-Ubuntu SMP Fri Jul 7 16:10:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Compiler

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

CPU

Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Virtualization / Containers

no

Other Information

No response

stweil commented 1 year ago

Thank you for that test. Maybe that issue is related to #3021. Could you please try running tesseract with -c textord_debug_bugs=1? If that prints error messages, then it is.

jbarth-ubhd commented 1 year ago

Logs: https://digi.ub.uni-heidelberg.de/diglitData/v/ocr-orientation-test--logs.zip . A lot of them are 0 bytes?!

jbarth-ubhd commented 1 year ago

I added the log file size to the table. It does not correlate with the error count.

amitdo commented 1 year ago

The hOCR output contains the skew angle of the text lines. You can try to use this information to deskew the image manually and then rerun Tesseract.
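
A minimal sketch of reading that angle from hOCR text, assuming each `ocr_line` title carries a `baseline <slope> <offset>` property and that the median slope is a fair estimate of page skew. The function name is hypothetical, and the sign convention (which direction counts as positive) should be checked against a known rotated image before relying on it.

```python
import math
import re
from statistics import median

# Matches the "baseline <slope> <offset>" property inside a title attribute.
BASELINE_RE = re.compile(r"baseline\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def median_skew_degrees(hocr_text):
    """Return the median baseline angle in degrees, or None if none is found."""
    slopes = [float(m.group(1)) for m in BASELINE_RE.finditer(hocr_text)]
    if not slopes:
        return None
    # The baseline slope is rise over run, so the angle is its arctangent.
    return math.degrees(math.atan(median(slopes)))
```

The median is used rather than the mean so that a few badly segmented lines do not dominate the estimate.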

Balearica commented 10 months ago

#4070 allows for retrieving the skew calculated by Tesseract without running recognition. If you use this information to rotate the page, you will find this closes most of the accuracy gap between Tesseract and Abbyy.

It has been my experience that Abbyy blows Tesseract out of the water in real-world usage; however, this is 90% attributable to the fact that Abbyy automatically corrects skew but Tesseract does not. If you rotate each image by the skew angle calculated by Tesseract prior to running Tesseract recognition, Tesseract performs (almost) comparably to Abbyy on high-quality documents.
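
As an illustration of the rotate-then-recognize step, the sketch below only builds the command lines for an ImageMagick rotation followed by a Tesseract run. It does not use the interface from 4070; the skew angle is taken as a plain parameter obtained elsewhere. The function name, the output file naming, and the assumption that the correction is simply the negated skew are all hypothetical, so verify the rotation direction on a test image first.

```python
def deskew_commands(image, skew_degrees, lang="deu", psm=1):
    """Return (rotate_cmd, ocr_cmd) argument lists for deskewing, then OCR.

    Assumes the page should be rotated by the negative of the measured skew.
    """
    stem = image.rsplit(".", 1)[0]
    deskewed = f"{stem}-deskewed.tif"
    correction = 0.0 - skew_degrees  # written this way to avoid printing "-0.000"
    rotate_cmd = [
        "convert", image,
        "-background", "white",          # fill the corners exposed by rotation
        "-rotate", f"{correction:.3f}",
        "+repage", deskewed,
    ]
    ocr_cmd = [
        "tesseract", deskewed, f"{stem}-deskewed",
        "--psm", str(psm), "-l", lang,
    ]
    return rotate_cmd, ocr_cmd
```

Each list can be handed to `subprocess.run(cmd, check=True)`. On ImageMagick 7 the executable is usually `magick` rather than `convert`.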

zdenop commented 10 months ago

Image preprocessing (including deskewing) has been a technique suggested by the Tesseract docs for a year now...

jbarth-ubhd commented 10 months ago

Perhaps the asymmetry in recognition quality between positive and negative angles simply has to do with the traineddata model?

amitdo commented 10 months ago

Did you try both the fast and best models?

jbarth-ubhd commented 10 months ago

I've used only deu.traineddata md5sum f5488b7c3186e822e0e6c5c05c1aaf1f size 15437534

jbarth-ubhd commented 9 months ago

I tend to close this issue, and I think it is important to remind users that Tesseract performs no deskewing.

jbarth-ubhd commented 9 months ago

Error count for tesseract 5.3.3 (-l deu) with angles from -5 to +5 degrees (positive = clockwise) on the first page of this document: https://digi.ub.uni-heidelberg.de/diglitData/v/layout-fouche.pdf (rendered at 400 dpi, black and white)

It seems that the primary segmentation has problems with rotated images.

[Plot angle1: error count vs. rotation angle, -5 to +5 degrees]

[Plot angle2: zoom to -1.5 to +1.5 degrees]

tesseract-ocr / tesseract

Orientation detection "asymmetrical" #4116
