tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

OCR from grayscale TIFFs produces inconsistent results #4268

Closed ed-epiq closed 3 weeks ago

ed-epiq commented 3 weeks ago

Current Behavior

TIFFs that look the same to the user but vary slightly in size produce completely different extracted text. Please see the samples in the attached ZIP.

Our invocation is sudo tesseract .tif -l eng

Expected Behavior

All the text from the image should be extracted. Please see the sample in the attached ZIP.

Suggested Fix

No response

tesseract -v

We tried 2 versions:

tesseract 5.3.3
 leptonica-1.83.1
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511

... and ...

tesseract 5.4.0-rc2-17-g3469
 leptonica-1.79.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
 Found libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3

Operating System

No response

Other Operating System

Ubuntu 20.04.6 LTS ("focal")

uname -a

Linux ip-xx-xxx-xx-xxx d.dd.d-dddd-aws #68~20.04.1-Ubuntu SMP Wed May 1 15:24:09 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Compiler

we invoke tesseract as a command

CPU

hosted in AWS

Virtualization / Containers

tesseract 5.4.0-rc2-17-g3469 was used in Docker; tesseract 5.3.3 was used in AWS.

Other Information

Please see the attached ZIP and read its README.txt for further information.

TesseractOCRissueForTIFFs.zip

stweil commented 3 weeks ago

That's normal, because Tesseract's image processing and layout detection were never designed for such images. In a short test it detects a surprising amount of text:

$ convert M01051\ poster.pdf Image.jpg
$ tesseract Image.jpg - -l script/Latin

NewPower Sign up today!

Connections

24/ 7 NewPower

Energy Manager

ANYWHERE, ANYTIME

 InternetIP
Network

personal
; digital
assistant
WALKING

computer P

] m Pa thermostat

OFFICE

= outdoor lighting

NewPower | InternetHomeAlliance l COACTIVE' min, | SEARS.

Connections NETWORKS

Broadband / Phoneline gateway

existing powerline

FUTURE
CONNECTIONS

waterheater

K

ed-epiq commented 3 weeks ago

I appreciate the quick response, Stefan! Thanks for your analysis of the JPG. But what about the TIFFs, which are part of our established process flow? Could you try those on your end (from the provided ZIP)?

stweil commented 3 weeks ago

Those TIFF files give me similarly bad results to the ones in your tests. If you use different values for the DPI (for example by adding --dpi 600), the results change. You can also try --psm 12 (which will find a lot of relevant text and also a huge amount of wrong text) and many more parameters, but I am afraid that nobody here has the time to help you with recognition issues.

Therefore I am closing this issue.
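
For readers who want to explore the --dpi and --psm suggestions above systematically, here is a minimal Python sketch that runs the tesseract command once per combination of values. It is not something posted in the thread. The helper names (build_commands, run_sweep), the default DPI and PSM values, and the use of a non-whitespace character count as a score are all assumptions made for illustration. The character count is only a crude proxy: as noted above, a mode such as --psm 12 may return more text while much of it is wrong, so the output still needs manual review.

```python
import itertools
import shutil
import subprocess


def build_commands(image, dpis=(150, 300, 600), psms=(3, 11, 12), lang="eng"):
    """Return one tesseract argv list per (dpi, psm) combination.

    The default values are illustrative assumptions, not recommendations.
    """
    return [
        # "stdout" as the output base makes tesseract print the text.
        ["tesseract", image, "stdout",
         "--dpi", str(dpi), "--psm", str(psm), "-l", lang]
        for dpi, psm in itertools.product(dpis, psms)
    ]


def run_sweep(image, **kwargs):
    """Run every combination; map (dpi, psm) to the count of non-whitespace chars."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    results = {}
    for cmd in build_commands(image, **kwargs):
        proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
        dpi, psm = int(cmd[4]), int(cmd[6])
        # Crude proxy only: more characters does not mean more correct text.
        results[(dpi, psm)] = len("".join(proc.stdout.split()))
    return results
```

Calling run_sweep("sample.tif") (where sample.tif is a placeholder file name) returns a dictionary keyed by (dpi, psm), which makes it easy to spot the settings under which two visually identical TIFFs diverge.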