tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Severe OCR quality deterioration with "--dpi 300" versus "--dpi 299" #4321

Open brainsucker-na opened 1 week ago

brainsucker-na commented 1 week ago

Current Behavior

With --dpi 300, Tesseract produces the following degraded result for the attached sample image (several words at the start of the line are missing or garbled):

с or бухгалтерей право второй подписи документов для проведения расчетов по банковским и иным счетам клиента;

sample_crop sample_crop.zip

Full command line: tesseract.exe -l rus --dpi 300 sample_crop.png crop300dpi

Expected Behavior

With --dpi 299, Tesseract produces a much better result for the same image:

Сведения о главном бухгалтере/лице, имеющем право второй подписи документов для проведения расчетов по банковским и иным счетам клиента;

Full command line: tesseract.exe -l rus --dpi 299 sample_crop.png crop299dpi

I would expect it to perform at a similar level with --dpi 300.
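To help narrow this down, the comparison above could be scripted as a small DPI sweep. The following is only a sketch: the helper names (`build_command`, `missing_words`, `run_sweep`) and the list of DPI values are hypothetical, and it assumes `tesseract` is on PATH and that `sample_crop.png` is in the working directory. It uses `stdout` as the output base so the recognized text is printed instead of written to a file.

```python
import difflib
import os
import shutil
import subprocess

IMAGE = "sample_crop.png"  # sample image from this report (assumed to be in the cwd)


def build_command(image, dpi, lang="rus", exe="tesseract"):
    """Return the argv list for one OCR run that prints the text to stdout."""
    # "stdout" as the output base makes Tesseract print the result.
    return [exe, image, "stdout", "-l", lang, "--dpi", str(dpi)]


def missing_words(reference, candidate):
    """List the words of `reference` that were dropped or replaced in `candidate`."""
    ref, cand = reference.split(), candidate.split()
    matcher = difflib.SequenceMatcher(a=ref, b=cand, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            missing.extend(ref[i1:i2])
    return missing


def run_sweep(image, dpis, lang="rus"):
    """Run Tesseract once per DPI value and collect the recognized text."""
    results = {}
    for dpi in dpis:
        proc = subprocess.run(
            build_command(image, dpi, lang),
            capture_output=True,
            encoding="utf-8",
            check=True,
        )
        results[dpi] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    if shutil.which("tesseract") is None or not os.path.exists(IMAGE):
        print("tesseract or the sample image is not available; skipping the sweep")
    else:
        # Hypothetical choice of values around the reported threshold.
        texts = run_sweep(IMAGE, [298, 299, 300, 301])
        for dpi, text in texts.items():
            print(dpi, len(text.split()), "words")
        print("lost at 300 vs 299:", missing_words(texts[299], texts[300]))
```

If the word count drops only at exactly 300 and recovers at 301, that would point to a threshold on the DPI value rather than a gradual scaling effect.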

Suggested Fix

No response

tesseract -v

tesseract v5.4.0.20240606
 leptonica-1.84.1
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.1) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3 : libwebp 1.4.0 : libopenjp2 2.5.2
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.6

UB Mannheim build (setup) with default tessdata files

Operating System

Windows 10

Other Operating System

No response

uname -a

No response

Compiler

No response

CPU

i7-4700MQ

Virtualization / Containers

No response

Other Information

No response

stweil commented 1 week ago

I don't know of any other OCR software that needs DPI information. Ideally, Tesseract should work without it, too. Code contributions toward this goal are welcome, but they must of course not introduce any regressions.