tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Having trouble with simple OCR #4088

Open chaudhryfaisal opened 1 year ago

chaudhryfaisal commented 1 year ago

Current Behavior

!tesseract score.jpg test --oem 1 -l eng --psm 11; cat test.txt

Yields

Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 335
LAJOVIC

0 ty

Beat)

0

Expected Behavior

Correct recognition of the text in the image:

LAJOVIC...0 0
CRESSY 0 15

Suggested Fix

No response

tesseract -v

tesseract 4.1.1
leptonica-1.79.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4

Operating System

Ubuntu 20.04 Focal

Other Operating System

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

uname -a

Linux 8a4ef56ba3d1 5.15.107+ #1 SMP Sat Apr 29 09:15:28 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

[attached image: score]

naourass commented 1 year ago

You need to preprocess the image for OCR to work properly, in particular by binarizing (thresholding) it: https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html#binarisation
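
To illustrate what binarizing means here, below is a minimal sketch of global thresholding using Otsu's method, written in plain Python so it runs without extra packages. It is only one possible technique, not necessarily the one best suited to this image. The function names `otsu_threshold` and `binarize` are made up for this example, and the input is assumed to be an 8-bit grayscale image given as a list of rows of integers (0 to 255). Loading score.jpg into that form and saving the result would need an image library, which is left out.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for an iterable of 8-bit gray values.

    Hypothetical helper for illustration: it picks the threshold that
    maximizes the between-class variance of the two resulting pixel groups.
    """
    # Build a 256-bin histogram of gray levels.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of gray values at or below the candidate threshold

    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no pixels in the dark class yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map each pixel to 0 or 255 using the Otsu threshold of the whole image.

    `image` is assumed to be a list of rows, each a list of ints in 0..255.
    """
    t = otsu_threshold(p for row in image for p in row)
    return [[255 if p > t else 0 for p in row] for row in image]
```

Calling `binarize` on the grayscale pixel rows returns rows containing only 0 and 255. Writing that result back to an image file and running the same `tesseract` command on it may improve recognition. For images with uneven lighting, a single global threshold can fall short and a locally adaptive method may work better.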