madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0

Tesseract confused between a character and a digit which look-alike #432

Closed YashMistry349 closed 2 years ago

YashMistry349 commented 2 years ago

I am having trouble extracting the correct character from a word when a letter and a digit look alike, e.g. 5 & S, I & 1, 8 & S.

I applied image pre-processing techniques (blurring, erosion, dilation, noise normalisation, removal of unnecessary components) and ran text detection on the input image, but even after this much pre-processing Tesseract OCR isn't giving the correct result.

Please check the attached images:

Original Image

original

Pre-processed Image

pre_processed

Detected Text

detected_text1 detected_text2

Tesseract Configuration

-l eng --oem 1 --psm 7 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n" load_system_dawg=false load_freq_dawg=false
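For readers who want to reproduce this, the configuration above could be passed to pytesseract roughly as follows. This is a minimal sketch, not the reporter's actual code: the helper names `build_tesseract_config` and `ocr_single_line` are hypothetical, the language is passed through the `lang` argument instead of `-l eng`, the trailing `\n` in the whitelist is left out, and each variable gets its own `-c` flag (the Tesseract CLI allows multiple `-c` arguments), whereas the original string lists all three after a single `-c`.

```python
import string


def build_tesseract_config(whitelist: str, oem: int = 1, psm: int = 7) -> str:
    """Build a config string with the same intent as the one in this report.

    Hypothetical helper: every variable is passed with its own -c flag.
    """
    options = [
        f"--oem {oem}",  # 1 = LSTM engine only
        f"--psm {psm}",  # 7 = treat the image as a single text line
        f"-c tessedit_char_whitelist={whitelist}",
        "-c load_system_dawg=false",  # disable the system dictionary
        "-c load_freq_dawg=false",    # disable the frequent-words dictionary
    ]
    return " ".join(options)


def ocr_single_line(image_path: str) -> str:
    """Run OCR on one image file and return the stripped text."""
    # Imported lazily so the config helper works even when
    # pytesseract and Pillow are not installed.
    import pytesseract
    from PIL import Image

    config = build_tesseract_config(string.ascii_uppercase + string.digits)
    text = pytesseract.image_to_string(
        Image.open(image_path), lang="eng", config=config
    )
    return text.strip()
```

Calling `ocr_single_line("pre_processed.png")` (the file name is an assumption) would then run the same single-line, uppercase-and-digits setup on the pre-processed image.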

Result of OCR: TITLENUMBER 81003716

As we can see, the OCR reads S as 8 even after pre-processing and text detection.

Is there any way to overcome this problem?

Tesseract Version: tesseract 5.1.0-32-gf36c0

stefan6419846 commented 2 years ago

The same should apply as in https://github.com/madmaze/pytesseract/issues/433#issuecomment-1145813474. Tesseract provides some default models which usually work quite well for a wide range of fonts (being trained with multiple ones), but might require fine-tuning for specific domains/fonts.

bozhodimitrov commented 2 years ago

Closing this one because it is a Tesseract-related issue/question and not specific to pytesseract itself. Thanks for helping other users, @stefan6419846.