Closed YashMistry349 closed 2 years ago
The same should apply as in https://github.com/madmaze/pytesseract/issues/433#issuecomment-1145813474. Tesseract provides some default models which usually work quite well for a wide range of fonts (being trained with multiple ones), but might require fine-tuning for specific domains/fonts.
Closing this one because it is a Tesseract-related issue/question and not specific to pytesseract itself. Thanks for helping other users, @stefan6419846.
I am having trouble extracting the correct character from words that contain look-alike characters, e.g. 5 & S, I & 1, 8 & S.
I applied image pre-processing techniques such as blurring, erosion, dilation, noise normalisation, removal of unnecessary components, and text detection to the input image, but even after this much pre-processing Tesseract OCR does not give the correct result.
Please check the attached images:
Original Image
Pre-processed Image
Detected Text
Tesseract Configuration
-l eng --oem 1 --psm 7 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n" load_system_dawg=false load_freq_dawg=false
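For anyone reproducing this through pytesseract, here is a minimal sketch of how the options above might be passed. It is not the exact call used in this report. As far as I know, the Tesseract CLI expects each parameter to get its own `-c` flag, so the sketch repeats it; the helper names `build_config` and `ocr_line` are made up for illustration, and the `\n` from the original whitelist is left out on the assumption that it is not needed for a single-line (`--psm 7`) crop.

```python
import shlex

# Characters the recogniser is allowed to output (taken from the whitelist above).
WHITELIST = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def build_config(whitelist: str = WHITELIST) -> str:
    """Build the extra-argument string for Tesseract (language is passed separately)."""
    parts = [
        "--oem", "1",   # LSTM engine only
        "--psm", "7",   # treat the image as a single text line
        "-c", f"tessedit_char_whitelist={whitelist}",
        "-c", "load_system_dawg=false",
        "-c", "load_freq_dawg=false",
    ]
    # Quote each part so the string can be split back into arguments safely.
    return " ".join(shlex.quote(p) for p in parts)


def ocr_line(image_path: str) -> str:
    """Hypothetical helper: run OCR on one cropped text line."""
    # Imported lazily so build_config() works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang="eng", config=build_config()).strip()
```

Whether separating the `-c` flags changes the recognition result for this image is untested here; it only makes sure the two dictionary parameters are actually applied.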
Result of OCR: TITLENUMBER 81003716
As we can see, the OCR reads S as 8 even after pre-processing and text detection.
Is there any way to overcome this problem?
Tesseract Version: tesseract 5.1.0-32-gf36c0
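Besides the fine-tuning suggested in the reply, one possible workaround is to correct look-alike characters after OCR when the layout of the field is known in advance. The sketch below assumes a hypothetical layout string in which `A` marks a position that must be a letter and `9` a position that must be a digit. The real format of the number in the image is not stated in this issue, so `A9999999` is only an illustration. The confusion tables cover just the pairs mentioned above, and mapping `S` back to `5` rather than `8` is an arbitrary choice.

```python
# Look-alike pairs from this issue: 5 & S, I & 1, 8 & S.
# Assumption: these tables are illustrative, not exhaustive.
DIGIT_TO_LETTER = {"5": "S", "8": "S", "1": "I"}
LETTER_TO_DIGIT = {"S": "5", "I": "1"}


def fix_lookalikes(token: str, layout: str) -> str:
    """Swap look-alike characters so that `token` matches `layout`.

    `layout` uses 'A' for "must be a letter" and '9' for "must be a digit";
    any other layout character leaves that position untouched.
    If the lengths differ, the token is returned unchanged.
    """
    if len(token) != len(layout):
        return token

    fixed = []
    for ch, kind in zip(token, layout):
        if kind == "A" and ch.isdigit():
            fixed.append(DIGIT_TO_LETTER.get(ch, ch))
        elif kind == "9" and ch.isalpha():
            fixed.append(LETTER_TO_DIGIT.get(ch, ch))
        else:
            fixed.append(ch)
    return "".join(fixed)


# Applied to the number from the OCR result, under the assumed layout:
print(fix_lookalikes("81003716", "A9999999"))  # S1003716
```

This only helps when the expected position of letters and digits is fixed. For free-form text it cannot tell a genuine 8 from a misread S, and a model fine-tuned on the specific font is likely the more robust option.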