Tesseract gives wrong result for this low-quality image

tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

Apache License 2.0


Tesseract gives wrong result for this low-quality image #2182

dviettu134 closed this issue 5 years ago

dviettu134 commented 5 years ago

Environment

Tesseract Version: tesseract v4.0.0.20181030
Platform: Windows 8.1 64 bit

Current Behavior:

I ran Tesseract on the image below and got the result "ais".

[image: test]

Expected Behavior:

The expected result should be: "CTx2 40/5A"

Suggested Fix:

Unknown

PranavArora018 commented 5 years ago

You should try changing the Tesseract config and applying some pre-processing. This is what I got: "Ctx auy5n"

I know it is not perfect, but it is closer to the expected text.

My config: -l eng --oem 3 --psm 6 (--oem 3: default, based on what is available; --psm 6: assume a single uniform block of text).
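As a rough sketch of how the options above could be passed to the Tesseract command-line tool from Python: the helper names `build_tesseract_command` and `run_ocr` and the file name `label.png` are made up for illustration, and `run_ocr` assumes a `tesseract` executable is on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, lang="eng", oem=3, psm=6):
    """Return the argument list for a Tesseract CLI call that prints to stdout."""
    return [
        "tesseract",
        image_path,
        "stdout",            # output base "stdout" prints the text instead of writing a file
        "-l", lang,          # language model
        "--oem", str(oem),   # OCR engine mode (3 = default, based on what is available)
        "--psm", str(psm),   # page segmentation mode (6 = single uniform block of text)
    ]


def run_ocr(image_path, **options):
    """Run Tesseract and return the recognized text (hypothetical helper)."""
    command = build_tesseract_command(image_path, **options)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Example (placeholder file name, not the image from this issue):
# print(run_ocr("label.png", psm=6))
```

Keeping the command construction separate from the call makes it easy to try other page segmentation modes by changing only the `psm` argument.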

Image processing: blurring, dilating, and normalizing

Tesseract Version: tesseract v4.0.0.20181030
Platform: Windows 10 64 bit

Cheers! :)

dviettu134 commented 5 years ago

Hi PranavArora018, Thanks for your suggestions.

Actually, I have to perform OCR on a larger image that contains text segments of this quality (sorry, I cannot provide the whole image here for confidentiality reasons). So the --psm 6 setting may not work here.

Do you think retraining the Tesseract model may help in this case?

rajeshkalpathi commented 5 years ago

Did you try image pre-processing? Looking at the image, you might use one of the following EmguCV methods:

Dilation to sharpen the image.
Threshold the image so that grey soft areas become white and black areas are well defined.

Try with different parameters.
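To illustrate the thresholding idea above, here is a minimal pure-Python sketch rather than EmguCV code. It picks the threshold with Otsu's method, which is an assumption on my side (the suggestion above does not name a specific method), and the tiny `image` grid is made-up sample data, not the image from this issue.

```python
def otsu_threshold(pixels):
    """Return an Otsu threshold (0-255) for a flat list of 8-bit gray values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]
        if weight_bg == 0:
            continue                      # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break                         # all pixels are already in the background class
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor).
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_variance:
            best_variance, best_threshold = between, level
    return best_threshold


def binarize(image, threshold):
    """Map gray values above the threshold to white (255), the rest to black (0)."""
    return [[255 if value > threshold else 0 for value in row] for row in image]


# Made-up 2x3 grayscale sample: dark strokes around 30-40, light background around 200-220.
image = [[30, 40, 200], [35, 210, 220]]
threshold = otsu_threshold([value for row in image for value in row])
binary = binarize(image, threshold)
```

On a real scan the same two steps would run on the full pixel array, and it is worth comparing the result against a few manually chosen thresholds.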

Please post the result if possible.

zdenop commented 5 years ago

I am afraid it is not a realistic expectation to get correct output from bad input. I am not even sure whether I see 5A or SA there... When I checked how Tesseract segmented your image into symbols, I got 3 overlapping boxes:

So there is not enough space to split the input image into symbols correctly. Anyway, with psm 11 and tessdata_best I got the output: 40/5. The rest of the image is ignored...
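The overlap check mentioned above can be sketched as follows, assuming the symbol boxes are available in Tesseract's box-file layout (`<symbol> <left> <bottom> <right> <top> <page>`, for example from the `makebox` config). How the boxes were actually inspected is not stated in this thread, and the coordinates in `sample_lines` are invented for illustration only.

```python
def parse_box_line(line):
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    symbol, left, bottom, right, top, _page = line.rsplit(" ", 5)
    return symbol, int(left), int(bottom), int(right), int(top)


def overlapping_pairs(boxes):
    """Return pairs of symbols whose boxes overlap in both x and y."""
    pairs = []
    for i, (sym_a, l_a, b_a, r_a, t_a) in enumerate(boxes):
        for sym_b, l_b, b_b, r_b, t_b in boxes[i + 1:]:
            overlaps_x = l_a < r_b and l_b < r_a
            overlaps_y = b_a < t_b and b_b < t_a
            if overlaps_x and overlaps_y:
                pairs.append((sym_a, sym_b))
    return pairs


# Hypothetical coordinates, NOT the real boxes from this issue.
sample_lines = [
    "4 10 5 30 40 0",
    "0 25 5 45 40 0",
    "/ 40 5 55 40 0",
    "5 70 5 90 40 0",
]
boxes = [parse_box_line(line) for line in sample_lines]
pairs = overlapping_pairs(boxes)
```

Any pair reported here marks neighbouring symbols whose boxes share pixels, which is the situation where the segmentation cannot separate the characters cleanly.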