tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Inconsistencies in detection and extraction of text using tesseract #4255

Open saanvib13 opened 5 months ago

saanvib13 commented 5 months ago

Your Feature Request

I have attached the image from which I am trying to extract text using Tesseract OCR. [image: output]

I have also attached the text extracted from that image. [image: input]

As the images show, the extracted text is not very accurate: negative signs have been omitted, and some unwanted characters appear in the extracted text. (I have marked some of the incorrect results with blue boxes.) I have tried to improve the results by preprocessing the images and changing the model parameters. I have tried:

  1. binarizing the images
  2. HDR processing of the images

Even then, such inconsistencies remain.

How can I improve the detection and extraction of text in Tesseract? I have also tried PaddleOCR for the same task. Even then, symbols such as the euro sign and some negative signs are not being detected.
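For numeric table cells like these, two settings that are often worth trying are the page segmentation mode and a character whitelist. The following is a minimal sketch, not a confirmed fix for this image. `build_tesseract_cmd` and `run_ocr` are hypothetical helper names, and `table.png`, `--psm 6`, and the whitelist string are assumed values. `--psm`, `-l`, `-c`, and `tessedit_char_whitelist` are existing Tesseract CLI options, but whether the whitelist is honored depends on the Tesseract version and engine, and whether `€` can be recognized depends on the traineddata in use.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, psm=6, whitelist="0123456789.,-€%", lang="eng"):
    """Return the argv list for a tesseract CLI call (does not run it).

    psm=6 asks Tesseract to treat the image as a single uniform block of text.
    The whitelist is an assumption for numeric-only cells; drop it for mixed text.
    """
    return [
        "tesseract", image_path, "stdout",  # "stdout" prints the text instead of writing a file
        "-l", lang,
        "--psm", str(psm),
        "-c", f"tessedit_char_whitelist={whitelist}",
    ]


def run_ocr(image_path, **kwargs):
    """Run tesseract on image_path and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_cmd(image_path, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only builds and prints the command; call run_ocr("table.png") to execute it.
    print(" ".join(build_tesseract_cmd("table.png")))
```

Comparing the output across a few `psm` values (for example 4, 6, and 11) on the same crop is a cheap way to see whether the layout analysis, rather than the character recognition, is dropping the signs.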

zdenop commented 5 months ago

What about reading the documentation?

saanvib13 commented 5 months ago

@zdenop Thank you for your response. I tried each and every step mentioned in this documentation. Even then, some decimal points are being omitted; for example, 22.5 is read as 225. Moreover, some numbers are being wrongly detected; for example, -9 is extracted as = ). Some negative signs are also being omitted. I have tried preprocessing the images and have implemented the following:

  1. noise removal
  2. Canny edge detection
  3. Hough line transform
  4. binarization
  5. HDR processing

Please provide your guidance and help me resolve this issue.
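Until the recognition itself improves, errors like these can at least be caught after OCR. The sketch below is only an illustration under stated assumptions: `flag_suspicious_cells` is a hypothetical helper, the regular expression assumes cells hold a signed decimal number with an optional `€` or `%` suffix, and the `lo`/`hi` bounds are made-up plausibility limits that would have to come from knowledge of the table. A pattern check can flag output such as `= )`, while a dropped decimal point (22.5 read as 225) is only detectable if the result falls outside the expected range.

```python
import re

# Signed decimal number with an optional currency or percent suffix,
# e.g. "-9", "22.5", "3.0 %". This format is an assumption about the table.
NUMERIC_CELL = re.compile(r"^[-+]?\d+(?:\.\d+)?\s*[€%]?$")


def flag_suspicious_cells(cells, lo=None, hi=None):
    """Return (index, text, reason) for cells that do not look like valid numbers."""
    flagged = []
    for i, text in enumerate(cells):
        stripped = text.strip()
        if not NUMERIC_CELL.match(stripped):
            flagged.append((i, text, "not a number"))
            continue
        # Remove the optional suffix before converting to a float.
        value = float(stripped.rstrip("€%").strip())
        too_low = lo is not None and value < lo
        too_high = hi is not None and value > hi
        if too_low or too_high:
            flagged.append((i, text, "out of range"))
    return flagged


if __name__ == "__main__":
    # Hypothetical OCR output for four cells, with assumed bounds of -100..100.
    for item in flag_suspicious_cells(["22.5", "= )", "225", "-9"], lo=-100, hi=100):
        print(item)
```

Flagged cells can then be re-run individually, for example on a tighter crop or with a different page segmentation mode, instead of reprocessing the whole image.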

zdenop commented 5 months ago

And what did you learn about table recognition? What do the forum posts and other issues say about table recognition? You should check these sources BEFORE posting the issue.

rmast commented 5 months ago

This mod seems to do a slightly better job, still not flawless... [image]