tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Using tesseract for generating searchable PDF with images containing multiple orientation text blocks #2055

Open sreeni5493 opened 5 years ago

sreeni5493 commented 5 years ago

[Image: test]

The example image above has two simple text blocks: a horizontal block, "Hello WORLD", and a vertical block, "HI WORLD". When I run Tesseract, it recognizes only the horizontal text correctly; the vertical text comes out wrong, mostly as garbage.

I work with artwork PDFs and need to run OCR on them to obtain searchable PDFs. I can of course do some simple preprocessing, such as removing lines and binarizing, so that only text remains. But Tesseract detects only the dominant orientation and produces bad results for text in any other orientation. Is there a way to give it the block locations (x, y, width, and height for each block) so that it recognizes the text in every block and writes it back into a searchable PDF together with the original image? How can I help Tesseract improve accuracy on images whose text blocks have multiple orientations?
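To make the block-location idea concrete, here is a minimal Python sketch of the coordinate bookkeeping it implies. It is not something the thread or Tesseract provides. It assumes each block is cropped from the page and rotated clockwise by 0, 90, 180, or 270 degrees so that its text is upright before recognition, and that the rotation is supplied by the user. The function names `unrotate_point` and `map_box_to_page` are hypothetical. The sketch covers only the step of mapping a word box from the rotated crop back to page coordinates; running the OCR and writing the PDF text layer are left out.

```python
def unrotate_point(rx, ry, crop_w, crop_h, cw_degrees):
    """Map a point in the rotated crop back to the unrotated crop.

    crop_w and crop_h are the size of the crop BEFORE rotation.
    cw_degrees is the clockwise rotation that was applied to the crop.
    Coordinates have their origin at the top-left, with y pointing down.
    """
    if cw_degrees == 0:
        return rx, ry
    if cw_degrees == 90:
        return ry, crop_h - rx
    if cw_degrees == 180:
        return crop_w - rx, crop_h - ry
    if cw_degrees == 270:
        return crop_w - ry, rx
    raise ValueError("cw_degrees must be 0, 90, 180, or 270")


def map_box_to_page(box, block, cw_degrees):
    """Convert a word box found in a rotated crop to page coordinates.

    box   -- (left, top, right, bottom) in the rotated crop
    block -- (x, y, width, height) of the block on the page
    Returns (left, top, right, bottom) on the page.
    """
    left, top, right, bottom = box
    x, y, w, h = block
    # Undo the rotation for two opposite corners of the box.
    ax, ay = unrotate_point(left, top, w, h, cw_degrees)
    bx, by = unrotate_point(right, bottom, w, h, cw_degrees)
    # Rotation can swap which corner is top-left, so normalize,
    # then shift by the block's position on the page.
    return (
        x + min(ax, bx),
        y + min(ay, by),
        x + max(ax, bx),
        y + max(ay, by),
    )


if __name__ == "__main__":
    # A tall block whose text reads bottom-to-top on the page,
    # so a 90-degree clockwise rotation makes it upright.
    vertical_block = (100, 50, 20, 200)
    word_in_rotated_crop = (10, 2, 60, 18)
    print(map_box_to_page(word_in_rotated_crop, vertical_block, 90))
```

In this example the word sits near the left edge of the upright crop, so it maps to a box near the bottom of the vertical block on the page, which is where bottom-to-top text begins.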

amitdo commented 5 years ago

Please post the full command you used.

sreeni5493 commented 5 years ago

tesseract sample_image.png sample_image_tesseract --psm 1 pdf

CanadianHusky commented 5 years ago

I would be highly interested in such a feature as well and believe it has great value. I read the notes in the "Projects" tab about future versions and the wish to collect ideas and requirements. Improved "searchable PDF" functions, such as support for multiple orientations within the content (or the resolution enhancements mentioned in #2108), would certainly be a good subject that is relevant to a large audience. Thank you.

jbreiden commented 5 years ago

Tesseract's PDF generation should already happily handle this, assuming Tesseract gets the recognition right.