ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0
14.12k stars 1.02k forks

Add "ROI detection" pre-processing option to improve OCR results? #892

Open dtmland opened 2 years ago

dtmland commented 2 years ago

Fantastic project, performs the exact function I was hoping to find.

However, as the documentation states, ocrmypdf is subject to limitations inherited from tesseract. After attempting to use some high school yearbook scans as input (many similar examples are available with a free account on classmates.com), I eventually became aware of limitations associated with tesseract's page segmentation modes. Where tesseract detected a block of text, OCR worked fantastically. Where it failed to detect a block, no OCR was performed on that block at all.

I've since come across several online resources that attempt to solve this type of problem. For example, before any content is sent to tesseract, a pre-processing step can detect ROI "text blocks" using tools such as opencv, or even camelot, as you recommend in one case.

In fact, the general implementation described in the table issue thread seems as though it could apply to blocks of text in documents such as the yearbook, and perhaps to any scan a user may pass to ocrmypdf? Before finding that thread, I attempted a manual pre-process of a similar nature: selecting a block of text on a page and inverting the selection to white out all content except the block. Passing this "masked all except the block of text" PDF to tesseract results in successful block detection and subsequent OCR of the text.
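The manual white-out step described above can be sketched in a few lines. This is only an illustration of the masking idea on a grayscale grid, assuming the block's bounding box is already known; `mask_outside` and its parameters are hypothetical names, not part of ocrmypdf or tesseract.

```python
from typing import List, Tuple

Page = List[List[int]]            # grayscale rows: 0 = black, 255 = white
Box = Tuple[int, int, int, int]   # (top, left, bottom, right), bottom/right exclusive


def mask_outside(page: Page, box: Box, white: int = 255) -> Page:
    """Return a copy of the page with every pixel outside `box` set to white."""
    top, left, bottom, right = box
    return [
        [
            px if (top <= y < bottom and left <= x < right) else white
            for x, px in enumerate(row)
        ]
        for y, row in enumerate(page)
    ]


# Example: keep only the block in the middle row, columns 1-2.
page = [[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]]
masked = mask_outside(page, (1, 1, 2, 3))
```

Running one masked copy per block through OCR would mean one tesseract pass per block, so the cost grows with the number of blocks on the page.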

It would be great to have an option added to mainline ocrmypdf that performs preprocessing and passes such individual blocks of content to tesseract to improve its likelihood of success. Perhaps called 'ROI detection' pre-processing or something of this nature.

jbarlow83 commented 2 years ago

It's possible to replace the OCR engine with a plugin of your choosing, or to use plugins to manipulate what the OCR engine sees.

It's unlikely I'll have the time to work on this any time soon.