ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Custom pre-processing steps #424

Status: Closed (pranasziaukas closed this issue 4 years ago)

pranasziaukas commented 5 years ago

First and foremost, my hat's off to you for what appears to be an excellent PDF OCR'ing solution.

My use case: I have a bunch of old documents, some of which have low-quality, "spongy" text. Currently, --clean removes the background noise nicely, but (AFAIK) OCRmyPDF has nothing similar for the text itself.

I noticed a "plugin system" being mentioned elsewhere but I'm not sure what it is.

How would you recommend approaching this problem? I'm fine with writing Python scripts.

jbarlow83 commented 5 years ago

Yes, I have a plugin system in a development branch, but it didn't make the cut for the 9.0 release. I'd rather get it right even if it takes longer, so I don't have to break compatibility with plugins.

For now you might as well fork and make whatever changes seem appropriate. You could override the current behavior of --clean, for example. That said, I don't understand exactly what you're trying to do. Clean up the output text?

pranasziaukas commented 5 years ago

I'm trying to enhance the readability of the text in an image (I was thinking about median or bilateral filtering to start with).

So I guess forking is the way to go for now. Do you have even the slightest idea when the plugin system might be released?
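
For illustration only, here is a minimal sketch of the median filtering idea mentioned above. It is not part of OCRmyPDF and does not hook into --clean. It assumes the page has already been loaded as a grayscale image stored as a list of rows of integers (0 = black, 255 = white); the function name `median_filter` and its `radius` parameter are hypothetical. In practice an image library (for example Pillow's `ImageFilter.MedianFilter` or OpenCV's `cv2.medianBlur` and `cv2.bilateralFilter`) would likely be a faster choice.

```python
from statistics import median_low


def median_filter(pixels, radius=1):
    """Return a median-filtered copy of a 2D grayscale image.

    `pixels` is a list of equal-length rows of ints (assumed layout).
    `radius=1` gives a 3x3 window; the window is cropped at the borders.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather the neighbourhood around (x, y), cropped to the image.
            window = [
                pixels[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns a value that occurs in the window,
            # so no new gray levels are introduced.
            row.append(median_low(window))
        result.append(row)
    return result


# Usage: a white 5x5 page with a single black speck in the middle.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0
cleaned = median_filter(noisy)
print(cleaned[2][2])  # prints 255: the isolated speck is removed
```

One caveat with this approach: a 3x3 median removes isolated specks, but it can also erode strokes that are only one pixel wide, so it is worth testing on a sample page before applying it to a whole batch.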

jbarlow83 commented 5 years ago

I think forking will be best for your case.

jbarlow83 commented 4 years ago

I'll close the issue now. If you have further related questions feel free to reopen it.