ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Plugin Engine #717

Closed maldag closed 3 years ago

maldag commented 3 years ago

I've been working with this tool for years now and I'm trying to optimize my workflow further. Is there a way to get the following tasks done:

jbarlow83 commented 3 years ago

Generally yes.

The worker processes get copies of the shared job context, but they have no way to pass page-level decisions back, because ocrmypdf doesn't need information to flow in that direction. So if, for example, your plugin used filter_ocr_image to process a page image and match it to templates, you'd need some external data store to put the result in, to allow for the possibility that a different process queries it later. A small database would do the job, or multiprocessing.Manager.
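As a rough illustration of the "small database" option, here is a minimal sketch that uses SQLite from the standard library as the external store. The file name, table layout, and the helpers init_store, record_decision, read_decisions, and match_template are all hypothetical. The hook is shown undecorated so the snippet runs without ocrmypdf installed; in a real plugin it would carry the @ocrmypdf.hookimpl decorator. Reading the page number from page.pageno is an assumption about the page context object.

```python
import sqlite3
from contextlib import closing

# Hypothetical location of the shared store; any path all workers can reach works.
DEFAULT_DB = "page_decisions.sqlite3"


def init_store(db_path=DEFAULT_DB):
    """Create the table once, before OCR starts."""
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS page_decisions ("
            "pageno INTEGER PRIMARY KEY, template TEXT NOT NULL)"
        )


def record_decision(pageno, template, db_path=DEFAULT_DB):
    """Store one page-level decision; safe to call from any worker process.

    Each call opens its own connection, and SQLite serializes writers,
    so the timeout gives a busy database time to become free.
    """
    with closing(sqlite3.connect(db_path, timeout=30)) as conn, conn:
        conn.execute(
            "INSERT OR REPLACE INTO page_decisions (pageno, template) VALUES (?, ?)",
            (pageno, template),
        )


def read_decisions(db_path=DEFAULT_DB):
    """Return {pageno: template} for whichever process asks later."""
    with closing(sqlite3.connect(db_path)) as conn:
        rows = conn.execute("SELECT pageno, template FROM page_decisions").fetchall()
    return dict(rows)


def match_template(image):
    """Hypothetical placeholder for the actual template matching."""
    return "unknown"


def filter_ocr_image(page, image):
    # In a real plugin this function would be decorated with @ocrmypdf.hookimpl.
    # page.pageno is assumed to identify the page being processed.
    record_decision(page.pageno, match_template(image))
    return image  # hand the image back unchanged so OCR proceeds as usual
```

A script that drives ocrmypdf could call init_store() before the run and read_decisions() afterwards. multiprocessing.Manager would be the alternative if the data only needs to live for the duration of one job.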

For something like renaming output files, I suggest calling ocrmypdf's API in a script, saving the file to a fixed location, and then moving it from there to its final name. That seems easier to me than injecting that logic into ocrmypdf with a plugin.
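A minimal sketch of that script approach follows. It assumes ocrmypdf.ocr is called with just an input and an output path. The directory names, the final_name_for naming rule, and the run_ocr parameter (added here only so the function can be exercised without ocrmypdf installed) are hypothetical.

```python
import shutil
from pathlib import Path


def final_name_for(input_pdf: Path) -> str:
    """Hypothetical naming rule; replace with whatever the workflow needs."""
    return f"{input_pdf.stem}_ocr.pdf"


def ocr_and_file(input_pdf, staging_dir, archive_dir, run_ocr=None):
    """OCR input_pdf into a fixed staging file, then move it to its final name."""
    input_pdf = Path(input_pdf)
    staging_dir = Path(staging_dir)
    archive_dir = Path(archive_dir)

    if run_ocr is None:
        # Imported lazily so the rest of the script works without ocrmypdf.
        import ocrmypdf
        run_ocr = ocrmypdf.ocr

    staging_dir.mkdir(parents=True, exist_ok=True)
    archive_dir.mkdir(parents=True, exist_ok=True)

    staged = staging_dir / "output.pdf"  # the fixed location
    run_ocr(input_pdf, staged)

    target = archive_dir / final_name_for(input_pdf)
    shutil.move(str(staged), str(target))
    return target
```

Calling ocr_and_file("scan.pdf", "staging", "archive") would then leave the result at archive/scan_ocr.pdf, with all renaming logic kept outside ocrmypdf itself.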

I can probably add new plugin hooks if you identify a need for them.

maldag commented 3 years ago

So you think it might be better to use a separate Python project to do this? Is there any need here for such a feature? To me that looks like the logical next step.

jbarlow83 commented 3 years ago

Whether a plugin goes into a separate project or is just a standalone script is a question of how you want to organize it.

> Is there any need here for such feature?

Which feature are you referring to?

maldag commented 3 years ago

> Whether a plugin goes into a separate project or is just a standalone script is a question of how you want to organize it.
>
> > Is there any need here for such feature?
>
> Which feature are you referring to?

I was thinking about extending this project into a lightweight DMS (document management system), where information gets extracted from the recognized text.

jbarlow83 commented 3 years ago

Have a look at paperless-ng in case it fits.

If you're adding several thousand new lines of code, that sounds more like you want to encapsulate it in a project, possibly one that supplies a plugin or two to help get data out.

jbarlow83 commented 3 years ago

I'll close the issue now. If you have further related questions, feel free to reopen it.