ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

http://ocrmypdf.readthedocs.io/

Mozilla Public License 2.0

Plugin Engine #717

Closed maldag closed 3 years ago

maldag commented 3 years ago

I've been working with this tool for years now and I'm trying to optimize my workflow further. Is there a way to get the following tasks done?

Extraction of the following keys from the image:
- Date of the paper
- Company sending it (identified by corporate logo using OpenCV template matching, or by a text database)
- Topic of the letter, if available
- The reference key that some letters state ("your key" or "our key"), based on database records (which would mostly work once the corresponding company has been identified)

Automatic naming of output files based on extracted information like the above.

jbarlow83 commented 3 years ago

Generally yes.

The worker processes get copies of the shared job context, but they have no way to pass page-level decisions back, because ocrmypdf doesn't need information flowing in that direction. So if, for example, your plugin used filter_ocr_image to process a page image and match it to templates, you'd need some external data store to put the result in, so that a different process can query it later. A small database would do the job, or multiprocessing.Manager.
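As an illustration of the "small database" option, here is a minimal sketch of a plugin module that records a per-page result in SQLite from inside filter_ocr_image. It is not taken from ocrmypdf itself: `identify_company`, `record_page_match`, `load_matches`, the `PAGE_MATCH_DB` environment variable, and the table layout are all hypothetical names chosen for this example, and the hook is assumed to receive a page context exposing `origin` and `pageno`.

```python
import os
import sqlite3

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so the sketch can be imported where ocrmypdf is not installed.
    def hookimpl(func):
        return func


def identify_company(image):
    """Hypothetical placeholder: real code might do template matching here."""
    return "unknown"


def record_page_match(db_path, origin, pageno, company):
    """Store one page-level result; safe to call from several worker processes."""
    con = sqlite3.connect(db_path, timeout=30)  # wait if another worker holds the lock
    try:
        with con:  # commits on success, rolls back on error
            con.execute(
                "CREATE TABLE IF NOT EXISTS page_match ("
                "origin TEXT, pageno INTEGER, company TEXT, "
                "PRIMARY KEY (origin, pageno))"
            )
            con.execute(
                "INSERT OR REPLACE INTO page_match VALUES (?, ?, ?)",
                (origin, pageno, company),
            )
    finally:
        con.close()


def load_matches(db_path, origin):
    """Read results back later, possibly from a different process."""
    con = sqlite3.connect(db_path, timeout=30)
    try:
        rows = con.execute(
            "SELECT pageno, company FROM page_match WHERE origin = ? ORDER BY pageno",
            (origin,),
        ).fetchall()
    finally:
        con.close()
    return rows


@hookimpl
def filter_ocr_image(page, image):
    # The database path is an assumption of this sketch, read at call time.
    db_path = os.environ.get("PAGE_MATCH_DB", "page_matches.sqlite3")
    record_page_match(db_path, str(page.origin), page.pageno, identify_company(image))
    return image  # the image is passed through unchanged
```

A calling script could then use `load_matches` after the OCR run has finished to decide what to do with the output file.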

For something like renaming output files, I suggest calling ocrmypdf's API from a script, saving the file to a fixed location, and then moving it from there to its final name. That seems easier to me than injecting that logic into ocrmypdf with a plugin.
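The script approach could look roughly like the sketch below. It assumes the metadata (date, company, topic) has already been extracted by some other means. `build_filename`, `move_to_final`, and `ocr_and_rename` are hypothetical helpers written for this example, and the only ocrmypdf call used is `ocrmypdf.ocr(input_file, output_file)`.

```python
import re
import shutil
from pathlib import Path


def build_filename(date, company, topic=""):
    """Join the extracted fields into a filesystem-friendly PDF name."""
    cleaned = (
        re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-")
        for part in (date, company, topic)
        if part
    )
    name = "_".join(c for c in cleaned if c)
    return (name or "untitled") + ".pdf"


def move_to_final(staging_pdf, dest_dir, metadata):
    """Move the OCRed file from its fixed staging location to its final name."""
    dest = Path(dest_dir) / build_filename(**metadata)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(staging_pdf), str(dest))
    return dest


def ocr_and_rename(input_pdf, staging_pdf, dest_dir, metadata):
    """Run OCR into the staging file, then rename it based on the metadata."""
    import ocrmypdf  # imported lazily so the helpers above work without it

    ocrmypdf.ocr(input_pdf, staging_pdf)
    return move_to_final(staging_pdf, dest_dir, metadata)
```

For example, `build_filename("2021-02-03", "ACME Corp.", "Invoice")` yields `2021-02-03_ACME-Corp_Invoice.pdf`. Keeping the naming logic outside ocrmypdf means it can be changed and tested without touching the OCR step.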

I can probably add new plugin hooks if you identify a need for them.

maldag commented 3 years ago

So you think it would be better to use a separate Python project for this? Is there any need for such a feature here? To me it looks like the logical next step.

jbarlow83 commented 3 years ago

Whether a plugin goes into a separate project or is just a standalone script is a question of how you want to organize it.

> Is there any need here for such feature?

Which feature are you referring to?

maldag commented 3 years ago

> Whether a plugin goes into a separate project or is just a standalone script is a question of how you want to organize it.
>
> > Is there any need here for such feature?
>
> Which feature are you referring to?

I was thinking about extending this project into a lightweight DMS (document management system), where information gets extracted from the recognized text.

jbarlow83 commented 3 years ago

Have a look at paperless-ng in case it fits.

If you're adding several thousand new lines of code, that sounds more like you want to encapsulate it in its own project, possibly one that supplies a plugin or two to help get data out.

jbarlow83 commented 3 years ago

I'll close the issue now. If you have further related questions, feel free to reopen it.