maldag closed this 3 years ago
Generally yes.
The worker processes get copies of the shared job context, but they have no way to pass page-level decisions back, because ocrmypdf has no need for information to flow in that direction. So if, for example, your plugin used filter_ocr_image to process a page image and match it to templates, you'd need some external data store to put the result in, so that a different process can query it later. A small database would do the job, or multiprocessing.Manager.
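As a rough illustration of the "small database" idea, here is a minimal sketch of a plugin that writes one row per page into SQLite from inside filter_ocr_image, so that another process can read the results afterwards. The file name page_decisions.sqlite3, the table layout, and the classify helper are all hypothetical, and the use of page.origin and page.pageno reflects my understanding of the page context object rather than anything confirmed in this thread.

```python
import sqlite3
from contextlib import closing
from pathlib import Path

try:
    from ocrmypdf import hookimpl
except ImportError:
    # Fallback so the storage helpers can be tried without ocrmypdf installed.
    def hookimpl(func):
        return func

# Hypothetical location; every worker process must be able to reach it.
DB_PATH = Path("page_decisions.sqlite3")


def _connect(db_path):
    """Open the database and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(str(db_path), timeout=30)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS decisions ("
        "source TEXT NOT NULL, pageno INTEGER NOT NULL, label TEXT NOT NULL, "
        "PRIMARY KEY (source, pageno))"
    )
    conn.commit()
    return conn


def record_decision(db_path, source, pageno, label):
    """Store (or overwrite) the decision made for one page."""
    with closing(_connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO decisions VALUES (?, ?, ?)",
                (source, pageno, label),
            )


def load_decisions(db_path, source):
    """Return {pageno: label} for one input file, from any process."""
    with closing(_connect(db_path)) as conn:
        rows = conn.execute(
            "SELECT pageno, label FROM decisions WHERE source = ?", (source,)
        ).fetchall()
    return dict(rows)


def classify(image):
    """Hypothetical stand-in for real template matching."""
    return "landscape" if image.width > image.height else "portrait"


@hookimpl
def filter_ocr_image(page, image):
    # Assumption: page.origin is the input file and page.pageno the page index.
    record_decision(DB_PATH, str(page.origin), page.pageno, classify(image))
    return image  # hand the image back unchanged for OCR
```

After the run finishes, the calling script (or any other process) could call load_decisions(DB_PATH, "input.pdf") to see what each page was matched to.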
For something like renaming output files, I suggest calling ocrmypdf's API in a script, saving the file to a fixed location, and then moving it to its final name. That seems easier to me than injecting that logic into ocrmypdf with a plugin.
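A minimal sketch of that script approach might look like the following. It assumes the ocrmypdf.ocr(input_file, output_file) entry point of the Python API; the ocr_then_move function and its run_ocr parameter are hypothetical names introduced here, the latter only so the file handling can be tried without running OCR.

```python
import shutil
import tempfile
from pathlib import Path


def ocr_then_move(input_pdf, final_path, run_ocr=None):
    """OCR input_pdf into a staging file, then move it to final_path."""
    if run_ocr is None:
        import ocrmypdf  # imported lazily; only needed for the real run

        run_ocr = ocrmypdf.ocr

    final_path = Path(final_path)
    final_path.parent.mkdir(parents=True, exist_ok=True)

    with tempfile.TemporaryDirectory() as tmp:
        staged = Path(tmp) / "ocr_output.pdf"  # fixed staging name
        run_ocr(input_pdf, staged)
        # The final name can be computed here, e.g. from the recognized text.
        shutil.move(str(staged), str(final_path))
    return final_path
```

A call such as ocr_then_move("scan.pdf", "archive/2021-invoice.pdf") would keep the renaming logic in your own script rather than inside a plugin.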
I can probably add new plugin hooks if you identify a need for them.
So you think it would be better to use a separate Python project for this? Is there any need here for such a feature? To me that looks like the logical next step.
Whether a plugin goes into a separate project or is just a standalone script is a question of how you want to organize it.
Is there any need here for such a feature?
Which feature are you referring to?
I was thinking about extending this project into a lightweight DMS (document management system), where information gets extracted from the recognized text.
Have a look at paperless-ng in case it fits.
If you're adding several thousand new lines of code, that sounds more like something you'd want to encapsulate in its own project, possibly one that supplies a plugin or two to help get data out.
I'll close the issue now. If you have further related questions feel free to reopen it.
I've been working with this tool for years now and I'm trying to optimize my workflow further. Is there a way to get the following tasks done: