phu54321 closed this 5 months ago
Is there any way we could share it across 'documents' as well? I OCR many small single-page documents: receipts.
I know this rabbit hole gets deep quickly, but I don't think I'm alone.
Could we: 1) check if there's an instance running; 2) if not, fork to start one; 3) then use the running instance?
The running instance would then have some code that terminates it if it's not used for 60 seconds? I'd hope to be able to adjust that timer as well. I'd probably go for 300 seconds in my case, as the script that calls ocrmypdf is rather sloppy and traverses a whole lot of PDFs to find just a few that need it.
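To make the idle-timer idea concrete, here is a minimal sketch of just that part: a loop that serves requests and exits once nothing has arrived for a configurable number of seconds. It does not cover detecting or forking a shared instance. The names `serve_until_idle`, `handle` and `idle_timeout` are made up for illustration and are not part of ocrmypdf or the plugin.

```python
import queue
import threading


def serve_until_idle(requests, handle, idle_timeout=60.0):
    """Handle queued items until none arrives for idle_timeout seconds.

    Returns the number of items handled before shutting down.
    """
    handled = 0
    while True:
        try:
            # Block for at most idle_timeout seconds waiting for work.
            item = requests.get(timeout=idle_timeout)
        except queue.Empty:
            return handled  # idle for too long: terminate the instance
        handle(item)
        handled += 1


if __name__ == "__main__":
    requests = queue.Queue()
    results = []
    # Short timeout so the demo finishes quickly; 60 or 300 would be typical.
    server = threading.Thread(
        target=serve_until_idle, args=(requests, results.append, 0.2)
    )
    server.start()
    for name in ["a.pdf", "b.pdf"]:
        requests.put(name)
    server.join()  # the loop exits about 0.2 s after the last request
    print(results)
```

The timer resets on every request, so a sloppy caller that only occasionally finds a PDF to OCR keeps the instance alive as long as the gaps stay under the timeout.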
You're welcome to contribute changes/PRs, but until then it will have to wait until I get a chance to look at it more seriously. No ETA, and as the README says, this is not production ready. I only have so much time...
You'd probably get better results by spawning a worker process that owns the sole instance of EasyOCR.Reader() and having other threads/processes feed its queue. I don't know if having multiple Readers around makes EasyOCR go any faster.
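A rough sketch of that shape, using a worker thread rather than a worker process to keep it short: one thread builds the reader once and owns it, and callers hand it work through a queue and get a `Future` back. `OcrWorker` and `FakeReader` are hypothetical names; `FakeReader` only stands in for `easyocr.Reader` so the sketch runs without EasyOCR installed.

```python
import queue
import threading
from concurrent.futures import Future


class FakeReader:
    """Stand-in for easyocr.Reader (assumption: only readtext() is needed)."""

    def readtext(self, image):
        return f"text from {image}"


class OcrWorker:
    """Owns the sole reader; other threads feed it through a queue."""

    def __init__(self, reader_factory):
        self._requests = queue.Queue()
        self._thread = threading.Thread(
            target=self._run, args=(reader_factory,), daemon=True
        )
        self._thread.start()

    def _run(self, reader_factory):
        reader = reader_factory()  # constructed once, inside the worker only
        while True:
            job = self._requests.get()
            if job is None:  # sentinel: shut down
                break
            image, future = job
            try:
                future.set_result(reader.readtext(image))
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, image):
        """Queue one image and return a Future for its OCR result."""
        future = Future()
        self._requests.put((image, future))
        return future

    def close(self):
        self._requests.put(None)
        self._thread.join()


if __name__ == "__main__":
    worker = OcrWorker(FakeReader)
    futures = [worker.submit(page) for page in ("page1.png", "page2.png")]
    print([f.result() for f in futures])
    worker.close()
```

With the real library the factory would construct the actual `easyocr.Reader`; this only works as-is when the callers are threads in the same process.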
I'm actually working on a PR to implement this, but there is a problem:
OCRmyPDF defaults to StandardExecutor (multiprocessing), so local variables aren't shared between processes. To have a single GPU-only worker process, we need a way to share a single Queue or variables across different workers.
For example, allow adding custom parameters to each worker invocation, or force the 'use_threads' option from the plugin side.
The plugin documentation outlines the context in which hooks are called. You could implement a hook for check_options() to create a GPU worker and a queue to communicate with it, and attach whatever state is needed to options, e.g. options._easyocr_gpu_queue. The options object is passed to workers. Then you will need to change the implementation of EasyOcrEngine to put a message in the queue and signal completion back to workers (perhaps each worker creates a completion queue, sends its request to the worker, and the OCR worker pings the specified completion queue when done).
multiprocessing.Queue is thread- and process-safe, supports multiple producers, and is capable of marshaling itself across process boundaries, so it should do everything needed.
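A minimal sketch of the request/completion-queue pattern described above, outside of OCRmyPDF. All names (`gpu_worker`, `page_worker`, `run_demo`, `fake_ocr`) are hypothetical, and `fake_ocr` stands in for a call on the single Reader. One deviation from the description: the completion queues are created up front and handed to each process at creation time, because a plain `multiprocessing.Queue` raises `RuntimeError` if you try to send it inside a message on another queue. The demo also assumes the POSIX `fork` start method so that it runs without a main guard.

```python
import multiprocessing as mp


def fake_ocr(page):
    # Stand-in for a call on the single easyocr.Reader instance.
    return f"text from {page}"


def gpu_worker(requests, replies):
    """Serve OCR requests until a None sentinel arrives."""
    while True:
        msg = requests.get()
        if msg is None:
            break
        client_id, page = msg
        # Ping the completion queue that belongs to the requesting worker.
        replies[client_id].put(fake_ocr(page))


def page_worker(client_id, page, requests, replies, results):
    """One page worker: send a request, wait for completion, report it."""
    requests.put((client_id, page))
    text = replies[client_id].get()
    results.put((client_id, text))


def run_demo(pages):
    ctx = mp.get_context("fork")  # assumption: POSIX platform
    requests = ctx.Queue()
    results = ctx.Queue()
    replies = [ctx.Queue() for _ in pages]  # one completion queue per worker

    server = ctx.Process(target=gpu_worker, args=(requests, replies))
    server.start()
    clients = [
        ctx.Process(target=page_worker, args=(i, p, requests, replies, results))
        for i, p in enumerate(pages)
    ]
    for c in clients:
        c.start()

    # Drain results before joining so no child blocks on a full pipe.
    collected = dict(results.get(timeout=10) for _ in pages)
    for c in clients:
        c.join()
    requests.put(None)  # tell the OCR worker to exit
    server.join()
    return [collected[i] for i in range(len(pages))]


if __name__ == "__main__":
    print(run_demo(["page1.png", "page2.png"]))
```

In a plugin, the request queue would be what gets attached to options; whether forking is appropriate once a GPU context exists is a separate question this sketch does not address.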
If it ends up being that forcing use_threads is required, plugins are allowed to do things like that (although they should explain this to the user).
With simple profiling code:
The time it takes to construct easyocr.Reader is quite significant. If the code can reuse the Reader object across pages, OCR time could be cut considerably.
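One simple way to get that reuse within a single process is to cache the constructed reader, sketched below. `SlowReader`, `get_reader` and `ocr_pages` are hypothetical names; `SlowReader` stands in for `easyocr.Reader` and just counts how often it is constructed.

```python
import functools


class SlowReader:
    """Stand-in for easyocr.Reader; counts constructions."""

    constructed = 0

    def __init__(self):
        SlowReader.constructed += 1  # the expensive step in the real library

    def readtext(self, image):
        return f"text from {image}"


@functools.lru_cache(maxsize=1)
def get_reader():
    # The first call builds the reader; later calls return the same object.
    return SlowReader()


def ocr_pages(pages):
    return [get_reader().readtext(page) for page in pages]


if __name__ == "__main__":
    print(ocr_pages(["p1.png", "p2.png", "p3.png"]))
    print("readers constructed:", SlowReader.constructed)
```

The cache is per process, so with the default multiprocessing executor each worker process would still build its own reader once.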