ocrmypdf / OCRmyPDF-EasyOCR

OCRmyPDF EasyOCR plugin
MIT License

Share `easyocr.Reader` instance #3

Closed phu54321 closed 5 months ago

phu54321 commented 6 months ago

With some simple profiling code:

        with GPU_SEMAPHORE:
            s0 = time.time()
            reader = easyocr.Reader(languages, gpu=options.gpu)
            s1 = time.time()
            raw_results = reader.readtext(gray)
            s2 = time.time()
            print('reader init: %.1fs, readtext: %.1fs' % (s1 - s0, s2 - s1))

The time it takes to construct `easyocr.Reader` is quite significant. If the code could reuse the reader object across pages, OCR time could be cut considerably.
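One way to picture the reuse idea is a small per-process cache keyed on the reader's arguments. This is only a sketch: `SlowReader` is a stand-in for `easyocr.Reader` so the example runs without EasyOCR installed, and `get_reader` and `ocr_page` are hypothetical helper names, not part of the plugin.

```python
import time
from functools import lru_cache


class SlowReader:
    """Stand-in for easyocr.Reader; the sleep mimics model loading."""

    instances = 0  # counts how many readers were actually constructed

    def __init__(self, languages, gpu=False):
        time.sleep(0.05)
        SlowReader.instances += 1
        self.languages = list(languages)
        self.gpu = gpu

    def readtext(self, image):
        return []


@lru_cache(maxsize=None)
def get_reader(languages, gpu):
    # lru_cache needs hashable arguments, so languages must be a tuple.
    return SlowReader(languages, gpu=gpu)


def ocr_page(image, languages, gpu=False):
    # Hypothetical per-page entry point: reuses the cached reader.
    reader = get_reader(tuple(languages), gpu)
    return reader.readtext(image)
```

A cache like this lives in one process only. With a multiprocessing executor, each worker process would still build its own reader once, which is cheaper than once per page but not a single shared instance.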

BillyCroan commented 5 months ago

Is there any way we could share it across 'documents' as well? I OCR many small single-page documents (receipts).

I know this rabbit hole gets deep quickly, but I don't think I'm alone.

Could we: 1) check if there's an instance running, 2) if not, fork to start one, 3) then use the running instance?

The running instance would then have some code to terminate itself if it hasn't been used in 60 seconds. I'd hope to be able to adjust that timer as well. I'd probably go for 300 seconds in my case, as the script that calls ocrmypdf is rather sloppy and traverses a whole lot of PDFs to find just the few that need it.
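The idle-timeout part of this proposal could look roughly like the loop below. It is a sketch under assumptions: `serve_until_idle` and its parameters are hypothetical names, the request source is a plain `queue.Queue`, and discovering or forking a running instance is left out entirely.

```python
import queue


def serve_until_idle(requests, handle, idle_timeout=60.0):
    """Handle jobs until none arrives for idle_timeout seconds.

    Returns the number of jobs handled before shutting down.
    """
    handled = 0
    while True:
        try:
            # Block for at most idle_timeout seconds waiting for work.
            job = requests.get(timeout=idle_timeout)
        except queue.Empty:
            return handled  # idle for too long: let the instance exit
        handle(job)
        handled += 1
```

Passing `idle_timeout=300` would give the longer window mentioned above.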

jbarlow83 commented 5 months ago

You're welcome to contribute changes/PRs, but otherwise it will have to wait until I get a chance to look at it more seriously. No ETA, and as the README says, this is not production ready. I only have so much time...

You'd probably get better results by spawning a worker process that owns the sole instance of `easyocr.Reader` and having other threads/processes feed its queue. I don't know whether having multiple Readers around makes EasyOCR go any faster.
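The single-owner pattern described here can be sketched as follows. To keep the example short and runnable, a thread stands in for the worker process and `FakeReader` stands in for `easyocr.Reader`; `OcrWorker` and its methods are hypothetical names. A real cross-process version would need process-capable queues and a different way to return results.

```python
import queue
import threading
from concurrent.futures import Future


class FakeReader:
    """Stand-in for easyocr.Reader so the sketch runs without EasyOCR."""

    def readtext(self, image):
        return [f"text from {image}"]


class OcrWorker:
    """Owns the only reader; callers submit images through a queue."""

    _STOP = object()  # sentinel that tells the worker loop to exit

    def __init__(self, make_reader):
        self._jobs = queue.Queue()
        self._thread = threading.Thread(
            target=self._run, args=(make_reader,), daemon=True
        )
        self._thread.start()

    def _run(self, make_reader):
        reader = make_reader()  # constructed once, inside the worker
        while True:
            item = self._jobs.get()
            if item is self._STOP:
                break
            image, future = item
            try:
                future.set_result(reader.readtext(image))
            except Exception as exc:  # report failures to the caller
                future.set_exception(exc)

    def submit(self, image):
        """Queue an image and return a Future for its OCR result."""
        future = Future()
        self._jobs.put((image, future))
        return future

    def close(self):
        self._jobs.put(self._STOP)
        self._thread.join()
```

Callers would use `worker.submit(image).result()` where they currently construct a reader and call `readtext` directly.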

phu54321 commented 5 months ago

I'm actually working on a PR to implement this, but there is a problem:

OCRmyPDF defaults to StandardExecutor (multiprocessing), so local variables aren't shared between processes. To have a single GPU-only worker process, we need a way to share a single Queue or other variables across the different workers.

For example, allowing custom parameters to be added to each worker invocation, or forcing the 'use_threads' option from the plugin side.

jbarlow83 commented 5 months ago

The plugin documentation outlines the context in which hooks are called. You could implement a hook for check_options() to create a GPU worker and a queue to communicate with it, and attach whatever state is needed to options, e.g. options._easyocr_gpu_queue. The options object is passed to workers. Then you will need to change the implementation of EasyOcrEngine to put a message in the queue and signal completion back to workers (perhaps each worker creates a completion queue, sends its request to the OCR worker, and the OCR worker pings the specified completion queue when done).

multiprocessing.Queue is thread and process safe, supports multi-producers, and is capable of marshaling itself across process boundaries, so it should do everything needed.
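The request/completion-queue handshake described above might look like the sketch below. It deliberately uses threads and `queue.Queue` so that it runs anywhere, and `fake_readtext`, `gpu_worker`, `ocr_in_worker`, and `options._easyocr_gpu_thread` are hypothetical names; only `options._easyocr_gpu_queue` comes from the suggestion above. A real plugin would register `check_options` as an OCRmyPDF hook and would need queues that can actually cross process boundaries.

```python
import queue
import threading
from types import SimpleNamespace


def fake_readtext(image):
    """Stand-in for reader.readtext(); a real worker would call EasyOCR."""
    return [f"text from {image}"]


def gpu_worker(requests):
    # A real worker would construct the single reader once, here.
    while True:
        message = requests.get()
        if message is None:  # sentinel: shut down
            break
        image, reply_queue = message
        # "Ping" the requester's completion queue with the result.
        reply_queue.put(fake_readtext(image))


def check_options(options):
    """Hook-style setup: start the worker and attach its queue to options."""
    options._easyocr_gpu_queue = queue.Queue()
    options._easyocr_gpu_thread = threading.Thread(
        target=gpu_worker, args=(options._easyocr_gpu_queue,), daemon=True
    )
    options._easyocr_gpu_thread.start()


def ocr_in_worker(image, options):
    """What a page worker would do instead of building its own reader."""
    reply_queue = queue.Queue(maxsize=1)  # per-request completion queue
    options._easyocr_gpu_queue.put((image, reply_queue))
    return reply_queue.get(timeout=30)


if __name__ == "__main__":
    opts = SimpleNamespace()  # stand-in for the OCRmyPDF options object
    check_options(opts)
    print(ocr_in_worker("page1", opts))
    opts._easyocr_gpu_queue.put(None)
```

One caveat worth checking before porting this to processes: a plain `multiprocessing.Queue` is normally handed to child processes when they are created, so if the options object is pickled and sent to already-running workers, a queue obtained from `multiprocessing.Manager()` may be needed instead. That is an assumption about how the executor passes options, not something verified here.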

If it turns out that forcing use_threads is required, plugins are allowed to do things like that (although they should explain this to the user).