Friskes closed this 2 months ago
This step is done per worker process, so each process has to create its own Reader. Once a Reader is created, the same object is reused for every page that worker processes, so documents with a high page count get optimal throughput. It's only when you have a lot of small documents that easyocr is less optimal.
This is the earliest point at which it can be done without some way to persist worker processes across OCR jobs. That would be a major overhaul of core ocrmypdf and would break compatibility with all other plugins, because the current multiprocessing setup creates workers on demand for a specific task and then disposes of them.
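For illustration, the "one Reader per worker process, reused for every page" pattern can be sketched roughly as below. This is not the plugin's actual code: `StandInReader`, `get_reader` and `ocr_page` are hypothetical names, and the stand-in class replaces `easyocr.Reader` so the sketch runs without the library or any model download.

```python
from functools import lru_cache


class StandInReader:
    """Hypothetical stand-in for easyocr.Reader (assumption, not the real class).

    It only counts how many times it has been constructed, so the effect
    of caching is visible.
    """

    instances_created = 0

    def __init__(self, languages, use_gpu):
        StandInReader.instances_created += 1
        self.languages = list(languages)
        self.use_gpu = use_gpu

    def readtext(self, page):
        # A real reader would run OCR here; this just echoes the page name.
        return f"text from {page}"


@lru_cache(maxsize=None)
def get_reader(languages, use_gpu):
    # The cache lives in module state, so every worker process ends up
    # with its own Reader, built the first time that process needs one.
    # `languages` must be hashable (a tuple) to serve as a cache key.
    return StandInReader(languages, use_gpu)


def ocr_page(page, languages=("en",), use_gpu=False):
    # Hypothetical per-page entry point: the first call in a process pays
    # the construction cost, later calls reuse the cached Reader.
    reader = get_reader(tuple(languages), use_gpu)
    return reader.readtext(page)


# Three pages handled by the same process share a single Reader.
results = [ocr_page(f"page-{n}") for n in range(1, 4)]
```

With this shape the startup cost is paid once per worker rather than once per page, which is why it amortizes well over long documents and poorly over many short ones.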
I noticed that creating easyocr.Reader(languages, use_gpu) in def _ocr_process takes more than a second when recognition starts. It would be good to do this in advance, if possible, so that an extra second is not wasted at the time of the recognition request.