Investigate improving recognition performance for a single image using threads

robertknight / tesseract-wasm

JS/WebAssembly build of the Tesseract OCR engine for use in browsers and Node

BSD 2-Clause "Simplified" License


There are several ways that threads could be used to speed up recognition:

For a multi-page document, create a pool of OCRClients and distribute the page images across them. This can already be done today, but it does not help when the document is a single image, and each worker has to duplicate a lot of resources, so it is not memory-efficient.
Split a large image into smaller ones and distribute the sub-images to a pool of OCRClients. This is also possible today, but it again duplicates a lot of resources, and it requires working out how to split up the image.
Tesseract supports OpenMP, but according to comments in the GitHub repo it is not very effective at present, probably because it parallelizes at too low a level.
Since text recognition is the expensive step, a middle ground would be to do detection in one thread, then use multiple threads to recognize the detected text lines in parallel. There is a comment in the Tesseract code to the effect that this might be a good win.
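
To make the first option concrete, here is a minimal sketch of sharing a multi-page document across a pool of clients. It is not tesseract-wasm code: `PageRecognizer` and `recognizePages` are hypothetical names, and a real implementation would wrap whatever image-loading and text-extraction calls OCRClient provides behind `recognize`.

```typescript
// Hypothetical minimal client shape (an assumption for this sketch, not the
// real OCRClient API): takes one page image and resolves to its text.
interface PageRecognizer<Image> {
  recognize(image: Image): Promise<string>;
}

// Recognize all pages using a fixed pool of clients. Each client pulls the
// next unprocessed page as soon as it is free, so one slow page does not
// hold up the rest. Results are returned in page order.
async function recognizePages<Image>(
  clients: PageRecognizer<Image>[],
  pages: Image[],
): Promise<string[]> {
  const results: string[] = new Array(pages.length);
  let next = 0; // index of the next page to hand out

  async function drain(client: PageRecognizer<Image>): Promise<void> {
    while (next < pages.length) {
      // Claiming an index is synchronous, so two clients never get the same page.
      const index = next++;
      results[index] = await client.recognize(pages[index]);
    }
  }

  await Promise.all(clients.map(drain));
  return results;
}
```

Every client in the pool still needs its own copy of the model, which is the memory cost this option carries.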

Since text recognition is the expensive step, a middle ground would be to do detection in one thread, then use multiple threads to recognize the detected text lines in parallel.

Rough plan for this:

Add an option to OCRClient to use auxiliary workers for text recognition. The option trades memory usage for speed. One of the workers would be designated the main worker; the others would be recognition workers. The recognition workers might need to be created on demand.
loadModel would load the model into all workers or, if recognition workers are created on demand, keep a copy of the model data to transfer to each recognition worker when it is created.
loadImage would load the image into the main worker.
OCRClient's text recognition methods would query the main worker to fetch text line images (TBD: full color? greyscale? binarized?). Batches of line images would then be distributed to the recognition workers, which would run recognition on them, treating each input image as a single line, and return the results (text + bounding boxes). The coordinates would then be adjusted to reflect each line's position in the original image.
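
The last step of the plan (batching line images, sending batches to recognition workers, and mapping the results back to page coordinates) could look roughly like this. Every name here (`Rect`, `WordBox`, `LineImage`, `LineRecognizer`, `recognizeInParallel`) is hypothetical and invented for this sketch. It assumes each recognition worker returns boxes relative to the top-left corner of the line image it was given.

```typescript
// All types below are assumptions for this sketch, not tesseract-wasm APIs.
interface Rect { left: number; top: number; right: number; bottom: number }
interface WordBox { text: string; rect: Rect }
// A cropped text line plus its position in the original page image.
interface LineImage<Pixels> { pixels: Pixels; rect: Rect }
interface LineRecognizer<Pixels> {
  // One WordBox[] per input line, with coordinates relative to that line image.
  recognizeLines(lines: Pixels[]): Promise<WordBox[][]>;
}

// Split items into consecutive batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("batch size must be at least 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Shift a line-relative box by the line's offset within the page.
function toPageCoords(box: WordBox, line: Rect): WordBox {
  const { left, top, right, bottom } = box.rect;
  return {
    text: box.text,
    rect: {
      left: left + line.left, top: top + line.top,
      right: right + line.left, bottom: bottom + line.top,
    },
  };
}

// Distribute batches of lines across workers; results keep the input line order.
async function recognizeInParallel<Pixels>(
  workers: LineRecognizer<Pixels>[],
  lines: LineImage<Pixels>[],
  batchSize: number,
): Promise<WordBox[]> {
  const batches = chunk(lines, batchSize);
  const perBatch: WordBox[][] = new Array(batches.length);
  let next = 0; // index of the next batch to hand out

  async function drain(worker: LineRecognizer<Pixels>): Promise<void> {
    while (next < batches.length) {
      const index = next++;
      const batch = batches[index];
      const perLine = await worker.recognizeLines(batch.map((l) => l.pixels));
      const words: WordBox[] = [];
      perLine.forEach((lineWords, i) => {
        for (const w of lineWords) words.push(toPageCoords(w, batch[i].rect));
      });
      perBatch[index] = words;
    }
  }

  await Promise.all(workers.map(drain));
  const all: WordBox[] = [];
  for (const words of perBatch) all.push(...words);
  return all;
}
```

The batch size is a tuning knob: larger batches mean fewer messages between workers, while smaller batches spread the load more evenly when line lengths vary.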


Investigate improving recognition performance for a single image using threads #27