robertknight / tesseract-wasm

JS/WebAssembly build of the Tesseract OCR engine for use in browsers and Node
https://robertknight.github.io/tesseract-wasm/
BSD 2-Clause "Simplified" License

Investigate improving recognition performance for a single image using threads #27

Open robertknight opened 2 years ago

robertknight commented 2 years ago

There are several ways that threads could be used to speed up recognition:

  1. For a multi-page document, create a pool of OCRClients and distribute the page images across them. This can already be done today, but it doesn't help when there is only a single image, and it duplicates a lot of resources in each worker, so it is not memory-efficient.
  2. Split a large image into smaller ones and distribute the sub-images to a pool of OCRClients. This is also possible today, but it again results in a lot of resource duplication, and it requires working out how to split up the image.
  3. Tesseract supports OpenMP, but according to comments in the GitHub repo, it isn't very effective at present, probably because the parallelism is too low-level.
  4. Since text recognition is the expensive step, a middle ground would be to do detection in one thread, then use multiple threads to recognize the detected text lines in parallel. There is a comment in the Tesseract code to the effect that this might be a good win.
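
For reference, the pool-based distribution in options 1 and 2 could look roughly like the sketch below. The `Recognizer` interface and `recognizeAll` helper are hypothetical names made up for illustration; they are not part of tesseract-wasm. In practice each `Recognizer` would presumably wrap one OCRClient with its own worker and loaded model, which is exactly where the resource duplication comes from.

```typescript
// Hypothetical minimal interface, not part of tesseract-wasm.
// A real implementation would wrap one OCRClient per instance.
interface Recognizer<I> {
  recognize(image: I): Promise<string>;
}

// Distribute `images` across `pool`. Each worker claims the next
// unprocessed image as soon as it finishes its current one, so faster
// workers naturally take on more items. Results keep the input order.
async function recognizeAll<I>(
  pool: Recognizer<I>[],
  images: I[],
): Promise<string[]> {
  if (pool.length === 0) {
    throw new Error("pool must contain at least one recognizer");
  }
  const results = new Array<string>(images.length);
  let next = 0; // Index of the next unclaimed image.

  const runWorker = async (worker: Recognizer<I>): Promise<void> => {
    // The claim (`next++`) runs synchronously on the calling thread,
    // so two workers can never claim the same index.
    while (next < images.length) {
      const i = next++;
      results[i] = await worker.recognize(images[i]);
    }
  };

  await Promise.all(pool.map(runWorker));
  return results;
}
```

The same helper would work for sub-images (option 2) or for text-line regions (option 4), but it does nothing about sharing the loaded model between workers, which is the part that would need real thread support.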

robertknight commented 2 years ago

Since text recognition is the expensive step, a middle ground would be to do detection in one thread, then use multiple threads to recognize the detected text lines in parallel.

Rough plan for this: