OCR performance degrades in relation to larger image sizes

ryanbarr commented 2 years ago

Is your feature request related to a problem? Please describe.

As image sizes increase, OCR performance degrades. This is because the OCR library, Tesseract.js, relies on both WebAssembly and an outdated version of Tesseract.

Describe the solution you'd like

The ideal solution is to end up using Tesseract v5 with the latest "fast" training data provided at tessdata_fast.

Describe alternatives you've considered

The options for proceeding are as follows:

1. Wait for tesseract.js to be updated. This is the ideal scenario, but an unlikely one. At the time of writing, the only changes made in the last year have been one to implement three patch-level pull requests from contributors and one to add an Ethereum wallet to the FUNDING.yml.
2. Implement a patch from a tesseract.js fork. This is a low-effort fix that would pin our dependency to a forked version. There is a PR with a viable solution and proven performance gains: naptha/tesseract.js-core#18
3. Use a compiled binary and execute it from the command line. We can bundle the compiled Tesseract binary with the application, execute it with Node's child_process API, and make the call asynchronous with util.promisify. The downside is having to bundle separate binaries for Unix and Windows, then write a shell script that calls the correct binary based on the OS.
4. Compile a binary based on target and expose C++ functions with Node-API. Using Node-API, we can write a C++ wrapper for the Tesseract API functions we need and expose those functions to Node, so that they can be called from JavaScript. This would create a maintenance burden: manually updating Tesseract versions and dynamically compiling the correct Tesseract binaries for our build target. An example of this approach is node-native-ocr, which, if it did not have a hard dependency on Webpack, would be a perfect solution for our needs.
5. Write a new WebAssembly library that replaces the Tesseract.js functionality. While this would be the most significant lift, it would result in a freshly maintained library that can be open sourced.
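
For the compiled-binary option, a minimal sketch of what the child_process and util.promisify approach might look like is shown here. The `bin/<platform>/` directory layout and the names `resolveTesseractBinary`, `buildTesseractArgs`, and `recognize` are hypothetical and not taken from this repository; the command-line arguments follow the standard Tesseract CLI form (`tesseract <image> stdout -l <lang>`).

```typescript
import { execFile } from "child_process";
import * as path from "path";
import { promisify } from "util";

// Promise-returning version of execFile, so callers can use async/await.
const execFileAsync = promisify(execFile);

// Assumption: binaries are bundled under <appRoot>/bin/<platform>/.
// This directory layout is hypothetical.
export function resolveTesseractBinary(
  appRoot: string,
  platform: string = process.platform,
): string {
  const exe = platform === "win32" ? "tesseract.exe" : "tesseract";
  return path.join(appRoot, "bin", platform, exe);
}

export interface OcrOptions {
  lang?: string;        // language code, defaults to "eng"
  tessdataDir?: string; // directory containing the .traineddata files
}

// Builds the CLI argument list. Passing "stdout" as the output base
// makes the Tesseract CLI print the recognized text to stdout.
export function buildTesseractArgs(
  imagePath: string,
  options: OcrOptions = {},
): string[] {
  const args = [imagePath, "stdout", "-l", options.lang ?? "eng"];
  if (options.tessdataDir) {
    args.push("--tessdata-dir", options.tessdataDir);
  }
  return args;
}

// Runs the bundled binary and resolves with the recognized text.
export async function recognize(
  appRoot: string,
  imagePath: string,
  options: OcrOptions = {},
): Promise<string> {
  const { stdout } = await execFileAsync(
    resolveTesseractBinary(appRoot),
    buildTesseractArgs(imagePath, options),
    { maxBuffer: 16 * 1024 * 1024 }, // allow large OCR output
  );
  return stdout;
}
```

Selecting the binary from `process.platform` inside Node is one possible way to avoid a separate shell script; whether that fits the application's packaging is an open question.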

Additional context N/A

ryanbarr commented 2 years ago

I have made a call for maintainers on Tesseract.js in relation to the first option: naptha/tesseract.js#623

Balearica commented 2 years ago

@ryanbarr I created the pull request to update to Tesseract 5, and I have a significantly faster version of Tesseract.js running on my document digitization website, scribeocr.com, which implements other changes as well. Feel free to try out the site to see whether that version meets your performance expectations, and I can explain further (it's not quite a drop-in replacement for Tesseract.js). To select between the Tesseract Legacy and LSTM engines, use Info -> Optional Features -> Advanced Recognition Options. I contacted a maintainer of the Tesseract.js project regarding their intentions with the project, so we'll see if they respond.

Additionally, I would consider using the Tesseract Legacy engine (oem value 0) rather than Tesseract LSTM. In my testing, both engines had similar accuracy when recognizing high-quality text; however, the LSTM engine (the default) takes significantly longer to run.
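
To check the Legacy-versus-LSTM trade-off on your own images, a small timing harness along these lines could help. This is only a sketch: `Recognizer` is a hypothetical callback that you would wire to whichever OCR backend is in use (for example, something that passes `--oem` to the Tesseract CLI), and `timeEngine` and `compareEngines` are illustrative names. The engine mode numbers are the standard Tesseract ones (0 = Legacy only, 1 = LSTM only). Note that the tessdata_fast models are LSTM-only, so the Legacy engine needs traineddata that includes the legacy model.

```typescript
// Tesseract OCR Engine Mode values (0 = Legacy only, 1 = LSTM only).
const OEM = { LEGACY: 0, LSTM: 1 } as const;
type EngineMode = (typeof OEM)[keyof typeof OEM];

// Hypothetical callback: runs OCR on one image with the given engine mode.
type Recognizer = (imagePath: string, oem: EngineMode) => Promise<string>;

interface EngineTiming {
  oem: EngineMode;
  text: string;
  elapsedMs: number;
}

// Times a single recognition run with wall-clock milliseconds.
async function timeEngine(
  recognize: Recognizer,
  imagePath: string,
  oem: EngineMode,
): Promise<EngineTiming> {
  const start = Date.now();
  const text = await recognize(imagePath, oem);
  return { oem, text, elapsedMs: Date.now() - start };
}

// Runs both engines one after the other (so they do not compete for CPU)
// and reports whether they produced the same text after trimming.
async function compareEngines(recognize: Recognizer, imagePath: string) {
  const legacy = await timeEngine(recognize, imagePath, OEM.LEGACY);
  const lstm = await timeEngine(recognize, imagePath, OEM.LSTM);
  return {
    legacy,
    lstm,
    sameText: legacy.text.trim() === lstm.text.trim(),
  };
}
```

A single run per engine is noisy; repeating the comparison over several representative images would give a more trustworthy picture.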

ryanbarr commented 2 years ago

@Balearica Interesting! The performance on your website definitely feels better. I ran into an issue after a few runs where it wasn't allowing me to run any more tests and was throwing an XML error in the console though -- just a heads up.

I noticed you were added as a maintainer. I'll wait for the update to be merged into Tesseract.js, but I'm happy to entertain a direct implementation of Tesseract if you believe the update will take more than a week to be released.

ryanbarr / harvest-monster

OCR performance degrades in relation to larger image sizes #48