Closed ryanbarr closed 2 years ago
I have made a call for maintainers on Tesseract.js in relation to the first option: naptha/tesseract.js#623
@ryanbarr I created the pull request to update to Tesseract 5, and I have a significantly faster version of Tesseract.js running on my document digitization website, scribeocr.com, which implements other changes as well. Feel free to try out the site to see whether my version meets your performance expectations, and I can explain further (it's not quite a drop-in replacement for Tesseract.js). To select between the Tesseract Legacy and LSTM engines, use Info -> Optional Features -> Advanced Recognition Options. I contacted a maintainer of the Tesseract.js project regarding their intentions with the project, so we'll see if they respond.
Additionally, I would consider using the Tesseract Legacy engine (oem value 0) rather than Tesseract LSTM. In my testing, both engines had similar accuracy when recognizing high-quality text; however, the LSTM engine (the default) takes significantly longer to run.
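For reference, here is a minimal sketch of how the engine choice maps onto the Tesseract command line, where the same setting is exposed as the --oem flag. The helper name buildTesseractArgs and the file name scan.png are made up for illustration; only the flag values themselves come from the Tesseract CLI.

```javascript
// Engine mode values accepted by the Tesseract CLI's --oem flag.
const OEM = Object.freeze({
  TESSERACT_ONLY: 0,          // Legacy engine
  LSTM_ONLY: 1,               // Neural-net (LSTM) engine
  TESSERACT_LSTM_COMBINED: 2, // Legacy + LSTM
  DEFAULT: 3,                 // Based on what the traineddata supports
});

// Hypothetical helper: builds the argument list for a `tesseract` CLI call.
function buildTesseractArgs(imagePath, { oem = OEM.DEFAULT, lang = "eng" } = {}) {
  if (!Object.values(OEM).includes(oem)) {
    throw new RangeError(`Unsupported oem value: ${oem}`);
  }
  // Using "stdout" as the output base makes tesseract print the recognized text.
  return [imagePath, "stdout", "--oem", String(oem), "-l", lang];
}

// Prints: scan.png stdout --oem 0 -l eng
console.log(buildTesseractArgs("scan.png", { oem: OEM.TESSERACT_ONLY }).join(" "));
```

One caveat, as far as I know: the Legacy engine needs traineddata that includes the legacy model, and the tessdata_fast files are LSTM-only, so oem 0 would likely require the files from the main tessdata repository instead.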
@Balearica Interesting! The performance on your website definitely feels better. I ran into an issue after a few runs where it wasn't allowing me to run any more tests and was throwing an XML error in the console though -- just a heads up.
I noticed you were added as a maintainer. I'll wait for the update to be merged to Tesseract.js but am happy to entertain a direct implementation of Tesseract if you believe the update will take more than a week to be released.
Is your feature request related to a problem? Please describe. As image sizes increase, OCR performance degrades. This is because the OCR library provided by Tesseract.js relies on both WebAssembly and an outdated version of Tesseract.
Describe the solution you'd like The ideal solution is to end up using Tesseract v5 with the latest "fast" training data provided at tessdata_fast.
Describe alternatives you've considered The options for proceeding are as follows:
1. Wait for tesseract.js to be updated. This is the ideal but unlikely scenario. At the time of writing, the only changes made in the last year have been one to implement three patch-level pull requests from contributors and one to add an Ethereum wallet to the FUNDING.yml.
2. Use a tesseract.js fork. This is a low-effort fix that would pin our version to a forked version. There is a PR with a viable solution and proven performance gains: naptha/tesseract.js-core#18
3. Call the Tesseract binary directly, wrapped with util.promisify. The downside here is having to bundle binaries for Unix vs Windows, then write a shell script that calls the correct binary based on OS.
Additional context N/A
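To make the direct-binary option more concrete, here is a rough sketch of what it might look like. It swaps the shell script for a platform check done in Node itself, and it assumes a hypothetical layout where the bundled binaries live under bin/<platform>/. The function names and paths are illustrative only, not anything that exists in the project today.

```javascript
const { execFile } = require("node:child_process");
const path = require("node:path");
const { promisify } = require("node:util");

// promisify(execFile) resolves with an object holding stdout and stderr.
const execFileAsync = promisify(execFile);

// Assumption: binaries are bundled under ./bin/<platform>/ (hypothetical layout).
const BIN_DIR = "bin";

// Picks the bundled binary for the given platform (defaults to the current OS).
function resolveTesseractBinary(platform = process.platform) {
  switch (platform) {
    case "win32":
      return path.join(BIN_DIR, "win32", "tesseract.exe");
    case "linux":
    case "darwin":
      return path.join(BIN_DIR, platform, "tesseract");
    default:
      throw new Error(`No bundled Tesseract binary for platform: ${platform}`);
  }
}

// Runs OCR on one image and returns the recognized text.
async function recognize(imagePath, lang = "eng") {
  const binary = resolveTesseractBinary();
  // "stdout" as the output base makes tesseract print the text instead of writing a file.
  const { stdout } = await execFileAsync(binary, [imagePath, "stdout", "-l", lang]);
  return stdout.trim();
}
```

Usage would be along the lines of `recognize("scan.png").then(console.log)`. Using execFile rather than exec avoids passing the image path through a shell, which matters if file names come from user input.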