naptha / tesseract.js

Pure Javascript OCR for more than 100 Languages 📖🎉🖥
http://tesseract.projectnaptha.com/
Apache License 2.0

Tesseract - Running in Browser Console #918

Closed. DigitalGyspy closed this issue 2 months ago

DigitalGyspy commented 2 months ago

Sorry, I know this is not an issue at all, so I apologise for posting here.

I am trying to use Tesseract within the browser console. The site is not my own, so I cannot embed external scripts for security reasons, which is why I run my code in the browser console. It is only for personal use, with the intention of reading data from a canvas image.

I embedded the JS from 'https://cdn.jsdelivr.net/npm/tesseract.js@5.0.5/dist/tesseract.min.js' and then ran the following, taken from the examples page on the Tesseract GitHub pages:

(async () => {
  const worker = await createWorker('eng');
  const ret = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(ret.data.text);
  await worker.terminate();
})();

This produces the following error (also because of security issues): 'VM11097:6 Uncaught (in promise) Uncaught NetworkError: Failed to execute 'importScripts' on 'WorkerGlobalScope': The script at 'https://cdn.jsdelivr.net/npm/tesseract.js@v5.0.5/dist/worker.min.js' failed to load.'

Is it possible to compile the Tesseract JS files into a single block of code? Essentially, I want to bypass anything which involves referencing an external source.

Many thanks, greatly appreciated.

Balearica commented 2 months ago

While I can't say that bundling Tesseract.js into a single file is impossible (somebody could probably figure it out with enough time), it is not possible with our current project structure/build system, and I cannot imagine that making it work would be easier than finding an alternative way to accomplish whatever you're trying to do.

Tesseract.js requires 4 files in total: the main thread script (tesseract.min.js), the worker thread script (worker.min.js), the WebAssembly build of Tesseract (tesseract-core-simd.wasm.js), and the language data (eng.traineddata.gz in this case). Therefore, you would need to figure out how to embed all 4 files in a single file.
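For illustration only: the main thread script is whatever you load yourself, and the other three files are located through the `workerPath`, `corePath`, and `langPath` options of `createWorker`. The sketch below shows where each file is configured. It does not produce a single-file bundle, and it will not by itself get around a site's security policy. The helper name `buildLocalPaths`, the `https://example.com/tesseract` base URL, and the `core`/`lang` directory layout are all assumptions made for this example.

```javascript
// Hypothetical helper (not part of Tesseract.js): builds the createWorker
// options that say where the worker script, the core (WebAssembly) files,
// and the language data are fetched from.
// Assumed layout under baseUrl:
//   worker.min.js
//   core/   (tesseract-core-simd.wasm.js and the other core builds)
//   lang/   (eng.traineddata.gz)
function buildLocalPaths(baseUrl) {
  const root = baseUrl.replace(/\/+$/, ''); // drop any trailing slashes
  return {
    workerPath: `${root}/worker.min.js`,
    corePath: `${root}/core`,
    langPath: `${root}/lang`,
  };
}

// Usage, assuming tesseract.min.js is already loaded so `Tesseract` exists:
// const worker = await Tesseract.createWorker(
//   'eng',
//   1, // OEM value for the LSTM engine
//   buildLocalPaths('https://example.com/tesseract')
// );
```

Every path still points at a separate file that the browser has to be allowed to fetch, which is why the four files would need to be embedded some other way to avoid external references entirely.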