simonw / tools

Assorted tools
https://tools.simonwillison.net
Apache License 2.0
169 stars, 18 forks

Version of OCR that can run entirely offline #2

Open simonw opened 5 months ago

simonw commented 5 months ago

Currently https://tools.simonwillison.net/ocr loads assets from a CDN.

A version that can run offline would be fantastic. It would be a tiny bit tricky to get versions of PDF.js and Tesseract.js (and their supporting files) that work like that, but it should absolutely be possible.

Ideally offer this as a zip file for people to download and run locally.

Could it be done such that it works from opening an HTML file in a browser, rather than needing a localhost web server? I don't think that works right now, but it may be possible with a bit more thought or some weird bundler magic.
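One form the "bundler magic" could take is inlining each external script into the HTML file, so the page no longer fetches them by URL. The sketch below is only an illustration of that idea: `inlineScripts` and the `sources` map are hypothetical names, and it assumes plain `<script src="..."></script>` tags. It does not cover web workers, the WASM binary, or language data, which the libraries load by URL at run time and which some browsers restrict for `file://` pages.

```typescript
// Replace <script src="..."></script> tags with inline <script> blocks.
// `sources` maps each src attribute value to that file's contents
// (a real build step would read these from disk).
function inlineScripts(html: string, sources: Record<string, string>): string {
  const tag = /<script\s+src="([^"]+)"\s*><\/script>/g;
  return html.replace(tag, (match: string, src: string): string => {
    const code = sources[src];
    if (code === undefined) return match; // unknown script: leave the tag untouched
    // A literal "</script" inside the code would close the tag early, so escape it.
    const safe = code.replace(/<\/script/gi, "<\\/script");
    return `<script>${safe}</script>`;
  });
}
```

Using a replacer function rather than a replacement string matters here: it keeps any `$` sequences in the inlined code from being interpreted as substitution patterns.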

steren commented 5 months ago

For maintenance and hosting simplicity, consider vendoring these dependencies.
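As a rough sketch of what vendoring could look like here: copies of the PDF.js and Tesseract.js files would sit in a local directory, and each library would be pointed at them instead of the CDN. The directory layout and file names below are assumptions, and `offlineAssets` is a hypothetical helper. `workerSrc`, `workerPath`, `corePath`, and `langPath` are the option names those libraries document, but the exact call shape depends on the library version in use.

```typescript
// Hypothetical layout: everything lives under one local vendor directory.
interface OfflineAssets {
  pdfWorkerSrc: string;
  tesseract: { workerPath: string; corePath: string; langPath: string };
}

function offlineAssets(vendorDir: string): OfflineAssets {
  // Strip trailing slashes so joins never produce "vendor//file".
  const base = vendorDir.replace(/\/+$/, "");
  return {
    pdfWorkerSrc: `${base}/pdfjs/pdf.worker.min.js`, // assumed file name
    tesseract: {
      workerPath: `${base}/tesseract/worker.min.js`, // assumed file name
      corePath: `${base}/tesseract/tesseract-core.wasm.js`, // assumed file name
      langPath: `${base}/tesseract/lang`, // directory holding the language data
    },
  };
}

const assets = offlineAssets("./vendor/");

// In the page, the values would be handed to the libraries roughly like this
// (not executed here, since neither library is loaded in this sketch):
//   pdfjsLib.GlobalWorkerOptions.workerSrc = assets.pdfWorkerSrc;
//   const worker = await Tesseract.createWorker("eng", 1, assets.tesseract);
```

Keeping every path in one place makes it easier to check that nothing still points at a CDN.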

Lewiscowles1986 commented 5 months ago

> It would be a tiny bit tricky to get versions of PDF.js and Tesseract.js (and their supporting files) that work like that

Then link to those files.

I think I must be missing something here. Is this like polyfill.js where the remote CDN is detecting the browser and serving slightly altered payloads?

matsklevstad commented 1 month ago

Is it possible to create a version that can handle images that are upside down or rotated? If so, how?
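One possible approach, sketched here under assumptions rather than taken from the tool itself: run OCR on the image at 0, 90, 180, and 270 degrees (for example by redrawing it on a canvas), then keep whichever pass reports the highest confidence. The `RotationResult` shape and `pickBestRotation` are hypothetical names; the confidence value would come from the OCR engine's result for each pass, and the sample numbers below are made up.

```typescript
// One OCR pass over the image at a given rotation.
interface RotationResult {
  degrees: number;    // rotation applied before recognition
  text: string;       // recognized text for that rotation
  confidence: number; // engine-reported confidence, higher is better
}

// Return the pass with the highest confidence, or undefined if there were none.
// On a tie, the earliest pass in the list wins.
function pickBestRotation(results: RotationResult[]): RotationResult | undefined {
  let best: RotationResult | undefined = undefined;
  for (const r of results) {
    if (best === undefined || r.confidence > best.confidence) {
      best = r;
    }
  }
  return best;
}

// Made-up sample data standing in for four real OCR passes.
const passes: RotationResult[] = [
  { degrees: 0, text: "", confidence: 12 },
  { degrees: 90, text: "l|", confidence: 30 },
  { degrees: 180, text: "Invoice 2024", confidence: 91 },
  { degrees: 270, text: "|l", confidence: 28 },
];
const best = pickBestRotation(passes);
```

The cost is four recognition passes per image, so this trades speed for not having to know the orientation in advance.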

Lewiscowles1986 commented 1 month ago

@matsklevstad that feels like a valid but separate issue from getting this running offline.