OCR too - Githubissues

gedw99 commented 2 months ago

I think I can add OCR too.

johbar commented 2 months ago

I haven't needed OCR so far. However, my "corpus" seems to contain some scanned PDFs with no readable textual content that would benefit from OCR. I know there is Tesseract and a decent Go wrapper available, but I guess there is some complexity in getting the layout right when a document contains (real) text and OCR'able graphics etc.

However, if you or I were to add Tesseract, I would prefer to have it as a build-time option, i.e. bound to a build tag, or as some kind of weak/optional dependency (like wv, which might be installed or not), so users can decide whether to use it or not. I don't want to bloat this service or reduce its performance unnecessarily.

If you go forward implementing this, I will definitely consider merging it. But please be aware that this is not (yet) a community project. I have no experience in managing something like this.

gedw99 commented 2 months ago

Hey @johbar

Yep no pressure... Community project I know...

I agree that a build tag is needed because it's got CGO and some nasty dependencies.

Or perhaps make the OCR its own binary (or WASM) and then use it via STDIO. NATS or something else can then talk to it?

If you have any ideas on this, I am happy to fit it in with how you do things.

johbar commented 2 months ago

I actually started implementing this.

Some design considerations:

Using the Tesseract CLI after testing whether it is in $PATH is the easiest way to avoid build-time dependencies. It's not the most performant though.
Using a build tag to swap the CLI-based implementation for an API-based one that links dynamically through a Go wrapper seems appropriate. Not too much effort, just a bit confusing and unwieldy with all the build tags (go-pdfium has some, too).
- Users can then decide to use it when building TES.
- Thread-safety could be a problem though. Might need a mutex or multiple instances of the Tesseract BaseAPI, which might impact performance once again.
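
A minimal sketch of the CLI-based variant, not the code from TES: it looks up `tesseract` in $PATH at call time and pipes the image through the CLI's `stdin`/`stdout` pseudo file names. The names `OCR`, `cliArgs`, and `ErrNoTesseract` are hypothetical. Since every call spawns its own process, this variant needs no mutex; the thread-safety concern only applies to the linked BaseAPI variant.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"os/exec"
)

// ErrNoTesseract is returned when no tesseract binary is found in $PATH.
var ErrNoTesseract = errors.New("tesseract not found in $PATH")

// cliArgs builds the argument list that makes the Tesseract CLI read an
// image from stdin and write the recognized text to stdout.
// langs uses Tesseract's "-l" syntax, e.g. "eng" or "deu+eng".
func cliArgs(langs string) []string {
	return []string{"stdin", "stdout", "-l", langs}
}

// OCR pipes the image bytes through the Tesseract CLI and returns the text.
func OCR(ctx context.Context, img []byte, langs string) (string, error) {
	bin, err := exec.LookPath("tesseract")
	if err != nil {
		return "", ErrNoTesseract
	}
	cmd := exec.CommandContext(ctx, bin, cliArgs(langs)...)
	cmd.Stdin = bytes.NewReader(img)
	var out, errBuf bytes.Buffer
	cmd.Stdout = &out
	cmd.Stderr = &errBuf
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("tesseract: %w: %s", err, errBuf.String())
	}
	return out.String(), nil
}

func main() {
	// With no image data this is expected to fail; it only demonstrates
	// the availability check and the error path.
	_, err := OCR(context.Background(), nil, "eng")
	fmt.Println("result:", err)
}
```

A build tag could then select between this file and a wrapper-based one that exposes the same `OCR` signature.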

Considerations and experiences concerning how to use and integrate OCR:

Heuristic: When a page contains no text, feed it to Tesseract (when it is available).
- Problems here: Extracting images from PDFs is non-trivial. But rendering the whole page as JPEG/PNG is feasible with PDFium.
- Could give pdfcpu (pure Go) a try for extracting images. Would have to load the PDF into memory a second time for this though...
Implement a separate REST endpoint for OCR supporting JPEG, TIFF, GIF, PNG etc. and PDF. Clients can send/request files there.
In any case: Tesseract needs language-specific (and script-specific) models to be installed and a config property stating which languages to use. An operator running TES or the client using the REST API would have to configure this. So some knowledge about the material to be processed is needed upfront.
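
The heuristic and the language setting could be sketched roughly like this. All names (`Config`, `LangArg`, `hasText`, `shouldOCR`) are hypothetical and not part of TES; the only Tesseract-specific assumption is that several languages are joined with `+` for the `-l` option and that `eng` is the default when none is given.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Config holds operator-supplied OCR settings (hypothetical structure).
type Config struct {
	Languages []string // Tesseract model names, e.g. "deu", "eng"
}

// LangArg joins the configured languages into the value for Tesseract's
// -l option. It falls back to "eng", which is Tesseract's own default.
func (c Config) LangArg() string {
	if len(c.Languages) == 0 {
		return "eng"
	}
	return strings.Join(c.Languages, "+")
}

// hasText reports whether the extracted page text contains at least one
// letter or digit; whitespace and form feeds alone do not count.
func hasText(s string) bool {
	for _, r := range s {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			return true
		}
	}
	return false
}

// shouldOCR applies the heuristic: OCR a page only when an OCR engine is
// available and regular text extraction yielded nothing readable.
func shouldOCR(pageText string, ocrAvailable bool) bool {
	return ocrAvailable && !hasText(pageText)
}

func main() {
	cfg := Config{Languages: []string{"deu", "eng"}}
	fmt.Println(cfg.LangArg(), shouldOCR(" \n", true), shouldOCR("Hello", true))
}
```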

Any thought on any of this is welcome.

johbar commented 2 months ago

Just remembered a second use case besides scanned docs: PDFs with embedded fonts that have no Unicode char mappings. It is impossible in these cases to get anything useful from the text; PDFium just returns garbage. OCRing the pages would deliver better results. But here again the complexity is in deciding (in an automated way) when to use it. And in choosing the right language(s).
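
One conceivable way to automate that decision is to measure how much of the extracted text looks unusable. The sketch below is only an assumption about what such "garbage" looks like (replacement characters, private-use code points, control characters); the function names and the threshold are hypothetical and would need tuning against real documents.

```go
package main

import (
	"fmt"
	"unicode"
)

// suspiciousRatio returns the share of non-whitespace runes that are
// U+FFFD, private-use code points, or control characters.
func suspiciousRatio(s string) float64 {
	var total, bad int
	for _, r := range s {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if r == unicode.ReplacementChar || unicode.Is(unicode.Co, r) || unicode.IsControl(r) {
			bad++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(bad) / float64(total)
}

// looksGarbled flags text whose suspicious share exceeds the threshold.
// The threshold is an arbitrary illustration, not a tested value.
func looksGarbled(s string, threshold float64) bool {
	return suspiciousRatio(s) > threshold
}

func main() {
	fmt.Println(looksGarbled("Hello world", 0.3))
	fmt.Println(looksGarbled("\uE000\uE001\uFFFDa", 0.3))
}
```

Fonts that map glyphs to wrong but valid letters would slip through such a check, so it can only catch part of the problem.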

gedw99 commented 2 months ago

Hey @johbar

Read your design considerations…

They all seem really good to me.

I like how we can just run off the tesseract binary sitting in $PATH. Nice and easy.

If you want me to give it a try etc., just shout.

Thanks for considering this.

gedw99 commented 2 months ago

If you want, I can help make testing examples etc.

For example, a Makefile that works cross-platform...

Make an issue for this...

gedw99 commented 2 months ago

Which package are you using for the OCR?

Not too bad. But no support for the C binary being external, I think...

https://github.com/otiai10/gosseract

https://github.com/otiai10/ocrserver

johbar commented 2 months ago

Have a look at the feature branch.

I implemented a Tesseract CLI wrapper myself and added some simple drop-in replacement implementations based on gosseract and others. One is based on WASM and very slow, as I expected.

gedw99 commented 2 months ago

thanks @johbar

Wow that looks really good.

I think I should close this issue, since it seems to be complete for now.

johbar / text-extraction-service

OCR too #3