johbar / text-extraction-service

A simple Golang service for extracting textual content from documents
GNU General Public License v3.0

OCR too #3

Closed gedw99 closed 2 months ago

gedw99 commented 2 months ago

I think I can add OCR too.

johbar commented 2 months ago

I haven't needed OCR so far. However, my "corpus" seems to contain some scanned PDFs with no readable textual content that would profit from OCR. I know there is Tesseract and a decent Go wrapper available, but I guess there is some complexity in getting the layout right when a document contains both (real) text and OCR'able graphics etc.

However, if you or I were to add Tesseract, I would prefer to have it as a build-time option, i.e. bound to a build tag, or as some kind of weak/optional dependency (like wv, which might be installed or not), so users can decide whether to use it or not. I don't want to bloat this service or reduce its performance unnecessarily.

If you go forward implementing this, I will definitely consider merging it. But please be aware that this is not (yet) a community project. I have no experience in managing something like this.

gedw99 commented 2 months ago

Hey @johbar

Yep no pressure... Community project I know...

I agree that a build tag is needed because it's got CGO and some nasty dependencies.

Or perhaps make the OCR its own binary (or WASM) and then use it via STDIO. NATS or something else can then talk to it?

If you have any ideas on this, I'm happy to fit in with how you do things.

johbar commented 2 months ago

I actually started implementing this.

Some design considerations:

Considerations and experiences concerning how to use and integrate OCR:

Any thought on any of this is welcome.

johbar commented 2 months ago

Just remembered a second use case besides scanned docs: PDFs with embedded fonts that have no Unicode character mappings. In these cases it is impossible to get anything useful from the text; PDFium just returns garbage. OCRing the pages would deliver better results. But here again the complexity lies in deciding (in an automated way) when to use it, and in choosing the right language(s).
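One possible way to automate the "when to use it" decision is a simple heuristic over the extracted text. The sketch below is only an assumption about what "garbage" might look like (replacement characters, private-use code points, control characters). It is not how PDFium or this service decides anything, and `suspiciousRatio`, `needsOCR` and the threshold are hypothetical.

```go
package main

import (
	"fmt"
	"unicode"
)

// suspiciousRatio (hypothetical helper) returns the share of non-space runes
// that are unlikely to be real text: U+FFFD, private-use code points, or
// control characters. Text without any non-space runes counts as fully suspicious.
func suspiciousRatio(text string) float64 {
	var total, bad int
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if r == unicode.ReplacementChar || unicode.Is(unicode.Co, r) || unicode.IsControl(r) {
			bad++
		}
	}
	if total == 0 {
		return 1
	}
	return float64(bad) / float64(total)
}

// needsOCR (hypothetical helper) flags a page for OCR when the suspicious
// share exceeds the given threshold. The threshold would need tuning on a
// real corpus.
func needsOCR(text string, threshold float64) bool {
	return suspiciousRatio(text) > threshold
}

func main() {
	pages := []string{
		"Hello, world",
		"\uE000\uE001\uE002 a", // mostly private-use code points
		"",                     // no text at all, e.g. a scanned page
	}
	for i, p := range pages {
		fmt.Printf("page %d: needs OCR = %v\n", i+1, needsOCR(p, 0.3))
	}
}
```

This does not catch fonts that map glyphs to wrong but valid letters, and it says nothing about which language(s) to pass to the OCR engine.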

gedw99 commented 2 months ago

Hey @johbar

I read your design considerations...

They all seem really good to me.

I like how we can just run off the tesseract binary sitting in $PATH. Nice and easy.

If you want me to give it a try etc., just shout.

Thanks for considering this.

gedw99 commented 2 months ago

If you want, I can help make testing examples etc.

For example, a Makefile that works cross-platform...

Make an issue for this...

gedw99 commented 2 months ago

Which package are you using for the OCR?

Not too bad. But no support for the C binary being external, I think...

https://github.com/otiai10/gosseract

https://github.com/otiai10/ocrserver

johbar commented 2 months ago

Have a look at the feature branch.

I implemented a Tesseract CLI wrapper myself and added some simple drop-in replacement implementations based on gosseract and others. One is based on WASM and very slow, as I expected.

gedw99 commented 2 months ago

thanks @johbar

Wow that looks really good.

I think I should close this issue, since it seems to be complete for now.