Closed gedw99 closed 2 months ago
I haven't needed OCR so far. However, my "corpus" seems to contain some scanned PDFs with no readable textual content that would profit from OCR. I know there is Tesseract and a decent Go wrapper available, but I guess there is some complexity in getting the layout right when a document contains (real) text and OCR'able graphics etc.
However, if you or I were to add Tesseract, I would prefer to have it as a build-time option, i.e. bound to a build tag, or some kind of weak/optional dependency (like wv, which might be installed or not), so users can decide whether to use it or not. I don't want to bloat this service or reduce its performance unnecessarily.
If you go forward implementing this, I will definitely consider merging it. But please be aware that this is not (yet) a community project. I have no experience in managing something like this.
Hey @johbar
Yep no pressure... Community project I know...
I agree that a build tag is needed because it's got cgo and some nasty dependencies.
Or perhaps make the OCR its own binary (or WASM) and then use it via stdio. NATS or something else can then talk to it?
If you have any ideas on this, I am happy to fit in with how you do things.
I actually started implementing this.
Some design considerations:
$PATH is the easiest way to avoid build-time dependencies. It's not the most performant, though.
go-pdfium has some, too).
Considerations and experiences concerning how to use and integrate OCR:
Any thought on any of this is welcome.
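To make the `$PATH` idea concrete, here is a minimal sketch of calling the Tesseract binary as an optional runtime dependency. It is not the code from this project: the names `ocrImage`, `tesseractArgs` and `errNoTesseract` are made up for illustration. It assumes the `tesseract` CLI accepts `stdin` and `stdout` as input and output names and `-l` for the language(s), and that the caller has already rendered the page to an encoded image.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// errNoTesseract (hypothetical name) signals that the optional
// dependency is not installed, so callers can skip OCR gracefully.
var errNoTesseract = errors.New("tesseract not found in $PATH")

// tesseractArgs builds the CLI arguments: read the image from stdin,
// write plain text to stdout, use the given language(s), e.g. "deu+eng".
func tesseractArgs(langs string) []string {
	return []string{"stdin", "stdout", "-l", langs}
}

// ocrImage pipes an encoded image (PNG, JPEG, ...) through the
// tesseract binary found in $PATH and returns the recognized text.
func ocrImage(img []byte, langs string) (string, error) {
	bin, err := exec.LookPath("tesseract")
	if err != nil {
		return "", errNoTesseract
	}
	cmd := exec.Command(bin, tesseractArgs(langs)...)
	cmd.Stdin = bytes.NewReader(img)
	var stdout, stderr bytes.Buffer
	cmd.Stdout = &stdout
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("tesseract failed: %w: %s", err, stderr.String())
	}
	return stdout.String(), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: ocr <image-file>")
		return
	}
	img, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	text, err := ocrImage(img, "eng")
	if errors.Is(err, errNoTesseract) {
		fmt.Println("OCR skipped:", err)
		return
	}
	if err != nil {
		fmt.Println("OCR error:", err)
		return
	}
	fmt.Print(text)
}
```

Starting one process per page is where the performance cost mentioned above comes from; the upside is that nothing has to be linked at build time.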
Just remembered a second use case besides scanned docs: PDFs that embed fonts without Unicode character mappings. In these cases it is impossible to get anything useful from the text; PDFium just returns garbage. OCRing the pages would deliver better results. But here again the complexity is to decide (in an automated way) when to use it. And to choose the right language(s).
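One possible way to automate the "when to use it" decision is a cheap heuristic on the extracted text. The sketch below is only an assumption about how that could look: `needsOCR` and its thresholds are hypothetical and untuned. It treats a page as an OCR candidate when it has almost no text (likely a scan) or when a large share of its characters are replacement characters, private-use code points, or control characters. It will not catch garbage that happens to consist of ordinary letters, and it says nothing about picking the language.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// needsOCR reports whether extracted page text looks unusable.
// minRunes and maxBadRatio are illustrative thresholds, not tuned values.
func needsOCR(text string, minRunes int, maxBadRatio float64) bool {
	total, bad := 0, 0
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue // whitespace says nothing about text quality
		}
		total++
		// Replacement characters, private-use code points and control
		// characters are typical for text without a usable mapping.
		if r == utf8.RuneError || unicode.Is(unicode.Co, r) || unicode.IsControl(r) {
			bad++
		}
	}
	if total < minRunes {
		return true // (almost) no text: probably a scanned page
	}
	return float64(bad)/float64(total) > maxBadRatio
}

func main() {
	pages := []string{
		"This page has ordinary, readable text on it.",
		"",
		"\uE000\uE001\uE002\uE003\uE004\uE005\uE006\uE007\uE008\uE009\uE00A\uE00B",
	}
	for i, p := range pages {
		fmt.Printf("page %d: needs OCR = %v\n", i+1, needsOCR(p, 10, 0.3))
	}
}
```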
Hey @johbar
I read your design considerations...
They all seem really good to me.
I like how we can just run off the tesseract binary sitting in $PATH. Nice and easy.
If you want me to give it a try etc., just shout.
Thanks for considering this.
If you want, I can help make testing examples etc.
For example, a Makefile that works cross-platform...
Make an issue for this...
Which package are you using for the OCR?
Not too bad. But no support for the C binary being external, I think...
Have a look at the feature branch.
I implemented a Tesseract CLI wrapper myself and added some simple drop-in replacement implementations based on gosseract and others. One is based on WASM and very slow, as I expected.
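For readers wondering what "drop-in replacement implementations" could look like in Go, here is a rough sketch. It is not taken from the feature branch: the `OCR` interface, `funcOCR`, `noopOCR`, `pickBackend` and `ErrOCRUnavailable` are hypothetical names. The idea is that a CLI, cgo, or WASM backend all satisfy one small interface, and a no-op fallback keeps the service working when no backend is compiled in or installed.

```go
package main

import (
	"errors"
	"fmt"
)

// OCR is a hypothetical interface; the actual branch may define it differently.
type OCR interface {
	Recognize(img []byte, langs string) (string, error)
}

// ErrOCRUnavailable is returned when no backend is available.
var ErrOCRUnavailable = errors.New("ocr: no backend available")

// noopOCR is the fallback used when OCR is disabled or not installed.
type noopOCR struct{}

func (noopOCR) Recognize([]byte, string) (string, error) {
	return "", ErrOCRUnavailable
}

// funcOCR adapts a plain function to the interface, which is handy for
// wrapping a CLI, cgo, or WASM backend, and for fakes in tests.
type funcOCR func(img []byte, langs string) (string, error)

func (f funcOCR) Recognize(img []byte, langs string) (string, error) {
	return f(img, langs)
}

// pickBackend returns the first non-nil candidate, or the no-op fallback.
func pickBackend(candidates ...OCR) OCR {
	for _, c := range candidates {
		if c != nil {
			return c
		}
	}
	return noopOCR{}
}

func main() {
	// A fake backend standing in for a real implementation.
	fake := funcOCR(func(img []byte, langs string) (string, error) {
		return fmt.Sprintf("recognized %d bytes with %s", len(img), langs), nil
	})

	text, err := pickBackend(nil, fake).Recognize([]byte{1, 2, 3}, "eng")
	fmt.Println(text, err)

	_, err = pickBackend().Recognize(nil, "eng")
	fmt.Println(err)
}
```

With this shape, a build tag would only need to decide which candidates get registered, while the rest of the service talks to the interface.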
thanks @johbar
Wow that looks really good.
I think I should close this issue, since it seems to be complete for now.
I think I can add OCR too.