otiai10 / gosseract

Go package for OCR (Optical Character Recognition), using the Tesseract C++ library
https://pkg.go.dev/github.com/otiai10/gosseract
MIT License
2.69k stars 289 forks

Losing stderr when running multiple clients at once #316

Open uniwisejohannes opened 1 week ago

uniwisejohannes commented 1 week ago

Sorry if this is silly, but we wanted to hear the developer's thoughts on this.

In our project we initialise 4 different gosseract.Clients one at a time using gosseract.NewClient, and then for each of these clients we have a goroutine in which we call SetImageFromBytes and GetBoundingBoxes on that client. Each of these goroutines consumes a lot of documents, processing them one at a time (so up to 4 documents are sometimes being processed at once).
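
For reference, the setup described above can be sketched roughly as follows. This is a minimal illustration and not our actual code: `ocrClient` is a hypothetical interface standing in for `*gosseract.Client` (the real `GetBoundingBoxes` takes a page iterator level and returns bounding boxes, which is simplified to strings here), and `fakeClient` and `runWorkers` are made-up names so that the example runs without Tesseract installed.

```go
package main

import (
	"fmt"
	"sync"
)

// ocrClient is a hypothetical, simplified stand-in for the two
// *gosseract.Client methods mentioned above. It is not the real API.
type ocrClient interface {
	SetImageFromBytes(data []byte) error
	GetBoundingBoxes() ([]string, error)
}

// fakeClient is a placeholder implementation that does no real OCR.
type fakeClient struct {
	image []byte
}

func (c *fakeClient) SetImageFromBytes(data []byte) error {
	c.image = data
	return nil
}

func (c *fakeClient) GetBoundingBoxes() ([]string, error) {
	return []string{string(c.image)}, nil
}

// runWorkers starts one goroutine per client. Each goroutine owns its
// client and processes documents from a shared channel one at a time.
// It returns the number of documents processed without error.
// Note: clients must be non-empty, otherwise nothing drains the channel.
func runWorkers(clients []ocrClient, docs [][]byte) int {
	jobs := make(chan []byte)
	var wg sync.WaitGroup
	var mu sync.Mutex
	processed := 0

	for _, c := range clients {
		wg.Add(1)
		go func(c ocrClient) {
			defer wg.Done()
			for doc := range jobs {
				if err := c.SetImageFromBytes(doc); err != nil {
					continue
				}
				if _, err := c.GetBoundingBoxes(); err != nil {
					continue
				}
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}(c)
	}

	for _, d := range docs {
		jobs <- d
	}
	close(jobs)
	wg.Wait()
	return processed
}

func main() {
	// Four clients, created one at a time, as in the description above.
	clients := make([]ocrClient, 4)
	for i := range clients {
		clients[i] = &fakeClient{}
	}
	docs := [][]byte{[]byte("a"), []byte("b"), []byte("c"), []byte("d"), []byte("e"), []byte("f")}
	fmt.Println("processed:", runWorkers(clients, docs))
}
```

Each client is only ever touched by its own goroutine, so no client is shared between goroutines; the only state shared across them is process-wide, such as stderr.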

This seemingly corrupts stderr so that we lose all the logs in our system.

Is it just not possible to run 4 separate TessBaseAPIs at once?

We are using ENV OMP_THREAD_LIMIT=1 in our Dockerfile and our pod has 4 cores.

We also have RUN CGO_ENABLED=1 GOOS=linux GOARCH=amd64 go build -o /build/bin/service main.go

Our image base is debian:12

otiai10 commented 1 week ago

That's because gosseract currently hijacks stderr. This was a workaround from when we first implemented gosseract, and we don't believe it is the best approach. We need to identify a better one. Meanwhile, I'm thinking about adding a way to opt out of the stderr hijack. What do you think?