sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Parallel requests increase processing time #229

Open oanamocean opened 4 years ago

oanamocean commented 4 years ago

Hey, I have an API that uses the code below to extract text from images, but I'm having trouble understanding why performance degrades so badly when I run multiple requests in parallel.

```python
import logging

from tesserocr import PSM, RIL, PyTessBaseAPI, iterate_level


def recognize_text(image, lang, psm=PSM.SINGLE_LINE):
    recognized_text = []
    with PyTessBaseAPI(psm=psm, lang=lang) as api:
        api.SetSourceResolution(300)
        api.SetImage(image)
        api.SetVariable("tessedit_do_invert", "0")
        api.Recognize()
        ri = api.GetIterator()
        for r in iterate_level(ri, RIL.TEXTLINE):
            try:
                recognized_text.append(r.GetUTF8Text(RIL.TEXTLINE))
            except RuntimeError:
                logging.exception("Failed to extract text for a line")
    return recognized_text
```

A single request takes around 2 seconds, but running 10 requests at the same time pushes the total to 40 seconds. I've read a lot about how to optimise for better times and tried different Tesseract variables and configurations, but still couldn't find a solution. I've also set OMP_THREAD_LIMIT to 1, but it's not enough.

Any ideas about this?

sirfz commented 4 years ago

You're initializing the API for each request, which probably adds significant overhead. Try initializing a pool of PyTessBaseAPI instances, reusing them across threads, and see if that improves the run time.
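A minimal sketch of such a pool, using a blocking `queue.Queue` so each request borrows a pre-initialized instance instead of creating its own (the `InstancePool` class and the names in the usage comment are illustrative, not part of tesserocr):

```python
import queue


class InstancePool:
    """A simple blocking pool of reusable, pre-initialized instances."""

    def __init__(self, factory, size):
        self._q = queue.Queue()
        for _ in range(size):
            self._q.put(factory())

    def acquire(self):
        # Blocks until an instance is free, naturally throttling concurrency.
        return self._q.get()

    def release(self, instance):
        self._q.put(instance)


# Hypothetical usage with tesserocr, assuming a thread-per-request server:
#
#   from tesserocr import PSM, PyTessBaseAPI
#
#   pool = InstancePool(lambda: PyTessBaseAPI(psm=PSM.SINGLE_LINE, lang="eng"),
#                       size=4)
#
#   def handle_request(image):
#       api = pool.acquire()
#       try:
#           api.SetImage(image)
#           return api.GetUTF8Text()
#       finally:
#           api.Clear()          # reset per-image state before reuse
#           pool.release(api)
```

The pool size bounds how many OCR jobs run at once, which also keeps Tesseract's own thread usage in check alongside OMP_THREAD_LIMIT=1.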

bertsky commented 3 years ago

Also, because of the GIL, I recommend using multiprocessing instead of multithreading.

Beyond that, it depends on whether you want batch processing (e.g. over a bunch of files) or on-demand processing (e.g. in a server). For the former, see this example; for the latter, I recommend something based on mp.Queue and mp.Process.