sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Using ThreadPoolExecutor with image_to_text makes it slower instead of having a speedup #205

Open arocketman opened 4 years ago

arocketman commented 4 years ago

Hi, I have the following code:

import concurrent.futures
import time

import tesserocr
from pdf2image import convert_from_bytes

start = time.time()
ocr_entities = []

# Convert the PDF pages to PIL images
with open('prova.pdf', 'rb') as raw_pdf:
    ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=4)

# for image in ocr_entities:
#     tesserocr.image_to_text(image)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(tesserocr.image_to_text, ocr_entities)

end = time.time()
print('took tot: ' + str(end - start))

Using the commented-out part (a simple for loop), it runs faster than the ThreadPoolExecutor with map. I am not sure why this is happening, but we are talking about a 40-second difference. Is this the proper way to do it, or am I doing something wrong?

Another option I was exploring is having 4 classes with 4 different API objects and using a Python Queue to handle access to those instances.

Do you have any suggestions or, even better, an example of how you have used tesserocr with multithreading or multiprocessing?

Thanks for your help!

sirfz commented 4 years ago

Calling tesserocr.image_to_text will always initialize a new tesseract API instance for every call. A better way is to initialize the API instance first and then call SetImage + GetUTF8Text for each image. However, since tesseract API instances are not thread-safe, you're better off having a PyTessBaseAPI instance for each thread and re-using it (you can create a synchronized pool of API instances accessed by the threads, for example).
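
For example, something along these lines (an untested sketch using threading.local so each worker thread lazily creates its own API and re-uses it; the file name and thread count are placeholders):

import concurrent.futures
import threading

import tesserocr
from pdf2image import convert_from_path

thread_data = threading.local()

def ocr_image(img):
    # Create the API once per thread on first use, then re-use it.
    api = getattr(thread_data, 'api', None)
    if api is None:
        api = tesserocr.PyTessBaseAPI()
        thread_data.api = api
    api.SetImage(img)
    return api.GetUTF8Text()

images = convert_from_path('sample.pdf', dpi=300)
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    texts = list(executor.map(ocr_image, images))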

arocketman commented 4 years ago

Hi sirfz, thanks for the response, I appreciate your time. As I suspected, that was the case. I tried something similar to what you suggested, but unfortunately I am still not getting good performance.

I have used a python queue to synchronize the APIs.

import concurrent.futures
import queue
import time

import tesserocr
from pdf2image import convert_from_bytes

NUM_THREADS = 4
tesserocr_queue = queue.Queue()

def perform_ocr(img):
    tess_api = None
    try:
        print('Perform OCR started.')
        tess_api = tesserocr_queue.get(block=True, timeout=300)
        print('Api Acquired')
        tess_api.SetImage(img)
        text = tess_api.GetUTF8Text()
        print('OCR performed')
        return text
    except queue.Empty:
        print('Empty exception caught!')
        return None
    finally:
        if tess_api is not None:
            tesserocr_queue.put(tess_api)
            print('Api released')

if __name__ == '__main__':
    # Initialize API queue
    for _ in range(NUM_THREADS):
        tesserocr_queue.put(tesserocr.PyTessBaseAPI())

    start = time.time()

    # Pdf to image
    with open('prova.pdf', 'rb') as raw_pdf:
        ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=NUM_THREADS)

    # Perform OCR using ThreadPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        res = executor.map(perform_ocr, ocr_entities)

    end = time.time()
    print('took tot: ' + str(end - start))

    for _ in range(NUM_THREADS):
        api = tesserocr_queue.get(block=True)
        api.End()

While the synchronization seems to be working fine (no race conditions, and threads wait for an API object to be released before using it), I am still seeing a performance loss compared to this simpler example:

    api = tesserocr.PyTessBaseAPI()
    start = time.time()

    # Pdf to image
    with open('prova.pdf', 'rb') as raw_pdf:
        ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=NUM_THREADS)

    res = []
    for img in ocr_entities:
        print('Starting image')
        api.SetImage(img)
        res.append(api.GetUTF8Text())

    end = time.time()
    print('took tot: ' + str(end - start))
    api.End()

Here are my questions:

Thanks again for your time

sirfz commented 4 years ago

Your code seems fine to me. Have you tried tweaking the number of threads to see if you can find the optimal setting? I'd assume you benefit from concurrency when there's heavy processing involved; it's possible that for the images you're processing there's simply no gain from concurrency because the processing time is already very low.

You can always try multiprocessing instead but I don't think it's gonna be any different, theoretically.

One thing I'd like to point out, although it is likely not the issue here, is that you're including the image loading in your timing, which introduces some noise into your benchmarks. I suggest timing the image-processing part only.
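
For example (a rough sketch, re-using the perform_ocr and ocr_entities names from your snippet):

# Time only the OCR step, after the PDF has already been converted to images
ocr_start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    texts = list(executor.map(perform_ocr, ocr_entities))
print('OCR only: ' + str(time.perf_counter() - ocr_start))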

Personally, my goal at the time I wrote tesserocr was to have something faster than pytesseract so I never tried to optimize multithreaded vs sequential. Can't really answer your last question.

arocketman commented 4 years ago

Hi sirfz, thanks for getting back.

I refactored my code a little to take your suggestions into account; here are the takeaways:

Here's the gist for the new code: https://gist.github.com/arocketman/b74050b87a2c763e3023a1142dd70090

I am a little out of ideas here unfortunately.

sirfz commented 4 years ago

If every image takes 10 seconds of processing, then I think that's your main bottleneck, not concurrency. Is your CPU fully busy while images are being processed? If not, you can add a lot more threads to process as many images concurrently as possible, but you're basically bound by memory and CPU at this stage.

arocketman commented 4 years ago

Hi sirfz, this might be because I am doing these tests on my not-so-powerful laptop. Point taken, though.

I have the chance to try the same code in different environments; I shall do so and come back to this issue thread soon.

Thanks!

arocketman commented 4 years ago

Hi sirfz, I re-did the same tests in a completely different environment (a Windows-based machine, a little more powerful), and it seems I finally got the speed-up I was looking for using the exact same code. I am not completely sure what the determining factor is that kept my first environment from gaining the speed-up, but a far-fetched guess is that it might be related to how multi-threading is handled by tesseract itself.

Anyhow, I will present some results for future readers. I ran the following benchmark:

6 total runs of the code published before (see my previous comments), using different numbers of threads (1 to 10, to be precise) on a 10-page document of about 260 KB (not much, to be honest; I find the overall result still slow, but it might be my system).

Here are the results (time in seconds):

NumThreads  Average time   Std. dev.     Median time
1           93.85204608    1.782516798   93.70995784
2           32.15295778    13.28714628   27.14048362
3           12.65110448    5.623192528   10.14172578
4           10.33893752    1.606349404   10.31965017
5           12.51637963    2.98548617    13.32367992
6           10.2195756     2.370175547   10.22318089
7           13.59023993    3.803012242   15.10412645
8           11.89145529    2.999152741   10.78427005
9           11.4889638     1.825135847   11.27735865
10          11.26923306    2.33779279    10.58122993

It is interesting to see that it more or less caps at three threads (I expected four, the number of cores on my CPU). However, I find the result satisfying since it's quite a speed-up over single-threaded OCR (which is, by the way, extremely slow for some reason).

If you think it would be helpful to other people, I could open a PR with the code above (with some polishing) and place it in an examples folder.

Thanks for your support!

alekseiancheruk commented 4 years ago

Hi @arocketman, can you share your final code for this issue and/or make a PR like you suggested? I think I'm not the only one who has problems with the threading implementation, so everyone would learn from your work.

Thanks in advance !

FieteO commented 3 years ago

I noticed that converting to grayscale (convert_from_bytes(raw_pdf.read(), dpi=300, thread_count=4, grayscale=True)) yielded another ~10% decrease in time, using @arocketman's code from here.

bertsky commented 3 years ago

@sirfz ,

You can always try multiprocessing instead but I don't think it's gonna be any different, theoretically.

I disagree. Python's global interpreter lock (GIL) should prevent any true parallelism theoretically – so preemptive multitasking would only yield a speedup by better multiplexing blocking I/O operations.

I would recommend multiprocessing, see #229.

(However, there seems to be another option at least here in this case, cf. Cython multithreading...)
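
For example, a rough sketch of the multiprocessing route (not from this repo; each worker process owns its own PyTessBaseAPI, and the file name and pool size are placeholders):

import multiprocessing

import tesserocr
from pdf2image import convert_from_path

_api = None

def init_worker():
    # Runs once in each worker process: create the API it will re-use.
    global _api
    _api = tesserocr.PyTessBaseAPI()

def ocr_image(img):
    _api.SetImage(img)
    return _api.GetUTF8Text()

if __name__ == '__main__':
    images = convert_from_path('sample.pdf', dpi=300)
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        texts = pool.map(ocr_image, images)

Note that the page images get pickled when sent to the workers, so for very large pages the serialization overhead can eat into the gain.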

@arocketman ,

I am not completely sure what is the determining factor causing my first environment to not gain the speed-up, but a far fetched guess might be related to how the multi-threading is handled by tesseract itself.

That might absolutely be the case: in Tesseract proper it has been repeatedly discussed (and finally documented) that the OpenMP implementation usually wastes performance (to some extent even with OMP_THREAD_LIMIT=1) and should therefore not be compiled in when building libtesseract.

Regarding your measurements, since the figures you presented are wall-clock time, it would probably help to complement these with CPU-time seconds and CPU usage (like /usr/bin/time does or time.process_time() provides).
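
For example (a rough sketch; run_ocr is a placeholder for the actual ThreadPoolExecutor run):

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

run_ocr()  # placeholder for the OCR step being measured

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print('wall: %.2fs, CPU: %.2fs, utilization: %.0f%%' % (wall, cpu, 100 * cpu / wall))

Since time.process_time() sums CPU time over all threads of the process, utilization well above 100% would indicate the threads really are running in parallel.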

bertsky commented 3 years ago

I disagree. Python's global interpreter lock (GIL) should prevent any true parallelism theoretically – so preemptive multitasking would only yield a speedup by better multiplexing blocking I/O operations.

I would recommend multiprocessing, see #229.

(However, there seems to be another option at least here in this case, cf. Cython multithreading...)

@sirfz, sorry, I guess I was wrong. You already meticulously placed with nogil everywhere years ago. I guess the additional functionality described in the above link (cython.parallel.prange and cython.parallel.parallel) is not strictly necessary and can be approximated via Python facilities as well.

But the initial stumbling point here was the API initialization for each request. IMO the current README contains a formulation that sends users right down the wrong path:

https://github.com/sirfz/tesserocr/blob/711cbab544dbb4bd3dcf1f13aad9d0fef20fcac7/README.rst#L164-L165

sirfz commented 3 years ago

Yes, that's why I said that theoretically multithreading should give decent results (close to multiprocessing), because the GIL is released by tesserocr (for some API methods).

Indeed, although image_to_text and file_to_text are convenient and provide decent parallel performance using threading, it is always best to maintain your own pool of PyTessBaseAPI instances and re-use them, to avoid initializing a new instance for every request.

dukaenea commented 2 years ago

Hi. I am trying to use concurrent execution to parallelise OCR. I am following the approach @arocketman suggested, but as with his first machine, I am getting a slowdown instead of a speedup.

I have tweaked the number of threads in my pool and it seems that the execution time and the number of threads are positively correlated.

I also need to mention that my images are small and contain up to around 50 words each. Here are some of the results I got while tweaking the number of threads.

No concurrency  1 thread   2 threads  3 threads  4 threads  Num_Images
0.14638         0.16825    0.18356    0.23780    0.24445    10
1.53653         1.83914    2.11177    2.70460    2.66555    30
1.75414         2.69041    3.13779    4.37517    4.44549    136
3.40942         5.67881    5.99834    11.05777   11.19764   204
14.10713        17.74640   22.28873   32.59760   30.43171   320

adsk2050 commented 1 year ago

Hello! I need some help with this:

I have a 3 step process:

I am confused about how to parallelize this whole process. I have realized this can be done in a lot of ways and I am not sure which way to go.

Run a separate process for doing OCR and use your code, and pass in all OCR calls to this process. [I don't have clarity on how to implement this].

Please let me know if you need some additional info or redirect me to some resource which I can follow to get clarity on this.

Thank you :-)