sirfz / tesserocr

A Python wrapper for the tesseract-ocr API
MIT License

Using ThreadPoolExecutor with image_to_text makes it slower instead of having a speedup #205

Open arocketman opened 4 years ago

arocketman commented 4 years ago

Hi, I have the following code:

import concurrent.futures
import time

import tesserocr
from pdf2image import convert_from_bytes

start = time.time()
ocr_entities = []

# Convert the PDF pages to PIL images
with open('prova.pdf', 'rb') as raw_pdf:
    ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=4)

# for image in ocr_entities:
#     tesserocr.image_to_text(image)

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(tesserocr.image_to_text, ocr_entities)

end = time.time()
print('took tot: ' + str(end - start))

Using the commented-out part (a simple for loop), it runs faster than the ThreadPoolExecutor with map. I am not sure why this is happening, but we are talking about a 40-second difference. Is this the proper way to do it, or am I doing something wrong?

Another option I was exploring is having 4 classes with 4 different API objects and using a Python Queue to handle access to those instances.

Do you have any suggestions or, even better, an example of how you have used tesserocr with multithreading or multiprocessing?

Thanks for your help!

sirfz commented 4 years ago

Calling tesserocr.image_to_text will always initialize a new tesseract API instance for every call. A better way is to initialize the API instance first and then call SetImage + GetUTF8Text for each image. However, since tesseract API instances are not thread-safe, you're better off having a PyTessBaseAPI instance for each thread and re-using it (you can create a synchronized pool of API instances accessed by the threads, for example).
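
For example, something along these lines (an untested sketch using threading.local so each worker thread lazily creates its own API and re-uses it; the file name and thread count are placeholders):

import concurrent.futures
import threading

import tesserocr
from pdf2image import convert_from_path

thread_data = threading.local()

def ocr_image(img):
    # Create the API once per thread on first use, then re-use it.
    api = getattr(thread_data, 'api', None)
    if api is None:
        api = tesserocr.PyTessBaseAPI()
        thread_data.api = api
    api.SetImage(img)
    return api.GetUTF8Text()

images = convert_from_path('sample.pdf', dpi=300)
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    texts = list(executor.map(ocr_image, images))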

arocketman commented 4 years ago

Hi sirfz, thanks for the response, I appreciate your time. As I suspected, that was the case. I tried something similar to what you suggested, but unfortunately I am still not getting good performance.

I have used a python queue to synchronize the APIs.

import concurrent.futures
import queue
import time

import tesserocr
from pdf2image import convert_from_bytes

NUM_THREADS = 4
tesserocr_queue = queue.Queue()

def perform_ocr(img):
    tess_api = None
    try:
        print('Perform OCR started.')
        tess_api = tesserocr_queue.get(block=True, timeout=300)
        print('Api Acquired')
        tess_api.SetImage(img)
        text = tess_api.GetUTF8Text()
        print('OCR performed')
        return text
    except queue.Empty:
        print('Empty exception caught!')
        return None
    finally:
        if tess_api is not None:
            tesserocr_queue.put(tess_api)
            print('Api released')

if __name__ == '__main__':
    # Initialize API queue
    for _ in range(NUM_THREADS):
        tesserocr_queue.put(tesserocr.PyTessBaseAPI())

    start = time.time()

    # Pdf to image
    with open('prova.pdf', 'rb') as raw_pdf:
        ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=NUM_THREADS)

    # Perform OCR using ThreadPoolExecutor
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        res = executor.map(perform_ocr, ocr_entities)

    end = time.time()
    print('took tot: ' + str(end - start))

    for _ in range(NUM_THREADS):
        api = tesserocr_queue.get(block=True)
        api.End()

While the synchronization seems to be working fine (no race conditions, and threads wait for an API object to be released before using it), I am still seeing a performance loss compared to this simpler example:

    api = tesserocr.PyTessBaseAPI()
    start = time.time()

    # Pdf to image
    with open('prova.pdf', 'rb') as raw_pdf:
        ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=NUM_THREADS)

    res = []
    for img in ocr_entities:
        print('Starting image')
        api.SetImage(img)
        res.append(api.GetUTF8Text())

    end = time.time()
    print('took tot: ' + str(end - start))
    api.End()

Here are my questions:

Thanks again for your time

sirfz commented 4 years ago

Your code seems fine to me. Have you tried tweaking the number of threads to see if you can find the optimal setting? I'd assume you benefit from concurrency when there's heavy processing involved; it's possible that for the images you're processing there's simply no gain from concurrency because the processing time is already very low.

You can always try multiprocessing instead but I don't think it's gonna be any different, theoretically.

One thing I'd like to point out, although it is likely not the issue here, is that you're including the image loading in your timing, which introduces some noise into your benchmarks. I suggest timing the image-processing part only.
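
For example (a rough sketch, re-using the perform_ocr and ocr_entities names from your snippet):

# Time only the OCR step, after the PDF has already been converted to images
ocr_start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    texts = list(executor.map(perform_ocr, ocr_entities))
print('OCR only: ' + str(time.perf_counter() - ocr_start))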

Personally, my goal at the time I wrote tesserocr was to have something faster than pytesseract so I never tried to optimize multithreaded vs sequential. Can't really answer your last question.

arocketman commented 4 years ago

Hi sirfz, thanks for getting back.

I refactored my code a little to take your suggestions into account; here are the takeaways:

Here's the gist for the new code: https://gist.github.com/arocketman/b74050b87a2c763e3023a1142dd70090

I am a little out of ideas here unfortunately.

sirfz commented 4 years ago

If every image takes 10 seconds of processing, then I think that's your main bottleneck, not concurrency. Is your CPU fully busy while images are being processed? If not, you can add a lot more threads to process as many images concurrently as possible, but you're basically bound by memory and CPU at this stage.

arocketman commented 4 years ago

Hi sirfz, this might be because I am doing these tests on my not-so-powerful laptop. Point taken, though.

I have the chance to try the same code in different environments; I shall do so and come back to this issue thread soon.

Thanks!

arocketman commented 4 years ago

Hi sirfz, I re-did the same tests in a completely different environment (a Windows-based machine, a little more powerful), and it seems I finally got the speed-up I was looking for using the exact same code. I am not completely sure what the determining factor is that kept my first environment from gaining the speed-up, but a far-fetched guess is that it might be related to how multi-threading is handled by tesseract itself.

Anyhow, I will present some results for future readers. I ran the following benchmark:

6 total runs of the code published before (see my previous comments), using different numbers of threads (1 to 10, to be precise) on a 10-page document of about 260 KB (not much, to be honest; I find the overall result still slow, but it might be my system).

Here are the results (time in seconds):

NumThreads  Average time   Std. dev.     Median time
1           93.85204608    1.782516798   93.70995784
2           32.15295778    13.28714628   27.14048362
3           12.65110448    5.623192528   10.14172578
4           10.33893752    1.606349404   10.31965017
5           12.51637963    2.98548617    13.32367992
6           10.2195756     2.370175547   10.22318089
7           13.59023993    3.803012242   15.10412645
8           11.89145529    2.999152741   10.78427005
9           11.4889638     1.825135847   11.27735865
10          11.26923306    2.33779279    10.58122993

It is interesting to see that it more or less caps at three threads (I expected four, the number of cores on my CPU). However, I find the result satisfying since it's quite a speed-up over single-threaded OCR (which is, by the way, extremely slow for some reason).

If you think it would be helpful to other people, I could open a PR with the code above (with some polishing) and place it in an examples folder.

Thanks for your support!

alekseiancheruk commented 4 years ago

Hi @arocketman, can you share your final code for this issue and/or make a PR like you suggested? I think I'm not the only one who has problems with the threading implementation, so everyone would learn from your work.

Thanks in advance !

FieteO commented 3 years ago

I noticed that converting to grayscale (convert_from_bytes(raw_pdf.read(), dpi=300, thread_count=4, grayscale=True)) yielded another ~10% decrease in time, using @arocketman's code from here.

bertsky commented 3 years ago

@sirfz ,

You can always try multiprocessing instead but I don't think it's gonna be any different, theoretically.

I disagree. Python's global interpreter lock (GIL) should prevent any true parallelism theoretically – so preemptive multitasking would only yield a speedup by better multiplexing blocking I/O operations.

I would recommend multiprocessing, see #229.

(However, there seems to be another option at least here in this case, cf. Cython multithreading...)
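
For example, a rough sketch of the multiprocessing route (not from this repo; each worker process owns its own PyTessBaseAPI, and the file name and pool size are placeholders):

import multiprocessing

import tesserocr
from pdf2image import convert_from_path

_api = None

def init_worker():
    # Runs once in each worker process: create the API it will re-use.
    global _api
    _api = tesserocr.PyTessBaseAPI()

def ocr_image(img):
    _api.SetImage(img)
    return _api.GetUTF8Text()

if __name__ == '__main__':
    images = convert_from_path('sample.pdf', dpi=300)
    with multiprocessing.Pool(processes=4, initializer=init_worker) as pool:
        texts = pool.map(ocr_image, images)

Note that the page images get pickled when sent to the workers, so for very large pages the serialization overhead can eat into the gain.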

@arocketman ,

I am not completely sure what is the determining factor causing my first environment to not gain the speed-up, but a far fetched guess might be related to how the multi-threading is handled by tesseract itself.

That might absolutely be the case: in Tesseract proper it has been repeatedly discussed (and finally documented) that the OpenMP implementation usually wastes performance (to some extent even with OMP_THREAD_LIMIT=1) and should therefore not be compiled in when building libtesseract.

Regarding your measurements, since the figures you presented are wall-clock time, it would probably help to complement these with CPU-time seconds and CPU usage (like /usr/bin/time does or time.process_time() provides).
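
For example (a rough sketch; run_ocr is a placeholder for the actual ThreadPoolExecutor run):

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

run_ocr()  # placeholder for the OCR step being measured

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print('wall: %.2fs, CPU: %.2fs, utilization: %.0f%%' % (wall, cpu, 100 * cpu / wall))

Since time.process_time() sums CPU time over all threads of the process, utilization well above 100% would indicate the threads really are running in parallel.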

bertsky commented 3 years ago

I disagree. Python's global interpreter lock (GIL) should prevent any true parallelism theoretically – so preemptive multitasking would only yield a speedup by better multiplexing blocking I/O operations.

I would recommend multiprocessing, see #229.

(However, there seems to be another option at least here in this case, cf. Cython multithreading...)

@sirfz, sorry, I guess I was wrong. You already meticulously placed with nogil everywhere years ago. I guess the additional functionality described in the above link (cython.parallel.prange and cython.parallel.parallel) is not strictly necessary and can be approximated via Python facilities as well.

But the initial stumbling point here was the API initialization for each request. IMO the current README contains a formulation that sends users right down the wrong path:

https://github.com/sirfz/tesserocr/blob/711cbab544dbb4bd3dcf1f13aad9d0fef20fcac7/README.rst#L164-L165

sirfz commented 3 years ago

Yes, that's why I said that theoretically multithreading should give decent results (close to multiprocessing), because the GIL is released by tesserocr (for some API methods).

Indeed, although image_to_text and file_to_text are convenient and provide decent parallel performance using threading, it is always best to maintain your own pool of PyTessBaseAPI instances and re-use them, to avoid initializing a new instance for every request.

dukaenea commented 2 years ago

Hi. I am trying to use concurrent execution to parallelise OCR. I am following the approach @arocketman suggested, but as with his first machine, I am getting a slowdown instead of a speedup.

I have tweaked the number of threads in my pool and it seems that the execution time and the number of threads are positively correlated.

I also need to mention that my images are small and contain up to around 50 words each. Here are some of the results I got while tweaking the number of threads.

No concurrency  1 thread   2 threads  3 threads  4 threads  Num_Images
0.14638         0.16825    0.18356    0.23780    0.24445    10
1.53653         1.83914    2.11177    2.70460    2.66555    30
1.75414         2.69041    3.13779    4.37517    4.44549    136
3.40942         5.67881    5.99834    11.05777   11.19764   204
14.10713        17.74640   22.28873   32.59760   30.43171   320

adsk2050 commented 1 year ago

Hello! I need some help with this:

I have a 3 step process:

I am confused about how to parallelize this whole process. I have realized this can be done in a lot of ways and I am not sure which way to go.

Run a separate process for doing OCR and use your code, and pass in all OCR calls to this process. [I don't have clarity on how to implement this].

Please let me know if you need some additional info or redirect me to some resource which I can follow to get clarity on this.

Thank you :-)