arocketman opened this issue 4 years ago
Calling tesserocr.image_to_text will always initialize a new tesseract API instance for every call. A better approach is to initialize the API instance once, then call SetImage + GetUTF8Text for each image. However, since tesseract API instances are not thread-safe, you're better off having a PyTessBaseAPI instance per thread and reusing them (for example, you can create a synchronized pool of API instances that the threads draw from).
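(A minimal sketch of that per-thread pattern, using threading.local so each worker thread lazily creates and reuses its own API instance; the helper names here are illustrative, not part of tesserocr:)

import threading
from concurrent.futures import ThreadPoolExecutor

import tesserocr

_thread_data = threading.local()

def _get_api():
    # Lazily create one PyTessBaseAPI per worker thread and keep reusing it.
    if not hasattr(_thread_data, 'api'):
        _thread_data.api = tesserocr.PyTessBaseAPI()
    return _thread_data.api

def ocr_image(img):
    api = _get_api()
    api.SetImage(img)  # img is a PIL image
    return api.GetUTF8Text()

# Usage (the per-thread APIs are not explicitly End()ed in this sketch):
# with ThreadPoolExecutor(max_workers=4) as executor:
#     texts = list(executor.map(ocr_image, images))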
Hi sirfz, thanks for the response, I appreciate your time. As I suspected this was the case, I tried something similar to what you suggested, but unfortunately I am still not getting good performance.
I have used a Python queue to synchronize access to the API instances.
import concurrent.futures
import queue
import time

import tesserocr
from pdf2image import convert_from_bytes

NUM_THREADS = 4
tesserocr_queue = queue.Queue()


def perform_ocr(img):
    tess_api = None
    try:
        print('Perform OCR started.')
        tess_api = tesserocr_queue.get(block=True, timeout=300)
        print('Api Acquired')
        tess_api.SetImage(img)
        text = tess_api.GetUTF8Text()
        print('OCR performed')
        return text
    except queue.Empty:  # the Empty exception lives in the queue module, not on the Queue instance
        print('Empty exception caught!')
        return None
    finally:
        # Return the API instance to the pool (only if one was actually acquired)
        if tess_api is not None:
            tesserocr_queue.put(tess_api)
            print('Api released')


if __name__ == '__main__':
    # Initialize API queue
    for _ in range(NUM_THREADS):
        tesserocr_queue.put(tesserocr.PyTessBaseAPI())

    start = time.time()

    # Pdf to image
    with open('prova.pdf', 'rb') as raw_pdf:
        ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=NUM_THREADS)

    # Perform OCR using ThreadPoolExecutor; leaving the with-block waits for all tasks to finish
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        res = executor.map(perform_ocr, ocr_entities)

    end = time.time()
    print('took tot: ' + str(end - start))

    # Clean up the API instances
    for _ in range(NUM_THREADS):
        api = tesserocr_queue.get(block=True)
        api.End()
While the synchronization seems to be working fine (no race conditions, and threads wait for an API object to be released before using it), I am still seeing a performance loss compared to this simpler, single-threaded example:
api = tesserocr.PyTessBaseAPI()

start = time.time()

# Pdf to image
with open('prova.pdf', 'rb') as raw_pdf:
    ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=NUM_THREADS)

res = []
for img in ocr_entities:
    print('Starting image')
    api.SetImage(img)
    res.append(api.GetUTF8Text())

end = time.time()
print('took tot: ' + str(end - start))

api.End()
Here are my questions:
Thanks again for your time
Your code seems fine to me. Have you tried tweaking the number of threads to see if you can find the optimal setting? I'd assume you can benefit from concurrency when there's heavy processing involved; it's possible that for the images you're processing there's simply no gain from concurrency because the processing time is already very low.
You can always try multiprocessing instead but I don't think it's gonna be any different, theoretically.
One thing I'd like to point out, although it's likely not the issue here, is that you're including the image loading in your timing, which introduces some noise into your benchmarks. I suggest timing the image-processing part only.
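(For instance, a small rearrangement of the script above so that only the OCR part is timed; this reuses the perform_ocr and NUM_THREADS definitions from the earlier comment:)

# Convert the PDF outside the timed region
with open('prova.pdf', 'rb') as raw_pdf:
    ocr_entities = convert_from_bytes(raw_pdf.read(), dpi=500, thread_count=NUM_THREADS)

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    results = list(executor.map(perform_ocr, ocr_entities))
end = time.time()
print('OCR only: ' + str(end - start))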
Personally, my goal at the time I wrote tesserocr was to have something faster than pytesseract so I never tried to optimize multithreaded vs sequential. Can't really answer your last question.
Hi sirfz, thanks for getting back.
I refactored my code a little to take your considerations into account; here are the takeaways:
Here's the gist for the new code: https://gist.github.com/arocketman/b74050b87a2c763e3023a1142dd70090
I am a little out of ideas here, unfortunately.
If every image takes 10 seconds of processing then I think that's your main bottleneck, not concurrency. Is your CPU fully busy while images are being processed? If not, you can add a lot more threads to process as many images concurrently as possible, but at that point you're basically bound by memory and CPU.
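(One way to answer the "is the CPU fully busy" question, a sketch that assumes the third-party psutil package is installed:)

import threading

import psutil  # third-party: pip install psutil

def report_cpu(stop_event, interval=1.0):
    # Print overall CPU utilization once per interval while the benchmark runs.
    while not stop_event.is_set():
        print('CPU usage: %.1f%%' % psutil.cpu_percent(interval=interval))

stop = threading.Event()
monitor = threading.Thread(target=report_cpu, args=(stop,), daemon=True)
monitor.start()
# ... run the OCR benchmark here ...
stop.set()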
Hi sirfz, this might be because I am doing these tests on my not-so-powerful laptop. Point taken, though.
I have the chance to try the same code in different environments; I shall do so and come back to this issue thread soon.
Thanks!
Hi sirfz, I re-did the same tests in a completely different environment (a Windows-based machine, a little more powerful), and it seems like I finally got the speed-up I was looking for using the exact same code. I am not completely sure what the determining factor was that prevented the speed-up in my first environment, but a far-fetched guess might be related to how multi-threading is handled by tesseract itself.
Anyhow, I will present some results for future readers. I ran the following benchmark:
6 runs of the code published before (see my previous comments) for each thread count (1 to 10 threads, to be precise) on a 10-page document of about 260 KB (not much, to be honest; I find the overall result still slow, but it might be my system).
Here are the results (time in seconds):
Threads | Average time (s) | Stdev (s) | Median (s) |
---|---|---|---|
1 | 93.85204608 | 1.782516798 | 93.70995784 |
2 | 32.15295778 | 13.28714628 | 27.14048362 |
3 | 12.65110448 | 5.623192528 | 10.14172578 |
4 | 10.33893752 | 1.606349404 | 10.31965017 |
5 | 12.51637963 | 2.98548617 | 13.32367992 |
6 | 10.2195756 | 2.370175547 | 10.22318089 |
7 | 13.59023993 | 3.803012242 | 15.10412645 |
8 | 11.89145529 | 2.999152741 | 10.78427005 |
9 | 11.4889638 | 1.825135847 | 11.27735865 |
10 | 11.26923306 | 2.33779279 | 10.58122993 |
It is interesting to see that it kind of caps at three threads (I expected four, the number of cores available on my CPU). However, I find the result satisfying, since it's quite the speed-up over single-threaded OCR (which is, by the way, extremely slow for some reason).
If you think it would be helpful to other people, I could open a PR with the code above (with some polishing) and place it in an examples folder.
Thanks for your support!
Hi @arocketman, can you share your final code for this issue and/or make a PR like you suggested? I think I'm not the only one who has problems with the threading implementation, so everyone can learn from your work.
Thanks in advance !
I noticed that converting to grayscale (convert_from_bytes(raw_pdf.read(), dpi=300, thread_count=4, grayscale=True)) gained another ~10% decrease in time, using @arocketman's code from here.
@sirfz ,
> You can always try multiprocessing instead but I don't think it's gonna be any different, theoretically.

I disagree. Python's global interpreter lock (GIL) should prevent any true parallelism theoretically – so preemptive multitasking would only yield a speedup by better multiplexing blocking I/O operations.
I would recommend multiprocessing, see #229.
(However, there seems to be another option at least here in this case, cf. Cython multithreading...)
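(For reference, a minimal multiprocessing sketch, not the code from #229: each worker process builds its own PyTessBaseAPI in an initializer, so nothing is shared across processes; the images are pickled to the workers, which adds some overhead. On Windows the call site must also be guarded by if __name__ == '__main__':.)

import concurrent.futures

import tesserocr

_api = None  # one API instance per worker process

def _init_worker():
    global _api
    _api = tesserocr.PyTessBaseAPI()

def ocr_image(img):
    _api.SetImage(img)
    return _api.GetUTF8Text()

def ocr_all(images, workers=4):
    # initializer runs once in every worker process (Python 3.7+)
    with concurrent.futures.ProcessPoolExecutor(
            max_workers=workers, initializer=_init_worker) as executor:
        return list(executor.map(ocr_image, images))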
@arocketman ,
> I am not completely sure what the determining factor was that prevented the speed-up in my first environment, but a far-fetched guess might be related to how multi-threading is handled by tesseract itself.

That might absolutely be the case: in Tesseract proper it has been repeatedly discussed (and finally documented) that the OpenMP implementation usually wastes performance (to some extent even with OMP_THREAD_LIMIT=1) and should therefore not be compiled in when building libtesseract.
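(If your libtesseract was built with OpenMP, limiting its internal threading is easy to try; a sketch, assuming your build respects the standard OpenMP variable:)

import os

# Must be set before the library spins up its OpenMP threads,
# i.e. before any PyTessBaseAPI is created (before the import is safest).
# Equivalently, from the shell: OMP_THREAD_LIMIT=1 python benchmark.py
os.environ['OMP_THREAD_LIMIT'] = '1'

import tesserocr

api = tesserocr.PyTessBaseAPI()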
Regarding your measurements: since the figures you presented are wall-clock time, it would probably help to complement these with CPU-time seconds and CPU usage (like /usr/bin/time does or time.process_time() provides).
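(For example, a sketch that records both; note that time.process_time() only covers the current process, so CPU time spent in pdf2image's subprocesses is not included:)

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

# ... run the OCR benchmark here ...

wall = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start
print('wall: %.2fs  cpu: %.2fs  utilization: %.0f%%' % (wall, cpu, 100 * cpu / wall))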
> I disagree. Python's global interpreter lock (GIL) should prevent any true parallelism theoretically – so preemptive multitasking would only yield a speedup by better multiplexing blocking I/O operations.
> I would recommend multiprocessing, see #229.
> (However, there seems to be another option at least here in this case, cf. Cython multithreading...)

@sirfz, sorry, I guess I was wrong. You already meticulously placed with nogil everywhere years ago. I guess the additional functionality described in the above link (cython.parallel.prange and cython.parallel.parallel) is not strictly necessary and can be approximated via Python facilities as well.
But the initial stumbling point here was the API initialization for each request. IMO the current README contains a formulation that sends users right down the wrong path:
Yes, that's why I said that theoretically multithreading should give decent results (close to multiprocessing), because the GIL is being released by tesserocr (for some API methods).
Indeed, although image_to_text and file_to_text are convenient and provide decent parallel performance using threading, it is always best to maintain your own pool of TessBaseAPI instances and reuse them to avoid initializing a new instance for every request.
Hi. I am trying to use concurrent execution to parallelise OCR. I am following the approach @arocketman suggested but, as with his first machine, I am getting a slowdown instead of a speedup.
I have tweaked the number of threads in my pool and it seems that the execution time and the number of threads are positively correlated.
I also need to mention that my images are small, containing up to around 50 words each. Here are some of the results I got while tweaking the number of threads.
No concurrency | 1 thread | 2 threads | 3 threads | 4 threads | Num. images |
---|---|---|---|---|---|
0.14638 | 0.16825 | 0.18356 | 0.23780 | 0.24445 | 10 |
1.53653 | 1.83914 | 2.11177 | 2.70460 | 2.66555 | 30 |
1.75414 | 2.69041 | 3.13779 | 4.37517 | 4.44549 | 136 |
3.40942 | 5.67881 | 5.99834 | 11.05777 | 11.19764 | 204 |
14.10713 | 17.74640 | 22.28873 | 32.59760 | 30.43171 | 320 |
Hello! I need some help with this:
I have a 3 step process:
I am confused about how to parallelize this whole process. I have realized this can be done in a lot of ways and I am not sure which way to go.
Run a separate process for doing OCR and use your code, and pass all OCR calls to this process. [I don't have clarity on how to implement this; see the sketch below.]
Please let me know if you need some additional info, or point me to a resource I can follow to get clarity on this.
Thank you :-)
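(Regarding the "separate process for doing OCR" option above, a minimal sketch with illustrative names: a single worker process owns the PyTessBaseAPI and receives images over a multiprocessing queue.)

import multiprocessing as mp

import tesserocr

def ocr_worker(task_queue, result_queue):
    # The worker process owns the API instance; nothing is shared across processes.
    api = tesserocr.PyTessBaseAPI()
    try:
        while True:
            item = task_queue.get()
            if item is None:  # sentinel: shut down
                break
            idx, img = item
            api.SetImage(img)
            result_queue.put((idx, api.GetUTF8Text()))
    finally:
        api.End()

def ocr_images(images):
    task_queue, result_queue = mp.Queue(), mp.Queue()
    worker = mp.Process(target=ocr_worker, args=(task_queue, result_queue))
    worker.start()
    for idx, img in enumerate(images):
        task_queue.put((idx, img))
    task_queue.put(None)
    results = [result_queue.get() for _ in range(len(images))]
    worker.join()
    return [text for _, text in sorted(results)]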
Hi, I have the following code:
Using the commented part (a simple for loop), it works faster than the ThreadPoolExecutor with map. I am not sure why this is happening, but we are talking about a 40-second difference. Is this the proper way to do it, or am I doing something wrong?
Another option I was exploring is having 4 classes with 4 different API objects and using a Python Queue to handle access to those class instances.
Do you have any suggestions, or even better an example of how you used tesserocr with multithreading or multiprocessing?
Thanks for your help!