tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Running multiple tesseract instances in parallel is slower than running in serial #3109

Open ole-tange opened 4 years ago

ole-tange commented 4 years ago

Environment

Current Behavior:

It seems that running multiple tesseract instances in parallel is dramatically slower than running a single instance, which in turn is much slower than running multiple tesseract instances in parallel with OMP_THREAD_LIMIT=1.

Based on https://twitter.com/ghalfacree/status/1310533844907577344 I tested whether this issue is specific to GNU Parallel or whether it also occurs when using xargs. It does.

single-threaded-tesseract in parallel < single multi-threaded-tesseract < multi-threaded-tesseract in parallel

9 seconds < 16 seconds < 459 seconds

As you can see, the multi-threaded tesseract runs in parallel are slower by a huge margin.

$ ls foo*jpg |head | time xargs -I{} -n1 /usr/local/bin/tesseract {} foo
Tesseract Open Source OCR Engine v5.0.0-alpha-781-gb19e3 with Leptonica
:
Tesseract Open Source OCR Engine v5.0.0-alpha-781-gb19e3 with Leptonica
44.29user 1.05system 0:16.43elapsed 275%CPU (0avgtext+0avgdata 83392maxresident)k
460538inputs+0outputs (0major+311928minor)pagefaults 0swaps

$ ls foo*jpg |head | time xargs -P4 -I{} -n1 /usr/local/bin/tesseract {} foo
Tesseract Open Source OCR Engine v5.0.0-alpha-781-gb19e3 with Leptonica
:
Tesseract Open Source OCR Engine v5.0.0-alpha-781-gb19e3 with Leptonica
1416.10user 6.00system 7:39.97elapsed 309%CPU (0avgtext+0avgdata 83376maxresident)k
460538inputs+0outputs (0major+311900minor)pagefaults 0swaps

$ export OMP_THREAD_LIMIT=1
$ ls foo*jpg |head | time xargs -P4 -I{} -n1 /usr/local/bin/tesseract {} foo
Tesseract Open Source OCR Engine v5.0.0-alpha-781-gb19e3 with Leptonica
:
Tesseract Open Source OCR Engine v5.0.0-alpha-781-gb19e3 with Leptonica
25.30user 1.28system 0:09.53elapsed 278%CPU (0avgtext+0avgdata 83364maxresident)k
460538inputs+0outputs (0major+311855minor)pagefaults 0swaps

Expected Behavior:

I had expected the multi-threaded version to be about as fast as the single-threaded version when both are run in parallel.

Suggested Fix:

Tell the user to use GNU Parallel/xargs + OMP_THREAD_LIMIT=1 when running multiple instances in parallel.
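For reference, a GNU Parallel form of that workaround might look like the following sketch (the foo*jpg names mirror the runs above; {.} is GNU Parallel's replacement string for the input file name without its extension, used here as the output base so each image gets its own output file):

# One single-threaded tesseract per CPU thread (GNU Parallel's default number of job slots);
# OMP_THREAD_LIMIT=1 stops each process from spawning extra OpenMP threads.
OMP_THREAD_LIMIT=1 parallel tesseract {} {.} ::: foo*jpg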

stweil commented 4 years ago

This is nothing new. For mass production I even suggest building Tesseract without multithreading support (see the documentation).

With multithreading enabled, a single tesseract process will run up to 4 threads, so 2 processes will start 8 threads. If you are running on a machine which cannot run that number of threads at the same time, there will be a huge overhead for thread scheduling.

As multithreading is only used for some parts of the OCR process, running parallel single-threaded tesseract processes makes the best use of your available CPU cores.
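For the build-time route, something along these lines should work with the autotools build (a sketch; --disable-openmp is the switch provided by the standard autoconf OpenMP macro, so check ./configure --help on your checkout):

# Build a tesseract binary without OpenMP, so each process is strictly single-threaded
./autogen.sh
./configure --disable-openmp
make -j"$(nproc)"
sudo make install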

sarathcp commented 4 years ago

From tesseract 4.1 on, I can see that OpenMP is disabled by default. Does that mean it is always single-threaded without OMP_THREAD_LIMIT? Do we need to enable OpenMP in order for OMP_THREAD_LIMIT to have any effect?
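One way to check what a given binary actually supports is the version output; on recent 5.x builds it lists OpenMP among the detected features when it was compiled in (I have not verified the exact wording for 4.1):

# A build with OpenMP compiled in reports it here (e.g. a "Found OpenMP ..." line on 5.x builds);
# without OpenMP compiled in, OMP_THREAD_LIMIT has nothing to act on.
tesseract --version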

ole-tange commented 4 years ago

> With multithreading enabled, a single tesseract process will run up to 4 threads, so 2 processes will start 8 threads. If you are running on a machine which cannot run that number of threads at the same time, there will be a huge overhead for thread scheduling.

This by itself cannot explain the massive increase.

nthreads=$(parallel --number-of-threads)

# Run 10 jobs per cpu thread in total
# Run 1 job per cpu thread in parallel
seq ${nthreads}0 | time parallel -N0 'seq 1000000 | bzip2 -9 >/dev/null'
83.05user 3.89system 0:16.31elapsed 532%CPU (0avgtext+0avgdata 18964maxresident)k
0inputs+29192outputs (0major+234615minor)pagefaults 0swaps

# Run 10 jobs per cpu thread in total
# Run 10 jobs per cpu thread in parallel
seq ${nthreads}0 | time parallel -j0 -N0 'seq 1000000 | bzip2 -9 >/dev/null'
85.14user 4.34system 0:16.75elapsed 534%CPU (0avgtext+0avgdata 19896maxresident)k
0inputs+15288outputs (0major+264817minor)pagefaults 0swaps

Example 1 runs one thread per cpu thread. Example 2 runs 10 threads per cpu thread.

Thus this should be more extreme than tesseract, which only starts 4 threads. Yet the extra time spent is less than 10%.

Something else is going on here. My best guess is that tesseract detects the CPU cache size and optimizes some processing for it. So if multiple threads use the cache, it will be invalidated all the time and the optimization gives no benefit at all.

Does this sound like a plausible explanation?

stweil commented 4 years ago

> tesseract detects the CPU cache size and optimizes some processing for it

There is no code in Tesseract which detects the cache size.

ole-tange commented 4 years ago

I did a bit of testing of the git version on my 64C64T server. The input was 174 images.

As you can see, the multi-threaded runs (MT, the default) are faster than running multiple single-threaded instances (ST) up to around 20 processes in parallel. At 1 process MT is quite a bit faster than ST.

Between 20 and 30 processes it is a toss-up, but above 30 processes ST is faster. As expected ST performs best around 64 processes, and it is not much worse when going to 128. The graph only goes to 90, but ST stays low up to at least 128.

This confirms the experience I have from my 2C4T laptop: MT is faster if I run a single instance, ST is faster if I run more instances. On the laptop I am probably only experiencing the values that correspond to 1 process and 64 processes in the graph.

This means: If you are doing batch processing, you should run ST in parallel - one per CPU thread. If you are doing a single image and need the result fast, you should use MT.

It does not explain the root cause, though.
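To make the batch-vs-single-image recommendation above concrete, a sketch of the two modes (scans/*.jpg and scan.jpg are placeholder names):

# Batch processing: one single-threaded instance per CPU thread
export OMP_THREAD_LIMIT=1
ls scans/*.jpg | xargs -P"$(nproc)" -I{} tesseract {} {}
# (the output base here is just the input path; tesseract appends .txt to it)

# Single urgent image: let one instance use its OpenMP threads
unset OMP_THREAD_LIMIT
tesseract scan.jpg scan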

strace suggests futex is causing the problem: there are a lot of futex calls, and half of them return "EAGAIN (Resource temporarily unavailable)". This does not happen when running a single MT process: then futex returns 0. That could explain the slowdown: if there is some shared resource that all of the running processes need, contention will grow with the number of running processes. But each parallel process should not interfere with a futex of another process, so I do not understand why two different processes would slow each other down.

I have tested whether it makes a difference if the processes are run by different users. The idea is that user One and user Two cannot read each other's memory, so processes run by One should not interfere with processes run by Two.

So I ran 30 processes as user One and 30 more as user Two - i.e. 60 MT in parallel in total. This took 2000 seconds. Given the graph below I had expected something in the order of either 2×600 seconds (if the shared resource is shared across users) or 2×60 seconds (if the shared resource is not shared across users), so this was a bit of a surprise to me.

Running 2×30 ST in parallel as user One and user Two takes 46 seconds (2×28 seconds was expected from the graph), so this is a bit faster than expected, though the deviation is less drastic.

Would it be possible for Tesseract to signal to other instances if it is running? E.g. by opening a socket or setting a flag that other instances can see. If that flag is raised, the next instance would run ST.

Processes in parallel vs. time in seconds: [graph image]
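Until something like that exists inside Tesseract, the flag idea could be approximated at the shell level with a lock file. A rough sketch, assuming flock(1) is available and using a made-up /tmp/tesseract-mt.lock path:

#!/bin/bash
# ocr-one.sh IMAGE: the first instance to grab the lock runs multi-threaded,
# any instance started while the lock is held falls back to single-threaded.
img=$1
(
  if flock -n 9; then
    export OMP_THREAD_LIMIT=4   # assumed MT thread count
  else
    export OMP_THREAD_LIMIT=1
  fi
  tesseract "$img" "${img%.*}"
) 9>/tmp/tesseract-mt.lock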

stweil commented 4 years ago

> Would it be possible for Tesseract to signal to other instances if it is running?

Sure, that would not be too difficult to implement. We could also test the system load and limit the number of threads accordingly (like GNU make can do for parallel builds).
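For comparison, that kind of load-based throttling already exists on the calling side: GNU make has -l/--max-load and GNU Parallel has --load. A sketch of the user-side workaround (not a Tesseract feature):

# Only start new OCR jobs while the machine's load stays below 100% of the cores
parallel --load 100% tesseract {} {.} ::: foo*jpg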

Nevertheless there remains some small overhead from the OpenMP code, even if Tesseract runs single-threaded. That's why I build it without OpenMP support for mass production.

ole-tange commented 4 years ago

If possible I think you should optimize for 2 scenarios:

* batch processing of many images, where one single-threaded instance per CPU thread is fastest
* a single image where the result is needed as fast as possible, where multithreading helps

and preferably Tesseract should automatically determine whether it is running in one or the other scenario.

All the scenarios in between are of more theoretical interest.

I understand that I can compile my own optimized binary, but I do not think most people are ready to do that.