OCRmyPDF is generally extremely slow with Tesseract 4, even with SSE / AVX / AVX2. This is caused by excessive multithreading. OCRmyPDF runs one tesseract process per CPU core by default, and each tesseract process uses four threads most of the time. That means four threads per CPU core, resulting in a large overhead for thread context switches.
The performance improves dramatically as soon as the tesseract processes are run single-threaded. This currently requires setting an environment variable (OMP_THREAD_LIMIT=1), which OCRmyPDF could do itself.
Tesseract 3 did not use multithreading, so it did not have that problem.
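In the meantime the variable can be set manually when invoking ocrmypdf; the file names below are placeholders:
env OMP_THREAD_LIMIT=1 ocrmypdf input.pdf output.pdf
or, for the whole shell session:
export OMP_THREAD_LIMIT=1
ocrmypdf input.pdf output.pdf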
@stweil ocrmypdf actually runs no more than two Tesseract v4 processes at a time: https://github.com/jbarlow83/OCRmyPDF/blob/master/ocrmypdf/pipeline.py#L1187
As such, env OMP_THREAD_LIMIT=1 ocrmypdf ... will give worse results (time to completion), except on a dual-core machine where OMP_THREAD_LIMIT=1 happens to be optimal. See here on a big box:
$ # 16 core machine
$ # baseline
$ time ocrmypdf -j16 -f --output-type pdf 000173.pdf /dev/null
real 4m7.545s
user 16m52.524s
sys 0m30.456s
$ # changed
$ env OMP_THREAD_LIMIT=1 time ocrmypdf -j16 -f --output-type pdf 000173.pdf /dev/null
real 4m54.84s
user 10m17.94s
sys 0m24.94s
Anyway, your remarks did inspire me to investigate performance further. I tried a variety of options under the constraint (OMP_THREAD_LIMIT) * (jobs_limit) = 16 on this 16-core Linux machine, and the best result comes from removing the jobs_limit and setting OMP_THREAD_LIMIT=1. Somewhat unexpectedly, Tesseract seems more performant when single-threaded processes are run in parallel. I'll make that change in a future release, although I also want to switch to multipage batching, since that improves Tesseract's performance as well, and OpenMP may be more helpful there.
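The combinations I tried follow this pattern (a sketch of the idea only; the jobs_limit inside pipeline.py is not exposed on the command line, so testing its removal also required editing that cap):
# keep OMP_THREAD_LIMIT * jobs = 16 and time each combination
for threads in 1 2 4 8 16; do
  jobs=$((16 / threads))
  echo "OMP_THREAD_LIMIT=$threads jobs=$jobs"
  env OMP_THREAD_LIMIT=$threads time ocrmypdf -j"$jobs" -f --output-type pdf 000173.pdf /dev/null
done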
Separate from this, I think the "ruffus" multiprocessing library has a lock contention issue that prevents full exploitation of CPU cores. ocrmypdf seems "IPC bound" on big machines (since running multiple instances of ocrmypdf will fully exploit the machine).
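"Running multiple instances" here means something like the following, with the work split into separate input files beforehand (file names are placeholders):
ocrmypdf -j8 -f --output-type pdf part1.pdf part1_ocr.pdf &
ocrmypdf -j8 -f --output-type pdf part2.pdf part2_ocr.pdf &
wait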
For the AVX2 issue, I found this while exploring sporadic Debian CI failures. In the failure state the test suite fails after 75 minutes; in the success state it passes in 15, under the same OpenMP conditions (variable unset). I replicated this on macOS by compiling Tesseract 4 with AVX2 stripped out. Note that the time to run the test suite changed very little from Tesseract 3 to 4: Dec 19 was the day Debian sid switched to Tesseract 4, and the successful test suite went from ~10 to ~15 minutes. This is to illustrate that the multithreading performance issue you raised is independent of the AVX2 issue.
I have just run a test on Debian with ocrmypdf 5.5-2 and tesseract-ocr 4.00~git2219-40f43111-1.2 using a CPU with 4 cores. The default aborts with a timeout after 26 minutes; that's what most users will get. Restricting ocrmypdf to a single job while tesseract uses its default of 4 threads finishes in 4:28 minutes, while limiting tesseract to a single thread gives the fastest result at 4:02 minutes.
time ocrmypdf -l deu+eng --jobs 1 --force-ocr input.pdf output1.pdf
real 4m28,673s
user 10m54,900s
sys 0m14,636s
export OMP_THREAD_LIMIT=1
time ocrmypdf -l deu+eng --force-ocr input.pdf output2.pdf
real 4m2,270s
user 7m37,168s
sys 0m12,004s
unset OMP_THREAD_LIMIT
time ocrmypdf -l deu+eng --force-ocr input.pdf output2.pdf
[aborts with TimeoutExpired]
real 26m25,923s
user 103m19,652s
sys 0m36,268s
@jbarlow83, I assumed that ocrmypdf starts one process per core because of this information in the man page:
-j N, --jobs N
Use up to N CPU cores simultaneously (default: use all).
@stweil In v6.0.0 I implemented OMP_THREAD_LIMIT=1 and removed the jobs_limit on the number of Tesseract processes, since that combination seemed to yield an improvement everywhere.
It generally seems better to parallelize Tesseract than to rely on OpenMP. The only exception I have found is that Tesseract with OpenMP is more CPU-efficient when given a large page list (although not faster). I haven't checked images with very high pixel counts, i.e. much larger than a standard US Letter/A4 page.
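"Large page list" refers to Tesseract's list-file input. Roughly, and assuming that form of the tesseract command line (file names are placeholders):
# One single-threaded process per page (roughly what ocrmypdf does after the v6.0.0 change):
env OMP_THREAD_LIMIT=1 tesseract page_0001.png page_0001 pdf
env OMP_THREAD_LIMIT=1 tesseract page_0002.png page_0002 pdf
# Multipage batching: one process reads a text file listing every page image,
# which gives OpenMP a single long-running process to keep busy:
printf '%s\n' page_*.png > pages.txt
tesseract pages.txt combined pdf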
Note: description added late
Debian Bug Report, Debian CI Logs