ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Test suite too slow with Tesseract 4 on machines without AVX2 #217

Closed jbarlow83 closed 6 years ago

jbarlow83 commented 6 years ago

Note: description added late

Debian Bug Report Debian CI Logs

stweil commented 6 years ago

OCRmyPDF is generally extremely slow with Tesseract 4, even with SSE / AVX / AVX2. This is caused by excessive multithreading. OCRmyPDF runs one tesseract process per CPU core by default, and each tesseract process uses four threads most of the time. That means four threads per CPU core, resulting in a large overhead for thread context switches.

The performance improves dramatically as soon as the tesseract processes are run single threaded. This currently requires setting an environment variable (OMP_THREAD_LIMIT=1) which could be done by OCRmyPDF.
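As an illustration of how a wrapper could set that variable for its child processes only, here is a minimal Python sketch. It is not taken from OCRmyPDF's source; the helper names single_threaded_env and run_single_threaded are hypothetical. It assumes only that the child honours OMP_THREAD_LIMIT, which is a standard OpenMP environment variable.

```python
import os
import subprocess


def single_threaded_env(base_env=None):
    """Return a copy of the environment with OpenMP limited to one thread."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def run_single_threaded(cmd):
    """Run cmd (a list of arguments) with OMP_THREAD_LIMIT=1 set for the child only."""
    return subprocess.run(
        cmd,
        env=single_threaded_env(),  # the parent's own environment is left untouched
        check=True,
        capture_output=True,
        text=True,
    )
```

A call such as run_single_threaded(["tesseract", "page.png", "page", "pdf"]) would then start one single-threaded Tesseract process; the file names here are placeholders.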

Tesseract 3 did not use multithreading, so it did not have that problem.

jbarlow83 commented 6 years ago

@stweil ocrmypdf actually runs no more than two Tesseract v4 processes at a time: https://github.com/jbarlow83/OCRmyPDF/blob/master/ocrmypdf/pipeline.py#L1187

As such, running env OMP_THREAD_LIMIT=1 ocrmypdf ... on its own will take longer to complete, except on a dual-core machine, where OMP_THREAD_LIMIT=1 is coincidentally optimal. See this run on a big box:

$ # 16 core machine
$ # baseline
$ time ocrmypdf -j16 -f --output-type pdf 000173.pdf /dev/null
real    4m7.545s
user    16m52.524s
sys     0m30.456s

$ # changed
$ env OMP_THREAD_LIMIT=1 time ocrmypdf -j16 -f --output-type pdf 000173.pdf /dev/null
real    4m54.84s
user    10m17.94s
sys     0m24.94s

Anyway, your remarks did inspire me to investigate performance further. I tried a variety of options under the constraint (OMP_THREAD_LIMIT) * (jobs_limit) = 16 on this 16-core Linux machine, and the best result comes from removing the jobs_limit and setting OMP_THREAD_LIMIT=1. A bit unexpectedly, Tesseract seems more performant when single-threaded processes are run in parallel. I'll make that change in a future release, although I want to switch to multipage batching, since that also improves Tesseract's performance, and perhaps OpenMP is more helpful there.
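For readers who want to repeat that sweep, the candidate settings under the constraint can be listed with a short Python sketch. The function name thread_job_splits is hypothetical, and "jobs" here stands for the limit on concurrent Tesseract processes (the jobs_limit mentioned above), which is an interpretation rather than something checked against the source code.

```python
def thread_job_splits(cores):
    """List (omp_thread_limit, jobs) pairs whose product equals cores."""
    return [(t, cores // t) for t in range(1, cores + 1) if cores % t == 0]


# Print every split for a 16-core machine.
for omp_threads, jobs in thread_job_splits(16):
    print(f"OMP_THREAD_LIMIT={omp_threads} with {jobs} concurrent Tesseract processes")
```

The first pair, one thread per process and sixteen processes, corresponds to the setting reported as fastest in the comment above.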

Separate from this, I think the "ruffus" multiprocessing library has a lock contention issue that prevents full exploitation of CPU cores. ocrmypdf seems "IPC bound" on big machines (since running multiple instances of ocrmypdf will fully exploit the machine).

For the AVX2 issue, I found this while exploring sporadic Debian CI failures. In the failure state the test suite fails after 75 minutes; in the success state it passes in 15, under the same OpenMP conditions (variable unset). I replicated this on macOS by compiling Tesseract 4 with AVX2 stripped out. Note that the time to run the test suite changed very little between Tesseract 3 and 4: Dec 19 was the day Debian sid switched to Tesseract 4, and the successful test suite went from ~10 to ~15 minutes. This is to illustrate that the multithreading performance issue you raised is independent of the AVX2 issue.


stweil commented 6 years ago

I have just run a test on Debian with ocrmypdf 5.5-2 and tesseract-ocr 4.00~git2219-40f43111-1.2 on a CPU with 4 cores. With the defaults, the run aborts with a timeout after 26 minutes. That's what most users will get. Restricting ocrmypdf to a single job while tesseract uses its default of 4 threads completes in 4:28 minutes, while limiting tesseract to a single thread gives the fastest result at 4:02 minutes.

time ocrmypdf -l deu+eng --jobs 1 --force-ocr input.pdf output1.pdf
real    4m28,673s
user    10m54,900s
sys     0m14,636s

export OMP_THREAD_LIMIT=1
time ocrmypdf -l deu+eng --force-ocr input.pdf output2.pdf
real    4m2,270s
user    7m37,168s
sys     0m12,004s

unset OMP_THREAD_LIMIT
time ocrmypdf -l deu+eng --force-ocr input.pdf output2.pdf
[aborts with TimeoutExpired]
real    26m25,923s
user    103m19,652s
sys     0m36,268s
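
One way to read the three timings above is to divide CPU time (user + sys) by wall-clock time (real), which gives the average number of cores kept busy. The Python sketch below does that arithmetic on the figures as posted; the helper names parse_time and busy_cores are hypothetical.

```python
import re


def parse_time(text):
    """Convert a shell time string such as '4m28,673s' or '4m7.545s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:[.,]\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    # Accept both '.' and ',' as the decimal separator.
    return int(minutes) * 60 + float(seconds.replace(",", "."))


def busy_cores(real, user, sys_time):
    """Average number of busy cores: (user + sys) / real."""
    return (parse_time(user) + parse_time(sys_time)) / parse_time(real)


# Timings copied from the three runs above.
runs = {
    "--jobs 1, default threads": ("4m28,673s", "10m54,900s", "0m14,636s"),
    "OMP_THREAD_LIMIT=1": ("4m2,270s", "7m37,168s", "0m12,004s"),
    "defaults (timed out)": ("26m25,923s", "103m19,652s", "0m36,268s"),
}
for label, (real, user, sys_time) in runs.items():
    print(f"{label}: {busy_cores(real, user, sys_time):.2f} cores busy on average")
```

For the timed-out run the ratio comes out close to the machine's four cores, so the CPU was saturated even though the job made far less progress. That is consistent with the thread oversubscription described earlier in the thread, though the ratio alone does not prove it.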

stweil commented 6 years ago

@jbarlow83, I assumed that ocrmypdf starts one process per core because of this information in the man page:

   -j N, --jobs N
          Use up to N CPU cores simultaneously (default: use all).

jbarlow83 commented 6 years ago

@stweil In v6.0.0 I implemented OMP_THREAD_LIMIT=1 and removal of the jobs_limit on the number of Tesseract processes, since that combination seemed to yield an improvement everywhere.

It generally seems better to parallelize Tesseract than to rely on OpenMP. The only exception I have found is that Tesseract with OpenMP is more CPU-efficient if given a large page list (although not faster). I haven't checked high-pixel-count images, i.e. those much larger than a standard US letter/A4 page.
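
The general pattern of parallelizing single-threaded workers can be sketched in Python as below. This is not OCRmyPDF's actual implementation (which at this point used the ruffus library mentioned earlier); run_all is a hypothetical helper that launches one child process per command, with at most one running per CPU core.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=None):
    """Run each command as its own process, at most one per CPU core at a time."""
    env = dict(os.environ, OMP_THREAD_LIMIT="1")  # each child stays single threaded
    workers = max_workers or os.cpu_count() or 1

    def run_one(cmd):
        return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)

    # The pool threads only wait on child processes, so the GIL is not a bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_one, commands))
```

A caller would pass one command per page, for example [["tesseract", "page1.png", "page1", "pdf"], ...] with placeholder file names, and results come back in the same order as the commands.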