tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

RFC: Tesseract performance #943

Open stweil opened 7 years ago

stweil commented 7 years ago

Performance is important for real time OCR, mass production OCR and training.

In this RFC I'd like to discuss performance bottlenecks and potential improvements.

See also the Tesseract wiki: https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance

According to my tests with Valgrind's tool callgrind, these functions have the largest computational costs (ordered by decreasing time):

  • memory allocations / deallocations
  • tesseract::UnicharIdArrayUtils::compare
  • tesseract::DotProductAVX (or the other implementations of the dot product)
  • vfprintf (called from snprintf)

tesseract::UnicharIdArrayUtils::compare and memory allocations / deallocations are also the functions which are called most often.

I recently had a closer look at the dot product calculations and noticed that at least some input vectors are converted from float to double (which takes time). The dot product is always done with double values (more expensive than float). If memory bandwidth is the limiting factor, using double means doubled time compared with float. The current code uses 4 parallel threads. I have run some timing tests without that parallelization and got nearly the same execution time. @theraysmith, did you try using float for the dot product, and do you get better performance from parallelization in that part of the OCR process?
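
To make the point concrete, here is a minimal scalar sketch of the pattern in question (an illustration only, not Tesseract's actual DotProduct code; the function name is made up):

```cpp
// Illustration: a dot product over float inputs that accumulates in double,
// so every element is widened from float to double before the multiply-add.
// Not Tesseract's real implementation; it only shows where the conversion
// cost comes from.
double DotProductWidened(const float* u, const float* v, int n) {
  double total = 0.0;
  for (int i = 0; i < n; ++i) {
    total += static_cast<double>(u[i]) * static_cast<double>(v[i]);
  }
  return total;
}
```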

theraysmith commented 7 years ago

Yes I've done quite a lot of performance vs accuracy testing.

Memory new/delete: The network system uses a rather complex scratchpad mechanism to minimize the number of allocations/deallocations. It helped a lot with both speed and peak memory. I'd be curious to know where the current bottleneck is if you have specifics of the stack where the allocations take place.

UnicharIdArrayUtils: Hmm. Looks like it may be doing too many calls to id_to_unichar(). Again, callers/stack traces would be useful.

snprintf: Really? Where from? Are you running with some debug flags on?

DotProduct: Float vs double:

The code is therefore optimized to minimize the memory bandwidth and the number of float->double conversions, while using double += double*double in the AVX code. (Since writing the AVX dot product, I thought of a better way of doing it, but haven't had time to implement that yet.) In any case, the time savings are small, a factor of <2.
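
For readers unfamiliar with the intrinsics, a hedged sketch of what such an AVX loop can look like (not the code that is actually in Tesseract; the function name and the assumption that n is a multiple of 4 are for illustration only):

```cpp
#include <immintrin.h>

// Sketch of an AVX dot product that widens float inputs to double and
// accumulates with double += double * double. Base AVX has no FMA, so the
// multiply and add are separate instructions.
double DotProductAVXSketch(const float* u, const float* v, int n) {
  __m256d sum = _mm256_setzero_pd();
  for (int i = 0; i < n; i += 4) {
    __m256d a = _mm256_cvtps_pd(_mm_loadu_ps(u + i));  // 4 floats -> 4 doubles
    __m256d b = _mm256_cvtps_pd(_mm_loadu_ps(v + i));
    sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
  }
  double parts[4];
  _mm256_storeu_pd(parts, sum);  // horizontal sum of the 4 partial results
  return parts[0] + parts[1] + parts[2] + parts[3];
}
```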

A good integer implementation may squeeze better performance out of it, but I haven't seen it yet. OTOH, I haven't tried it lately. AVX2 and AVX512 extend the integer operations beyond the ones available on SSE (not on base AVX) and may make it worth it, when I get a machine with AVX512. The additional benefit of (8 bit) integer is that it reduces the size of everything, making it more likely that data will stay in a higher-level cache.
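
As a rough illustration of the idea (again not existing Tesseract code; the name, signature and scale handling are assumptions), an 8-bit integer dot product might look like this:

```cpp
#include <cstdint>

// Sketch: weights and inputs quantized to int8_t with a single float scale
// factor, products accumulated in 32-bit integers, and the scale applied
// once at the end. Using 8-bit data quarters the memory traffic compared
// with float and makes it more likely that everything stays in cache.
float Int8DotProduct(const int8_t* w, const int8_t* x, int n, float scale) {
  int32_t acc = 0;
  for (int i = 0; i < n; ++i) {
    acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
  }
  return acc * scale;
}
```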

Far greater performance improvements can be made by making the network smaller. As I already indicated, I have had some very good results in this area, with a network 3x faster than the legacy code (for English) and much faster than the legacy code for complex scripts.

I have since messed up the synthetic data pipeline and training while trying to improve Indic, but I have major improvements in some other languages, so when I get back to the best results for everything, I think you'll like it.

BTW, I get a significant speed-up in wall time (better than 2x) with the OpenMP code running. Did you compile and link it correctly with OpenMP? I have noticed that the CPU profile with OpenMP running is of little practical use.
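
For reference, a minimal OpenMP sketch (an illustration only; Tesseract's actual parallelization sits at a higher level inside the LSTM code). The pragma only takes effect when both the compiler and the linker get the OpenMP flag, e.g. -fopenmp for GCC/Clang:

```cpp
// Sketch: an OpenMP reduction that splits the loop across the available
// threads and combines the per-thread partial sums at the end. Without
// -fopenmp the pragma is ignored and the loop runs serially.
double ParallelDotProduct(const double* u, const double* v, int n) {
  double total = 0.0;
#pragma omp parallel for reduction(+:total)
  for (int i = 0; i < n; ++i) {
    total += u[i] * v[i];
  }
  return total;
}
```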


stweil commented 7 years ago

Valgrind shows all callers, so I can provide that information (next week, as I'm currently busy with other things).

Regarding precision of the dot product: the addition is the critical part for the accuracy. Did you ever try some of the algorithms which help to improve that part, e.g. Kahan? Maybe that would be sufficient to allow using float everywhere.
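
For reference, a single-precision Kahan-compensated dot product could look roughly like this (a sketch of the idea only, not code from Tesseract; note that aggressive optimization flags such as -ffast-math may remove the compensation):

```cpp
// Sketch: Kahan (compensated) summation keeps a running correction term that
// recovers the low-order bits lost in each addition, which is what might make
// a pure-float accumulation precise enough.
float KahanDotProduct(const float* u, const float* v, int n) {
  float sum = 0.0f;
  float c = 0.0f;  // running compensation for lost low-order bits
  for (int i = 0; i < n; ++i) {
    float y = u[i] * v[i] - c;
    float t = sum + y;
    c = (t - sum) - y;  // algebraically zero, numerically the lost part
    sum = t;
  }
  return sum;
}
```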

I used OpenMP and disabled it only in lstm/fullyconnected.cpp and in lstm/weightmatrix.cpp. The original code of the master branch shows that OpenMP works, as the real time is much less than the user time:

real    2m50,958s
user    7m39,712s
sys     0m2,128s

With OpenMP disabled for the parts mentioned above, OpenMP still works, the real time increases moderately while the user time decreases:

real    3m9,378s
user    7m33,084s
sys     0m2,092s

stweil commented 7 years ago

Stack for tesseract::UnicharIdArrayUtils::compare (called 2651872 times for a small hello world image):

tesseract::UnicharIdArrayUtils::compare
ELIST::add_sorted_and_find
ELIST::add_sorted
tesseract::UnicharAmbigs::InsertIntoTable
tesseract::UnicharAmbigs::LoadUnicharAmbigs
...

Shreeshrii commented 7 years ago

Is this for --oem 0 or --oem 1?

I thought that Unicharambigs is not used by LSTM engine...

stweil commented 7 years ago

It is for LSTM:

valgrind --tool=callgrind --dump-line=yes --verbose api/tesseract --oem 1 hello.png /tmp/hello

The output is a large file called something like callgrind.out.1234 (replace 1234 by the process id). I suggest using kcachegrind to see the results.

Shreeshrii commented 7 years ago

@stweil

Is that with debug? What about without it?

stweil commented 7 years ago

No, it was built with ./configure --disable-shared --disable-static 'CXXFLAGS=-Wall -g -O2'. I usually disable the library builds because they take additional time.

stweil commented 7 years ago

My first results reported above were from a very small image (single line hello world), so initialization contributes significantly.

With a really large image (newspaper), the result changes and tesseract::DotProductAVX is the dominating element. Surprisingly, it is followed by gomp_team_barrier_wait_end and gomp_barrier_wait_end, which according to Valgrind use nearly as much time as the dot product. Those two functions are part of OpenMP.

See also issue #898.

stweil commented 7 years ago

Unrolling the loop in tesseract::DotProductAVX results in a nearly 7 % improvement:

# tesseract 0604.jp2 /tmp/0604 # git master, without OpenMP
real    2m54,469s
user    2m54,160s
sys 0m0,304s

# same test, but with improved tesseract::DotProductAVX
real    2m41,855s
user    2m41,576s
sys 0m0,272s
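
A hedged sketch of the unrolling idea (not the actual patch; the scalar form is used here just to keep the example short): two independent accumulators break the dependency chain between the additions, and a scalar tail handles the remaining elements.

```cpp
// Sketch: loop unrolled by a factor of two with independent accumulators,
// so the CPU can overlap the dependent floating-point additions.
double DotProductUnrolled(const double* u, const double* v, int n) {
  double s0 = 0.0, s1 = 0.0;
  int i = 0;
  for (; i + 1 < n; i += 2) {
    s0 += u[i] * v[i];
    s1 += u[i + 1] * v[i + 1];
  }
  for (; i < n; ++i) {  // tail for odd n
    s0 += u[i] * v[i];
  }
  return s0 + s1;
}
```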

stweil commented 7 years ago

Latest numbers based on code from FAU Erlangen (thanks to @moebiusband73), Kahan part removed:

real    2m31,514s
user    2m31,220s
sys 0m0,280s

That is an improvement of 12 %. Using assembler code could improve it further, but I'd expect the largest improvement from using float instead of double (trying to compensate for the loss of precision by using the Kahan algorithm).

mlissner commented 7 years ago

The conversation here is largely over my head, but I came to the bug tracker to discuss performance in 4.0, and this bug is titled "RFC: Tesseract Performance" so it seems like the right place. (Apologies if I'm wrong.)

Simple question: On the "Neural nets In Tesseract" wiki page, it says:

On a machine with multiple cores, and AVX, an easy English image may take twice as much real time, and use 7 times the CPU as base Tesseract

But above (and elsewhere) it says:

I have had some very good results in this area, with a network 3x faster than the legacy code (for English)

I do a lot of batch OCR using Tesseract. Can I expect 4.0 to be faster by 3× or slower by 7×? If the latter, that'll be something that we'll need to plan on, since our current infrastructure took over a month to complete the last batch OCR job using 3.02. If we need 7× more servers that's a huge deal — I don't know that we'd ever upgrade if that was the case. If it's 3× faster, that's incredible.

(Sorry again if this is the wrong place to bring this up. I'm trying to get a grasp on this situation. Thank you all for all your great work.)

stweil commented 7 years ago

The current 4.0 still supports the "old" OCR recognizer and is comparable to 3.05 if that is used (command line argument --oem 0), but there are plans to remove this feature from 4.0.

4.0 supports a new recognizer based on LSTM (--oem 1). The new code for LSTM uses parallel processing, so OCR can finish faster, but needs more total CPU time. As 4.0 is still experimental, all timing results can change. Ray announced new trained language models for 4.0 with a smaller neural network. If that works, it could reduce CPU time in comparison to 3.05.

Currently you won't reduce the infrastructure requirements with 4.0.

mlissner commented 7 years ago

That sounds promising, thanks @stweil. I'm perfectly happy not reducing infrastructure requirements, but increasing them 7× would be a very big deal.

FWIW, decreasing wall time via parallelization while increasing CPU time sounds good on paper, but unless I'm missing something (totally possible), it doesn't mean much for batch jobs like ours since we run 24 OCR processes in parallel already. Our server looks like this when doing a job:

[screenshot from 2017-05-10 16-18-40: server load while running 24 parallel OCR processes]

stweil commented 7 years ago

For your use case (which is similar to my own) I'd compile Tesseract without OpenMP support. Otherwise the parallelization will consume a significant part of the CPU capacity.

amitdo commented 7 years ago

What's the method you use to disable OpenMP?

Commenting out AC_OPENMP or something else?

stweil commented 7 years ago

configure --disable-openmp

amitdo commented 7 years ago

--disable-shared --disable-static seems to be equivalent to just --disable-shared.

Shreeshrii commented 7 years ago

Please see https://github.com/tesseract-ocr/tesseract/issues/961#issuecomment-305420753 for another example of a decrease in user time with --disable-openmp.

By disabling OpenMP, the user time dropped to almost one third on WSL on my AMD Windows 10 PC.

amitdo commented 5 years ago

Regarding precision of the dot product: the addition is the critical part for the accuracy. Did you ever try some of the algorithms which help to improve that part, e.g. Kahan? Maybe that would be sufficient to allow using float everywhere.

With some of NVIDIA's cards it's now possible to run the dot product on fp16 vectors: sum_fp32 += u_fp16 * v_fp16.

Google's TPUs also have something similar, with bfloat16 instead of the standard fp16.

Mark-Joy commented 11 months ago

Simply put, the LSTM model (oem=1) only produces a faster execution time than the legacy model (oem=0) if the number of images is less than (the machine's maximum number of threads / 4).

Example: for an ordinary computer whose maximum number of threads is 24, the LSTM model gives a faster execution time when processing a PDF with at most 6 pages (assuming 1 page is 1 image).

Mark-Joy commented 11 months ago

The current 4.0 still supports the "old" OCR recognizer and is comparable to 3.05 if that is used (command line argument --oem 0), but there are plans to remove this feature from 4.0.

Please don't remove this feature. The LSTM model needs a large number of threads (24 threads) just to process a small PDF (6 pages). Currently, the legacy model is the best solution for this job. Please poll us, the users, before removing this important feature.

BTW, what is the usage scenario for Tesseract 4 and 5 without the legacy model, if it turns out to be significantly more time consuming in mass processing?

stweil commented 11 months ago

@Mark-Joy, that's only true if Tesseract is running with OpenMP multithreading enabled. I always suggest disabling multithreading because OpenMP wastes a lot of CPU resources. Then LSTM is still faster than legacy mode, although it uses only a single CPU thread per page. On our OCR servers, mass processing runs with 48 or 64 CPU threads which process the same number of page images simultaneously. And of course they use our self-trained LSTM models, not legacy models.

Mark-Joy commented 11 months ago

@stweil, thank you for the information. With OpenMP disabled, may I ask which traineddata model can be used to get the LSTM model running faster than the legacy one?

On my machine, only the fast data model (oem=1) is 10-15% faster than legacy, but it gives unsatisfactory results regarding text recognition. 6-7 times out of 10 I would pick the legacy result whenever there's a difference.

With traineddata that supports both legacy (oem=0) and LSTM (oem=1), LSTM gives a 35% slower execution time.

The worst ones are the best data models (oem=1), where LSTM is 4-5 times (400%-500%) slower than legacy.