tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

cache performance and inline functions #1035

Open hotchkiss87 opened 7 years ago

hotchkiss87 commented 7 years ago

From general testing, it appears that I'm I/O bound while using tesseract, not CPU bound. I looked at CPU cache performance (profiled with valgrind and perf, though perf's output is easiest to read) and found that my machines have a high number of cache misses, and even a high number of iTLB misses. After inspecting the code, I also noticed that almost every function is defined as 'inline'. The high rate of iTLB misses surprised me. Usually iTLB misses drop off after a program has been running for a while, but increasing the run time (for instance with an image containing a lot more text) made my iTLB miss rate increase, perhaps indicating that I am filling my iTLB, which is rare.

The iTLB misses worry me, especially because running a program for a longer period of time typically reduces them (one inevitably has cache/TLB misses when executing something the CPU has not seen yet, but once it is in the cache, it should stay in the cache...). iTLB misses are expensive on Intel CPUs. I have used OpenMP in the past, but I'm not sure how it handles an iTLB miss per 'thread'. If each thread takes a miss and triggers a page-fault exception, and that is non-blocking (which makes sense), then each thread would cause a page walk for the same instructions. I'm not sure that's what happens, but I'd love to know what does happen (or a book or resource) if someone knows more about how OpenMP handles this.

I saw similarly styled code at my last job, where the main part of the program was written over 25 years ago. At the time I expressed surprise (and alarm) to a co-worker, who explained that compilers used to be terrible at inlining functions, and it wasn't until GCC 4 that there was a decent attempt to improve that. GCC 4 was released in 2005. (The GCC 4 you might use today has had many updates and improvements and is different from the GCC 4 of 2005, so GCC 4 is not that 'outdated', but it was the first release with major improvements in this area.) So before 2005, best practice looked much like what Tesseract does now: marking almost every function 'inline'.

The 'inline' specifier has some interesting aspects too, and was more 'important' in C, since C++ already treats certain functions as inline by default (from https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Inline.html#Inline):

As required by ISO C++, GCC considers member functions defined within the body of a class to be marked inline even if they are not explicitly declared with the inline keyword.

Also, GCC can ignore 'inline' as it sees fit. The -O2 optimization level includes -finline-small-functions, which inlines functions when doing so is not expected to grow the code. -O3 adds -finline-functions, which is more or less the same as writing 'inline' before every function and letting GCC ignore it wherever it decides inlining might hurt performance. -O2 is in the default configure options, so adding 'inline' before each function is similar to just compiling with '-finline-functions', except that it is slightly more explicit.
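As a rough illustration (this is only a sketch of how those flags can be exercised with the existing autotools build, not the project's documented procedure), extra flags can be passed through CXXFLAGS at configure time:

#!/bin/bash
# Sketch only: re-run from a clean tree for each variant.
# -Winline reports every 'inline' request the compiler refuses;
# -finline-functions (normally enabled by -O3) asks GCC to consider
# every function for inlining, as if each were marked 'inline'.
CXXFLAGS="-O2 -Winline" ./configure && make 2> inline-warnings.log
CXXFLAGS="-O2 -finline-functions" ./configure && make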

There are many downsides to defining a function as 'inline', but the one people usually seem to notice is increased compile time. There's plenty of information available on Google/GNU's documentation about inline, the history of inline, and slight differences between inline in g++ and GNU C back in the day.

The biggest example of what I'd call (modern) "overactive inline use" in tesseract is probably in these files: https://github.com/tesseract-ocr/tesseract/blob/master/lstm/functions.h https://github.com/tesseract-ocr/tesseract/blob/master/lstm/functions.cpp

functions.cpp has two lines of code, with almost everything in functions.h.

Below is an example of the compiler output with default options, i.e. keeping '-O2' and leaving all of the functions marked 'inline', plus -Winline (https://gcc.gnu.org/onlinedocs/gcc-7.1.0/gcc/Inline.html#Inline), which is defined as:

Using -Winline warns when a function marked inline could not be substituted, and gives the reason for the failure.

functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
 inline double Logistic(double x) {
               ^~~~~~~~
functions.h:64:37: note: called from here
   if (x < 0.0) return 1.0 - Logistic(-x);
                             ~~~~~~~~^~~~
functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
 inline double Logistic(double x) {
               ^~~~~~~~
functions.h:64:37: note: called from here
   if (x < 0.0) return 1.0 - Logistic(-x);
                             ~~~~~~~~^~~~
functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
 inline double Logistic(double x) {
               ^~~~~~~~
functions.h:64:37: note: called from here
   if (x < 0.0) return 1.0 - Logistic(-x);
                             ~~~~~~~~^~~~
functions.h:63:15: warning: inlining failed in call to ‘double tesseract::Logistic(double)’: call is unlikely and code size would grow [-Winline]
 inline double Logistic(double x) {
               ^~~~~~~~
functions.h:64:37: note: called from here
   if (x < 0.0) return 1.0 - Logistic(-x);
                             ~~~~~~~~^~~~
functions.h:45:15: warning: inlining failed in call to ‘double tesseract::Tanh(double)’: call is unlikely and code size would grow [-Winline]
 inline double Tanh(double x) {
               ^~~~
functions.h:46:28: note: called from here
   if (x < 0.0) return -Tanh(-x);
                        ~~~~^~~~

As a test on my target machines, I copied my 'tesseract' repository, removed all of the (easy to remove...) 'inline' specifiers, shifted the definitions into the .cpp files instead of the .h files (leaving the declarations in the .h, as is standard practice), and compiled a separate version, installing it with a different prefix. This allowed me to have multiple versions of Tesseract on the same machine and to run tests with all of them.
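For reference, a minimal sketch of that side-by-side build; the checkout path is hypothetical, and the prefix matches the /home/hotchkiss/usr location used in the test script further down:

#!/bin/bash
# Build the modified (de-inlined) copy under a private prefix so it
# never shadows the system-wide /usr/bin/tesseract.
cd ~/src/tesseract-deinlined   # hypothetical path to the modified checkout
./autogen.sh
./configure --prefix=$HOME/usr
make -j$(nproc)
make install                   # installs /home/<user>/usr/bin/tesseract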

Next, I made a script. Fairly simple, utilizing GNU Parallel CITATION:

O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

The simple bash scripts look like this:

#!/bin/bash
# CONTROL SCRIPT
TESS=/usr/bin/tesseract
F_LOC=(location of my test images)
parallel $TESS $F_LOC/{1} stdout ::: img1.jpg img2.png img3.jpeg

and

#!/bin/bash
# TEST SCRIPT
TESS=/home/hotchkiss/usr/bin/tesseract
F_LOC=(location of my test images)
parallel $TESS $F_LOC/{1} stdout ::: img1.jpg img2.png img3.jpeg

I ran them ~50 times 'randomly' (as random as /dev/urandom can be when choosing between two options), meaning I randomly chose which of the two scripts to run each time, and collected the average run times and percentages as an aggregate using perf. Each run was after a fresh reboot, with nothing else running other than X, i3 (my window manager), and my default kernel + kernel modules (less than 110MB according to 'free -m').
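A minimal sketch of that loop (script names and output files are made up for illustration; it uses bash's $RANDOM instead of /dev/urandom for brevity and omits the reboot between runs):

#!/bin/bash
# Pick the control or the test script at random and record counters with
# 'perf stat' (the repeated -d options add the cache/TLB events shown below).
for i in $(seq 1 50); do
    if (( RANDOM % 2 )); then
        perf stat -d -d -o perf_control_$i.txt ./control_tess.sh
    else
        perf stat -d -d -o perf_test_$i.txt ./test_tess.sh
    fi
done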

Results look something like:

 Performance counter stats for 'test_tess':

      54913.737862      task-clock:u (msec)       #    6.757 CPUs utilized          
                 0      context-switches:u        #    0.000 K/sec                  
                 0      cpu-migrations:u          #    0.000 K/sec                  
           136,385      page-faults:u             #    0.002 M/sec                  
   161,251,587,471      cycles:u                  #    2.936 GHz                      (29.05%)
   154,963,746,273      instructions:u            #    0.96  insn per cycle           (36.65%)
    35,080,176,770      branches:u                #  638.823 M/sec                    (42.77%)
       306,913,327      branch-misses:u           #    0.87% of all branches          (41.68%)
    50,274,555,586      L1-dcache-loads:u         #  915.519 M/sec                    (27.81%)
     5,075,515,864      L1-dcache-load-misses:u   #   10.10% of all L1-dcache hits    (19.72%)
       450,004,836      LLC-loads:u               #    8.195 M/sec                    (17.70%)
        23,036,525      LLC-load-misses:u         #   10.24% of all LL-cache hits     (21.92%)
   <not supported>      L1-icache-loads:u                                           
        11,674,082      L1-icache-load-misses:u                                       (28.02%)
    48,676,934,856      dTLB-loads:u              #  886.425 M/sec                    (21.31%)
         1,322,784      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   (20.86%)
           264,354      iTLB-loads:u              #    0.005 M/sec                    (17.36%)
           690,732      iTLB-load-misses:u        #  261.29% of all iTLB cache hits   (21.42%)
   <not supported>      L1-dcache-prefetches:u                                      
   <not supported>      L1-dcache-prefetch-misses:u                                   

       8.127386606 seconds time elapsed

As expected, with no compiler optimizations and the 'inline' specifier removed from all functions, performance was significantly worse (8% to 10%). Also as expected with modern compilers (gcc 6.3 and gcc 8.0... I haven't run 7.x stable yet, although I could tomorrow), performance seems to be roughly the same, and (perhaps, though it needs more testing to be definitive) better on a CPU with smaller cache/TLB sizes.

I used GNU parallel to ensure that things were running in parallel, similar to what I will be using in production and (hopefully) what everyone else is doing. From what others have posted, it seems most people run it the same way.

What I'm asking

  1. How are you collecting cache and performance information? Is it standardized, and how would you like any statistics or data to be submitted/displayed?
  2. Is it alright if I remove and submit code without 'inline' declared on every function?
  3. Should I submit code with 'inline' removed from functions others have written?
  4. What are your (other developers') target machines like, and are my results and thought process similar?

Personally, I'm not a fan of manually marking all functions as inline, but I'm also (quite) used to re-writing code/libraries to optimize performance for my use case(s).

I can wait on posting any code changes until @theraysmith has the new beta tag out, though. I'm also not sure how these types of issues are handled on GitHub. I've spent most of my time working at an FFRDC and haven't submitted much FOSS code publicly because of it, so being able to post what I've done here is great. Typically, I'd either write my own tests and submit them along with compiled code (and a 'NO MERGE' pull request), or follow the tests the project already provides and submit according to them; however, I haven't seen any. What do you all prefer to use, and would this be something you're interested in? I absolutely abhor, detest, and strongly oppose programming for the compiler, but I feel these changes would actually do the opposite and let the compiler take over.

I'm also profiling functions for more detailed optimization with cachegrind, and I'll absolutely post those results after the changes have been submitted. Again, it's less of a 'logic' change and in some cases more of a logic 'reordering' (like the order of conditional tests, what to inline/skip/spend more time on, and the like).

Another option that might be decent, if it turns out that removing 'inline' does hurt performance on CPUs with large cache/TLB sizes, is to do something similar to what the Linux kernel has: an "Allow gcc to uninline functions marked 'inline'" option (under 'Kernel Hacking'), added because they ran into the same issue.


Environment

Tesseract Version: 4.0x (latest -dev)
Platform: Linux [hostname] 4.12.0 #1 SMP Tue Jul 11 14:56:49 EDT 2017 x86_64 Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz GenuineIntel GNU/Linux (plus an Intel Atom; I don't have its specs with me right now, but I can post them later)

tested with:
gcc version 8.0.0 20170711 (experimental) (GCC)
gcc version 6.3.0 (Gentoo 6.3.0 p1.0)
I can test with gcc 7.x later if need be. I haven't switched my development system's stable version of gcc to 7 yet, even though I still test the experimental svn branch.

theraysmith commented 7 years ago

I haven't measured cache performance in detail as you have - I've mostly measured real-time performance under various conditions and speculated as to the cause ;-) The detail you provide is very interesting.

One thing I have noticed is that the performance of OpenMP varies a lot according to CPU load, so you will get very poor performance trying to run multiple tesseracts in parallel. A major performance improvement is possible under such circumstances if you pin OpenMP threads to cores, which can be done using an environment variable: https://www.nas.nasa.gov/hecc/support/kb/using-intel-openmp-thread-affinity-for-pinning_285.html I think what happens is that the OpenMP runtime or the OS moves the threads around the cores when the machine is busy, for no good reason other than wanting to run other processes, which kills cache performance.
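For illustration, a minimal example of that kind of pinning (the image name is made up; KMP_AFFINITY is honoured only by Intel's OpenMP runtime, as noted further down):

# Pin OpenMP threads to cores before starting a single tesseract process
# (Intel OpenMP runtime only; see amitdo's comment below for GCC/libgomp).
KMP_AFFINITY=granularity=fine,compact tesseract img1.jpg stdout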

The weight matrices are mostly small enough to fit in core-level cache, but if they get moved, it will be much slower. The effect may be more intense with the smaller models that I have coming soon.

I have often wondered whether it would be more reliably fast if re-written to use low-level threads instead of OpenMP. That would be a lot of work and may not pay off, so I haven't done it. Alternatively, if you want to run it on multiple images in parallel using the parallel command, you might just get better performance by turning off OpenMP entirely and running it single-threaded.

I agree that some of the (larger) code in functions.h should be de-inlined.

I have a question on your opening statement. When you say I/O bound, I assume you mean between the core and main memory, not I/O to disk?


amitdo commented 7 years ago

Regarding OpenMP thread affinity control:

https://www.nas.nasa.gov/hecc/support/kb/using-intel-openmp-thread-affinity-for-pinning_285.html

KMP_AFFINITY is specific to Intel's OpenMP runtime.

GCC supports GOMP_CPU_AFFINITY https://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html#GOMP_005fCPU_005fAFFINITY

OpenMP 3.1 added OMP_PROC_BIND https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPROC_005fBIND.html#OMP_005fPROC_005fBIND

OpenMP 4.0 added OMP_PLACES https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPLACES.html#OMP_005fPLACES
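A minimal usage sketch of the variables above (the image name is illustrative, and the CPU list would need to match the actual machine):

GOMP_CPU_AFFINITY="0-3" tesseract img1.jpg stdout                  # libgomp-specific core list
OMP_PROC_BIND=true tesseract img1.jpg stdout                       # OpenMP 3.1: don't migrate threads
OMP_PLACES=cores OMP_PROC_BIND=close tesseract img1.jpg stdout     # OpenMP 4.0 placement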

stweil commented 7 years ago

What are your (other developer's) target machines like, and are my results and thought process similar?

I'm currently processing a large number of images, running Tesseract on 4 Xeon servers of different ages with (16 + 16 + 32 + 8) tesseract processes using GNU parallel. The binaries were built without OpenMP to avoid the overhead of multithreading and to allow optimal use of the available CPU cores (one process per core). This setup produced 150,000 hOCR files during the last 10 days (example).
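A rough sketch of that kind of setup on one machine, assuming an OpenMP-free build; the input path is hypothetical, and distributing jobs to the other servers would use GNU parallel's --sshlogin option:

# One single-threaded tesseract process per CPU core, producing hOCR output.
find /data/scans -name '*.tif' | parallel -j $(nproc) tesseract {} {.} hocr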

stweil commented 7 years ago

Regarding OpenMP thread affinity: I think that the problem won't be fixed by binding OpenMP threads to a fixed CPU core, nor will other thread libraries help. The current Tesseract simply creates too many threads even when a single process is started, because it uses interleaved #pragma omp parallel statements. As soon as more than one process is running, things get even worse because the number of threads then exceeds the number of CPU cores, and scheduling overhead occurs.

amitdo commented 7 years ago

because the number of threads then will exceed the number of CPU cores

But you can limit the number of threads being used.

stweil commented 7 years ago

I'm afraid you can limit it only by recompilation (either without OpenMP or with modified source code).

amitdo commented 7 years ago

Well, we can change the code...

(if we can prove the change will really make things better)

amitdo commented 7 years ago

Did you try OMP_THREAD_LIMIT=1 tesseract...?

stweil commented 7 years ago

No, I did not – obviously I missed that environment variable. I had only tried OMP_NUM_THREADS. Setting OMP_THREAD_LIMIT=1 indeed stops thread creation. Many thanks. Now I no longer have to build special binaries without OpenMP.
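A minimal usage sketch (image names are illustrative; when run locally, the child processes started by GNU parallel inherit the variable):

# Keep each tesseract process single-threaded so one process per core works well.
OMP_THREAD_LIMIT=1 tesseract img1.jpg stdout
OMP_THREAD_LIMIT=1 parallel tesseract {} stdout ::: img1.jpg img2.png img3.jpeg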

amitdo commented 7 years ago

Great!

I just discovered this variable an hour ago :-) https://gcc.gnu.org/onlinedocs/libgomp/Environment-Variables.html

amitdo commented 2 years ago

@stweil,

Regarding OpenMP thread affinity: I think that the problem won't be fixed by binding OpenMP threads to a fixed CPU core nor will other thread libraries help. The current Tesseract simply creates too many threads even if a single process is started because it uses interleaved #pragma omp parallel statements. As soon as more than one process is running things get even worse because the number of threads then will exceed the number of CPU cores, and scheduling overhead will occur.

https://github.com/tesseract-ocr/tesseract/search?q=omp+path%3Asrc%2Flstm

Maybe we should remove some of the #pragma omp parallel for directives to prevent the interleaving issue?

amitdo commented 2 years ago

https://github.com/tesseract-ocr/tesstrain/issues/259#issuecomment-1003122397

I'm OK with disabling OpenMP by default in 5.0.1 and 4.1.4.

I still think OpenMP in Tesseract can be improved (future work).