Run LSTM recognition in multiple threads

jkarthic commented 1 week ago

Init time option lstm_num_threads should be used to set the number of LSTM threads. This will ensure that word recognition can run independently in multiple threads, thus effectively utilizing multi-core processors.

Following are my test results for a sample screenshot.
CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
OS: Windows
Compiler: MSVC 19.38.33130.0 (installed from Visual Studio 2022)
Model: eng.traineddata from tessdata_fast
PSM: 6

Total time taken for the Recognize API call, built without OpenMP:
With lstm_num_threads=1, total time taken = 3.95 seconds
With lstm_num_threads=4, total time taken = 1.4 seconds

On the other hand, here are the numbers with OpenMP:
OMP_THREAD_LIMIT not set, total time taken = 3.59 seconds
OMP_THREAD_LIMIT=4, total time taken = 3.57 seconds
OMP_THREAD_LIMIT=1, total time taken = 4.19 seconds

As we can observe, this branch with lstm_num_threads set to 4 performs far better than the OpenMP multithreading currently supported (1.4 seconds versus 3.57 seconds with four threads each). Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.
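To illustrate the idea of recognizing words independently on several threads, here is a minimal sketch. It is not the code from this branch: Word, RecognizeWords and the recognize callback are hypothetical names, and the only assumption is that each word can be processed without touching shared state.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one word image; not a Tesseract type.
struct Word {
  std::string pixels;
};

// Runs `recognize` on every word using `num_threads` worker threads.
// Each worker claims the next unprocessed index from a shared atomic counter,
// so no two workers handle the same word and no lock is needed.
// Assumption: `recognize` is safe to call concurrently.
std::vector<std::string> RecognizeWords(
    const std::vector<Word>& words, int num_threads,
    const std::function<std::string(const Word&)>& recognize) {
  assert(num_threads >= 1);
  std::vector<std::string> results(words.size());
  std::atomic<std::size_t> next{0};

  auto worker = [&]() {
    for (;;) {
      const std::size_t i = next.fetch_add(1);
      if (i >= words.size()) return;
      results[i] = recognize(words[i]);  // each slot is written by one thread only
    }
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < num_threads; ++t) pool.emplace_back(worker);
  for (auto& th : pool) th.join();  // results are complete once all workers finish
  return results;
}
```

Results are stored by word index, so the output order matches the input order regardless of which thread finishes first.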

stweil commented 1 week ago

Many thanks for this nice contribution.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways? If a command line argument is desired (like in the case of --dpi), I think that there might be more user friendly variants. Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

Maybe we could also extend the command line syntax to have --PARAMETER VALUE as an alternative for -c PARAMETER=VALUE for any Tesseract parameter.

stweil commented 1 week ago

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

jkarthic commented 1 week ago

Many thanks for this nice contribution.

And many thanks to you for reviewing this patiently.

With this pull request users have the choice of using the new argument --lstm-num-threads N or setting the new parameter with -c lstm-num-threads=N. Do we need both ways?

lstm_num_threads is an init-time parameter. The LSTMRecognizer instances are created during init, so setting this new parameter with -c lstm-num-threads=N will not work: that sets the variable only after init is done.
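The ordering problem can be shown with a toy model. This is only a sketch under the assumption that the thread count is read once, when the recognizer instances are created; Engine, Recognizer and SetNumThreads are hypothetical names, not Tesseract classes or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a per-thread recognizer; not a Tesseract class.
struct Recognizer {};

class Engine {
 public:
  // The parameter itself can be changed at any time...
  void SetNumThreads(int n) { num_threads_ = n; }

  // ...but it is only read here, when the recognizer instances are created.
  void Init() {
    assert(num_threads_ >= 1);
    recognizers_.clear();
    for (int i = 0; i < num_threads_; ++i) {
      recognizers_.push_back(std::make_unique<Recognizer>());
    }
  }

  std::size_t RecognizerCount() const { return recognizers_.size(); }

 private:
  int num_threads_ = 1;
  std::vector<std::unique_ptr<Recognizer>> recognizers_;
};
```

In this model, calling SetNumThreads(4) before Init() yields four recognizers, while calling SetNumThreads(8) afterwards changes the stored value but leaves the four existing instances untouched.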

Although --lstm-num-thread describes the technical implementation correctly, it is a lengthy argument which maybe requires too much explanation. Do we expect more --xxx-num-thread arguments in the future? Or would --threads be sufficient?

When I tested tesseract with a psm of 3 (which is the default for tesseract.exe), page segmentation took significantly more time than the actual LSTM recognition. For example, in one of my tests, page segmentation took ~7 seconds and LSTM recognition ~3 seconds, for a total of ~10 seconds. Users running with the default psm parameter should not expect the entire 10 seconds to run in multiple threads. In this case, the major part (~7 seconds) runs single-threaded and only the minor part (~3 seconds) is multithreaded. Hence I thought a longer name sets the user's expectation correctly: only a portion of tesseract runs multithreaded. Also, there are other thread-count variables related to OpenMP in the code that are named generically, such as kNumThreads, __num_threads and num_threads. Naming this lstm_num_threads also marks it as a separate variable, not to be confused with the OpenMP thread counts.
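As a rough estimate of what to expect in that example, the arithmetic can be sketched as below. It assumes the LSTM part scales perfectly with the thread count, which real runs will not quite reach; EstimateTotalSeconds is a hypothetical helper, not part of tesseract.

```cpp
#include <cassert>

// Estimated wall-clock time when only part of the work is spread over threads.
// Assumption: the parallel part scales ideally (no overhead, no contention).
double EstimateTotalSeconds(double serial_seconds, double parallel_seconds,
                            int num_threads) {
  assert(num_threads >= 1);
  return serial_seconds + parallel_seconds / num_threads;
}
```

With ~7 seconds of single-threaded segmentation and ~3 seconds of LSTM work on 4 threads, the estimate is 7 + 3/4 = 7.75 seconds, so the overall speedup is only about 1.3x even though the LSTM part itself runs four times faster.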

jkarthic commented 1 week ago

Setting lstm_num_threads equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it's only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes because then all processing steps use 100 % of the available resources.

Totally agreed. This is meant for latency-sensitive real-time applications, with OCR probably running on the consumer's device itself.

jkarthic commented 7 hours ago

@stweil I observed a crash in the earlier code: WERD_RES objects freed by one thread were still being used by another thread that was iterating through the WERD_RES singly linked list. To fix this issue, I have modified the WERD_RES linked list to use shared pointers instead of raw pointers, so that the lifetime of the objects is managed automatically. I have also added mutex protection around the PAGE_RES_IT functions that modify this list, in order to avoid race conditions. Please take a look at the modifications whenever you get some time.
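The general technique can be sketched as follows. This is a simplified illustration, not the actual change: WordNode and WordList are hypothetical stand-ins for WERD_RES and the list it lives in, and the sketch assumes a node's next pointer is never modified once the node has been linked in.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical node type standing in for WERD_RES; not the real Tesseract class.
struct WordNode {
  std::string text;
  std::shared_ptr<WordNode> next;
};

class WordList {
 public:
  // Inserts at the front. The mutex serializes all changes to head_.
  void PushFront(std::string text) {
    auto node = std::make_shared<WordNode>();
    node->text = std::move(text);
    std::lock_guard<std::mutex> lock(mutex_);
    node->next = head_;  // set before the node becomes visible to readers
    head_ = std::move(node);
  }

  // Unlinks the first node. The node is destroyed only when the last
  // shared_ptr to it goes away, so a thread still holding it stays safe.
  void PopFront() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (head_) head_ = head_->next;
  }

  // Returns an owning reference to the current head (may be null).
  std::shared_ptr<WordNode> Head() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return head_;
  }

 private:
  mutable std::mutex mutex_;
  std::shared_ptr<WordNode> head_;
};
```

A thread that obtained a node through Head() can keep reading it and following its next pointer even after another thread calls PopFront(), which is the use-after-free scenario that raw pointers would leave open.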

egorpugin commented 6 hours ago

I suggest using the previous version as a base.

jkarthic commented 5 hours ago

Now it is much much worse.

@egorpugin I am not sure, if I understand your comment here. Could you please elaborate what is "much much worse"?

egorpugin commented 5 hours ago

@egorpugin I am not sure, if I understand your comment here. Could you please elaborate what is "much much worse"?

More complex code.
A lot of sync.
Much harder to review.
Most likely a 'no go' in current state.

You need to provide a very detailed description of:
1) The algorithm. How does it work? Is it possible to sync less?
2) The changes in files. I see new types and mutex locks in some existing functions.
For an example of how this can be described, see gcc commit messages, e.g. https://github.com/gcc-mirror/gcc/commit/5185274c76cc3b68a38713273779ec29ae4fe5d2 (bottom part of the commit message).

tesseract-ocr / tesseract

Run LSTM recognition in multiple threads #4275