jkarthic opened 1 week ago
Many thanks for this nice contribution.

With this pull request users have the choice of using the new argument `--lstm-num-threads N` or setting the new parameter with `-c lstm-num-threads=N`. Do we need both ways? If a command line argument is desired (as in the case of `--dpi`), I think there might be more user-friendly variants. Although `--lstm-num-threads` describes the technical implementation correctly, it is a lengthy argument which may require too much explanation. Do we expect more `--xxx-num-threads` arguments in the future? Or would `--threads` be sufficient?

Maybe we could also extend the command line syntax to allow `--PARAMETER VALUE` as an alternative to `-c PARAMETER=VALUE` for any Tesseract parameter.
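To make the proposal concrete, here is a minimal sketch (not Tesseract's actual option parser; `ParseArgs` and its behavior are assumptions for illustration) of how a generic `--PARAMETER VALUE` form could be normalized into the same key/value pairs that `-c PARAMETER=VALUE` produces:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: treat any unrecognized "--name VALUE" pair as
// equivalent to "-c name=VALUE", mapping '-' to '_' since Tesseract
// parameter names use underscores.
std::map<std::string, std::string> ParseArgs(const std::vector<std::string>& argv) {
  std::map<std::string, std::string> params;
  for (size_t i = 0; i < argv.size(); ++i) {
    const std::string& arg = argv[i];
    if (arg == "-c" && i + 1 < argv.size()) {
      // Existing form: -c PARAMETER=VALUE
      const std::string& kv = argv[++i];
      size_t eq = kv.find('=');
      if (eq != std::string::npos)
        params[kv.substr(0, eq)] = kv.substr(eq + 1);
    } else if (arg.rfind("--", 0) == 0 && i + 1 < argv.size()) {
      // Proposed form: --PARAMETER VALUE
      std::string name = arg.substr(2);
      for (char& ch : name)
        if (ch == '-') ch = '_';
      params[name] = argv[++i];
    }
  }
  return params;
}
```

With this normalization, `--lstm-num-threads 4` and `-c lstm_num_threads=4` would end up in the same parameter map.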
> Setting `lstm_num_threads` equal to the number of cores in the processor will give the best performance.

Just to clarify this statement: it is only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes, because then all processing steps use 100 % of the available resources.
> Many thanks for this nice contribution.

And many thanks to you for reviewing this patiently.
> With this pull request users have the choice of using the new argument `--lstm-num-threads N` or setting the new parameter with `-c lstm-num-threads=N`. Do we need both ways?

`lstm_num_threads` is an init-time parameter. The `LSTMRecognizer` instances are created during init, so setting this parameter with `-c lstm-num-threads=N` will not work: it sets the variable after init is already done.
> Although `--lstm-num-threads` describes the technical implementation correctly, it is a lengthy argument which may require too much explanation. Do we expect more `--xxx-num-threads` arguments in the future? Or would `--threads` be sufficient?

When I tested Tesseract with a psm of 3 (the default for `tesseract.exe`), page segmentation took significantly more time than the actual LSTM recognition. For example, in one of my tests page segmentation took ~7 seconds and LSTM took ~3 seconds, for a total of ~10 seconds. Users running with the default psm should not expect the entire 10 seconds to run in multiple threads: the major part (~7 seconds) runs single-threaded and only a minor part (~3 seconds) is multi-threaded. Hence I thought the longer name sets the user expectation right, namely that only a portion of Tesseract will run multithreaded.
Also, there are other num-threads variables related to OpenMP inside the code which were named generically, such as `kNumThreads`, `__num_threads` and `num_threads`. Naming this one `lstm_num_threads` also marks it as a separate variable, not to be confused with the OpenMP thread counts.
> > Setting `lstm_num_threads` equal to the number of cores in the processor will give the best performance.
>
> Just to clarify this statement: it is only true for the OCR of a single page. For mass production it is still better to run (number of cores) parallel Tesseract processes, because then all processing steps use 100 % of the available resources.

Totally agreed. This is meant for latency-sensitive real-time applications, with OCR probably running on the consumer's device itself.
@stweil I observed a crash in the earlier code: `WERD_RES` objects freed by one thread were still being used by another thread iterating through the `WERD_RES` singly linked list. To fix this issue, I have modified the `WERD_RES` linked list to use shared pointers instead of raw pointers, so that the lifetime of the objects is managed automatically. I have also added mutex protection around the `PAGE_RES_IT` functions that modify this list, in order to avoid race conditions. Please take a look at the modifications whenever you get some time.
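The approach described above can be sketched in miniature (this is an illustrative model, not the actual `WERD_RES`/`PAGE_RES_IT` code): nodes held by `std::shared_ptr` stay alive for as long as any iterating thread still references them, while a mutex serializes structural modifications.

```cpp
#include <memory>
#include <mutex>

// Stand-in for a WERD_RES-style node in a singly linked list.
struct WordNode {
  int id = 0;
  std::shared_ptr<WordNode> next;  // shared ownership keeps live iterators safe
};

class WordList {
 public:
  void PushFront(int id) {
    std::lock_guard<std::mutex> lock(mu_);  // serialize structural changes
    auto node = std::make_shared<WordNode>();
    node->id = id;
    node->next = head_;
    head_ = node;
  }

  // An iterating thread copies the head pointer under the lock; even if
  // another thread later unlinks nodes, the copies it holds keep those
  // nodes alive instead of leaving the iterator with dangling pointers.
  std::shared_ptr<WordNode> Head() {
    std::lock_guard<std::mutex> lock(mu_);
    return head_;
  }

 private:
  std::mutex mu_;
  std::shared_ptr<WordNode> head_;
};
```

The design trade-off is the one implied in the comment: reference counting adds a small per-node cost, but removes an entire class of use-after-free crashes without requiring every reader to hold the list lock for the whole traversal.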
I suggest using the previous version as a base.
Now it is much, much worse.
@egorpugin I am not sure, if I understand your comment here. Could you please elaborate what is "much much worse"?
You need to provide a very detailed description of:

1. The algorithm. How does it work? Is it possible to synchronize less?
2. The changes in files. I see new types and mutex locks in some existing functions.

See an example of how this can be described in GCC commit messages, e.g. https://github.com/gcc-mirror/gcc/commit/5185274c76cc3b68a38713273779ec29ae4fe5d2 (bottom part of the commit message).
The init-time option `lstm_num_threads` should be used to set the number of LSTM threads. This ensures that word recognition can run independently in multiple threads, effectively utilizing multi-core processors.
Following are my test results for a sample screenshot.

- CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
- OS: Windows
- Compiler: MSVC 19.38.33130.0 (installed from Visual Studio 2022)
- Model: eng.traineddata from tessfast
- PSM: 6
Total time taken for the `Recognize` API call, built without OpenMP:

- With `lstm_num_threads=1`, total time taken = 3.95 seconds
- With `lstm_num_threads=4`, total time taken = 1.4 seconds

On the other hand, here are the numbers with OpenMP:

- `OMP_THREAD_LIMIT` not set, total time taken = 3.59 seconds
- `OMP_THREAD_LIMIT=4`, total time taken = 3.57 seconds
- `OMP_THREAD_LIMIT=1`, total time taken = 4.19 seconds
As we can observe, this branch with `lstm_num_threads` set to 4 performs much better than the OpenMP multithreading supported currently. Setting `lstm_num_threads` equal to the number of cores in the processor will give the best performance.
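Putting the measurements above into ratios makes the comparison explicit (the seconds are the figures quoted from this one test run, on this one machine):

```cpp
// Speedup implied by the quoted measurements: 3.95 s vs 1.4 s on this
// branch with lstm_num_threads=4, versus 3.59 s vs 3.57 s under the
// existing OpenMP path with OMP_THREAD_LIMIT=4.
double Speedup(double baseline_s, double threaded_s) {
  return baseline_s / threaded_s;
}
```

That is roughly a 2.8x speedup for this branch with 4 threads, while the OpenMP path shows essentially no speedup at all on the same workload.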