Closed ychtioui closed 2 years ago
Adding more consoles which run R with Tesseract until all CPU cores are fully used is one way to get maximum throughput.
Hi, I have a similar but slightly different problem here. I am using Python 3.7 with Tesseract 3.02, and I am new to Tesseract. I used the pytesseract.image_to_string function, and it took a very long time on the first run.
'Cuz my associate professor
at college advises the club.
Duration: 259.72785544395447
'Cuz my associate professor
at college advises the club.
Duration: 0.9130520820617676
Can anyone please explain why this happened? This is my 2nd day using Tesseract. Thank you.
from PIL import Image
import pytesseract
import time
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"
start_time = time.time()
img = Image.open("Pic5.png")
print("start")
result = pytesseract.image_to_string(img, config='--tessdata-dir "C:/Program Files (x86)/Tesseract-OCR/tessdata"')
print(result)
duration = time.time() - start_time
print("\nDuration:", duration)
@WaltPeter, you are obviously running on Windows, so anything can happen in the background and delay your test, for example AV scans, disk defragmentation or software updates. Try running your test many times to see how times vary.
A Python program name.py is compiled into name.pyc on the first run, but that should not take more than a second. You can remove all *.pyc files to force recompilation.
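As an illustration, clearing stale bytecode can be scripted; a minimal sketch (note that Python 3 keeps .pyc files inside __pycache__ directories, a detail not mentioned above):

```python
import pathlib

def clear_pyc(root="."):
    """Delete compiled bytecode so the next run recompiles from source."""
    removed = 0
    # Python 3 stores .pyc files under __pycache__; rglob finds them all.
    for pyc in pathlib.Path(root).rglob("*.pyc"):
        pyc.unlink()
        removed += 1
    return removed
```

After clearing, the next run pays the (usually sub-second) compilation cost again, which is far too small to explain a 259-second first run.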
So, guys... how do we speed things up? Any practical ideas?
I get the same issue with Tesseract 4.0.0 beta on my CentOS 7.3 setup. It takes 0.91 seconds to detect one character. Any updates on this issue?
Just a detail, but I recommend using OMP_THREAD_LIMIT=1 so that Tesseract runs in single-thread mode. By default, Tesseract runs in multithreaded mode, but apparently this just burns CPU cycles without benefit. Here is an example on a 4-core machine:
root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=1
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12
real 0m34.300s
user 0m33.682s
sys 0m0.617s
root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=4
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12
real 0m31.943s
user 1m19.374s
sys 0m1.346s
Consumes three times more CPU while not even 10% faster.
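The same single-thread setting can be applied when driving Tesseract from Python, since the tesseract subprocess inherits its parent's environment; a minimal sketch (the image path is a placeholder):

```python
import os
import subprocess

# Limit OpenMP to one thread in the child's environment so the
# tesseract subprocess does not spawn extra worker threads.
env = dict(os.environ, OMP_THREAD_LIMIT="1")

cmd = ["tesseract", "page.tif", "out"]  # "page.tif" is a placeholder
# subprocess.run(cmd, env=env, check=True)  # uncomment with a real image
print(env["OMP_THREAD_LIMIT"])
```

Exporting the variable once in the shell before starting Python has the same effect.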
Yes, I can confirm that. For mass production I'd even build an executable without OpenMP support (configure --disable-openmp ...) to remove the remaining overhead.
Sounds like we should change the build defaults if OpenMP is providing no real benefit.
I have no idea how the multithreading takes place, but I have a feeling it's too low-level, resulting in more overhead than gains. If the document's pages were processed in parallel as a whole, that would probably be a real boost!
I turned off default OpenMP usage for CMake. A patch for autotools is welcome (I won't be able to get to my Linux machine soon).
I'm not speculating anything. The reality is that Tesseract takes more than 3 seconds to read the above image that I initially attached (I use VS2010). When I use the console test application that comes with Tesseract, it takes about the same time (more than 3 seconds).
Anyone would speculate a lot in 3 seconds
I have more than 20 years in machine vision and have used several OCR engines in the past. Actually, I have one (in-house) that reads the same image in less than 100 ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text).
The Tesseract database is not that large. Most of the techniques used by Tesseract are quite standard in the OCR area (page layout, line extraction, possible character extraction, word forming, and then several phases of classification). However, Tesseract manages memory usage very badly. Why else would it take more than 3 seconds to read a typical text image?
Please, if you're not bringing any meaningful ideas to my post, spare me your comment.
Hi @ychtioui,
I am in the same situation as you. I have many single-line text images, and I would like to know if you can suggest a fast and good OCR engine like the one you mention. Thanks in advance.
Use multi-threading in your application. Initialize N instances of TessBaseAPI. N should be the number of CPU cores. Each instance should handle a different image.
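This advice can be sketched in Python with one worker process per CPU core, each invoking its own single-threaded tesseract CLI process (a sketch, not TessBaseAPI directly; file names are placeholders):

```python
import os
import subprocess
from concurrent.futures import ProcessPoolExecutor

def ocr_one(image_path):
    # Each worker runs its own tesseract process, pinned to one thread.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],  # "stdout" prints the text
        capture_output=True, text=True, env=env, check=True,
    )
    return result.stdout

def ocr_all(images):
    # One worker per CPU core; each instance handles a different image.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(ocr_one, images))

# Example (requires the tesseract binary and real image files):
# texts = ocr_all(["page1.png", "page2.png", "page3.png"])
```

Throughput then scales with the number of cores instead of relying on OpenMP inside a single process.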
Can you explain this @amitdo
Yes, I can confirm that. For mass production I'd even build an executable without OpenMP support (configure --disable-openmp ...) to remove the remaining overhead.
@stweil Apart from disabling OpenMP, would you suggest any other changes to increase speed?
We noticed some time ago that the Linux kernel version can have a huge effect on the OCR performance, namely more than 20 % slower in some versions because of workarounds for SPECTRE and MELTDOWN. Those workarounds can be disabled using kernel parameters. I expect similar effects for other operating systems, too.
@stweil: I'd like to ask whether this is viable in a cloud environment, since I'd be deploying on the cloud, and I don't know whether it can be done or not. Also, referencing zdenko's comment on OpenCL for more speed: while running ./configure --help, I noticed --enable-opencl enable opencl build [default=no]. Do you think that would help too?
My personal experience is that Tesseract runs best on real hardware, virtual machines / cloud environments are often slower. There is initial experimental support for OpenCL in the Tesseract code, but as it is only initial and experimental, I cannot recommend it unless you want to work on improving it. You won't see better performance with the current code.
I didn't get to this one in the last comment, but there is also the optional package --with-tensorflow support TensorFlow [default=check], and I am guessing that it has to do with the LSTM network, but is it for CUDA-based GPU usage?
Indeed, that would be another way to get faster OCR, but it requires special traineddata model files for TensorFlow. As far as I know, nobody has ever created such a file and used it with Tesseract + TensorFlow.
I just ran some simple speed tests in Python with the current code (Intel Core i7-6600U CPU @ 2.60GHz, 2801 MHz, 2 cores, 4 threads; Windows 64-bit), and here are the results (durations in seconds):
Optimization | tessdata_best | tessdata_fast | tessdata
---|---|---|---
None | 48.9555 | 8.0645 | 13.3477
AVX, AVX2, FMA, SSE | 19.0863 | 3.3139 | 4.9020
Improvement None/AVX | 156% | 143% | 172%
Additional: | | |
None + no_invert | 35.4278 | 5.0341 | 11.4808
AVX, AVX2, FMA, SSE + no_invert | 13.8921 | 2.7461 | 3.6696
Improvement AVX/AVX no_invert | 37% | 21% | 34%
UPDATE 2019-10-06: recent tesseract code allows to use option "-c tessedit_do_invert=0" which brings extra speed.
I used the image from this issue, eng language data, no OpenMP, without specifying any parameters (i.e. default oem, psm, ...); the duration is calculated as the arithmetic average of 5 runs of the testing code.
Interestingly, there is no big difference in OCR quality between the tessdata_fast, tessdata and tessdata_best models (for this image).
import timeit
import os

tess_exe = r"f:\Project-Personal\tesseract\build.clang_no_avx\bin\tesseract.exe"
test_image = r"f:\Project-Personal\tesseract.test\i263_speed.jpg"
os.environ['TESSDATA_PREFIX'] = r"f:\Project-Personal\tessdata_best\tessdata"
code_to_test = """
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"{}"
pytesseract.image_to_string(r"{}", lang='eng')
"""
# arithmetic average of 5 runs
elapsed_time = timeit.timeit(code_to_test.format(tess_exe, test_image), number=5) / 5
print("\nDuration:", elapsed_time)
The Linux kernel and kernel parameters also have a significant effect on the performance of Tesseract (both for recognition and training). Especially the first kernels which tried to fix Spectre and similar CPU bugs make it really slow. I recently noticed that Tesseract with Debian GNU/Linux (testing / bullseye) is faster when running in the Windows Subsystem for Linux. Running on a Linux kernel with the default settings is slightly slower than running on the Windows kernel.
With the kernel parameters from https://make-linux-fast-again.com/ Tesseract gets faster by about 10 to 20% and is then faster than in the Windows Subsystem for Linux.
@zdenop How do you enable the AVX, AVX2, FMA or SSE optimizations?
They are used automatically if your CPU supports them.
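You can check which SIMD extensions your build detected; recent Tesseract versions report them in the --version output (lines like "Found AVX2"). A small sketch that is safe to run even when the binary is absent:

```python
import shutil
import subprocess

# Recent Tesseract builds print the SIMD extensions they detected
# with --version, e.g. "Found AVX2", "Found FMA", "Found SSE4.1".
if shutil.which("tesseract"):
    proc = subprocess.run(["tesseract", "--version"],
                          capture_output=True, text=True)
    print(proc.stdout or proc.stderr)
else:
    print("tesseract binary not found on PATH")
```

If the expected extensions are missing from the output, the binary was built without them or the CPU does not provide them.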
For texts without inverted regions, significantly faster OCR is possible when tesseract is called with -c tessedit_do_invert=0; see the timing results above.
Is it possible to set -c tessedit_do_invert=0 at runtime, or do we need to build Tesseract with this option?
It's a runtime option:
tesseract in.png out -c tessedit_do_invert=0
Are you aware of whether or not the pytesseract has that option available?
I'm not familiar with pytesseract.
Are you aware of whether or not the pytesseract has that option available?
The answer is on the pytesseract homepage:
config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
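So, assuming pytesseract is installed, the flag from the timing results above would be passed like this (the actual call is left commented out because it needs a real image and a tesseract binary):

```python
# pytesseract's config parameter simply appends extra flags to the
# tesseract command line, so any runtime -c variable can go here.
config = "-c tessedit_do_invert=0"

# With pytesseract installed, the call would look like:
# import pytesseract
# text = pytesseract.image_to_string("page.png", config=config)
print(config)
```

The same mechanism works for any other command-line option, e.g. "--psm 6".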
Is there any way to use Tesseract with multi-threading in an Android project?
I managed to get faster results by upgrading Tesseract from 4.x to 5.x (can't remember the exact versions). We also found out that our production servers were using the 32-bit version, so we installed the 64-bit version instead. Time to analyze went from 20+ seconds to 7-10, which is perfectly acceptable since we also added 2 more servers.
Tesseract 5.0.0 should be faster than 4.1.x.
@zdenop, can you update your benchmarks above?
For the tessdata model, you can add two tests using just one of the OCR engines: test 1: oem 0 (legacy only); test 2: oem 1 (LSTM only).
Timing test with lstm_squashed_test on Debian bullseye, AMD EPYC 7413, Tesseract Git main, -O2:
# clang, default kernel options, configure --disable-shared --disable-openmp
[ OK ] LSTMTrainerTest.TestSquashed (22778 ms)
[ OK ] LSTMTrainerTest.TestSquashed (22764 ms)
# g++, default kernel options, configure --disable-shared --disable-openmp
[ OK ] LSTMTrainerTest.TestSquashed (23722 ms)
[ OK ] LSTMTrainerTest.TestSquashed (23739 ms)
# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
[ OK ] LSTMTrainerTest.TestSquashed (22984 ms)
[ OK ] LSTMTrainerTest.TestSquashed (23062 ms)
# g++, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
[ OK ] LSTMTrainerTest.TestSquashed (23834 ms)
[ OK ] LSTMTrainerTest.TestSquashed (23708 ms)
# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared
[ OK ] LSTMTrainerTest.TestSquashed (22844 ms)
[ OK ] LSTMTrainerTest.TestSquashed (22963 ms)
So with a recent Linux kernel, the "optimized" kernel options no longer seem to have an effect on performance.
Nor does OpenMP make that training test faster. It even has a huge negative effect because it consumes much more CPU time:
# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
time ./lstm_squashed_test
[...]
real 0m23.114s
user 0m23.049s
sys 0m0.064s
# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared
time ./lstm_squashed_test
[...]
real 0m22.972s
user 1m31.495s
sys 0m0.308s
Using -O3 has no effect in my test, but adding -ffast-math increases the performance further:
# clang, configure --disable-shared --disable-openmp
[ OK ] LSTMTrainerTest.TestSquashed (21793 ms)
For OpenMP, you can try to limit the number of threads it uses to n_cpu_cores-1.
Edit: With your CPU, you can try limiting it to a small number of threads, say 3, and then increase/decrease the number of threads.
The test was running on a CPU with 24 cores. Using more than one core always produces a huge waste of CPU time.
# OMP_THREAD_LIMIT=1
real 0m25.105s
user 0m25.048s
sys 0m0.056s
# OMP_THREAD_LIMIT=2
real 0m25.637s
user 0m51.032s
sys 0m0.188s
# OMP_THREAD_LIMIT=3
real 0m23.279s
user 1m9.493s
sys 0m0.288s
# OMP_THREAD_LIMIT=4 or larger
real 0m23.008s
user 1m31.521s
sys 0m0.348s
Using more than 1 CPU in the same address space always has coordination overhead, and more than ~4 is a complete waste. Boxes with 24 CPUs are made more for running VMs. Something like 2 x 6C/6T serving 24 VMs and 400 websites works (with disk I/O as the bottleneck).
For tasks at 100% CPU, I would first profile them to find hotspots or low-hanging fruit. Maybe change to the much faster TensorFlow. Are there benchmarks showing how much faster TensorFlow is?
Tuning the code itself is more time-consuming, and with well-crafted code you can maybe gain something in the range of 10%.
@amitdo: what about creating a wiki page related to speed? IMO it would be more appropriate than discussing/updating a 5-year-old thread...
@zdenop,
Wiki page or a page in tessdoc? Benchmarks? Performance comparison?
I started https://github.com/tesseract-ocr/tessdoc/blob/main/Benchmarks.md
Still missing several tests (4.1.3 with AVX, -c tessedit_do_invert=0, maybe different OEM, OCR quality...).
Thanks Zdenko.
Conclusions:
@stweil
If many pages have to be processed, it is better to use single threaded Tesseract and run several Tesseract processes in parallel.
Same here. After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. The dev, manisandro, was very helpful and led me to a quick-and-dirty CLI solution for running on a single thread. It works wonderfully for me. In a terminal, type:
export OMP_THREAD_LIMIT=1
If you want to check that you actually are running on one thread, type:
echo $OMP_THREAD_LIMIT
Then run gImageReader:
gimagereader-gtk
Et voilà :o)
I integrated Tesseract C/C++, version 3.x, to read English OCR on images.
It's working pretty well, but very slowly. It takes close to 1000 ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.
I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.
Any ideas on how to make Tesseract read faster? Thanks.