tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

good accuracy but too slow, how to improve Tesseract speed #263

Closed ychtioui closed 2 years ago

ychtioui commented 8 years ago

I integrated Tesseract C/C++, version 3.x, to read English OCR on images.

It’s working pretty well, but it is very slow: it takes close to 1000 ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.

Is there any way to make it faster? Any ideas on how to make Tesseract read faster? Thanks.

stweil commented 6 years ago

Adding more consoles which run Tesseract until all CPU cores are fully used is one way to get maximum throughput.
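
For illustration, a minimal sketch of that approach in Python, assuming pytesseract, a tesseract binary on the PATH, and a placeholder list of image files; OMP_THREAD_LIMIT=1 keeps each tesseract process single-threaded, so the parallelism comes from running several processes at once:

import os
from concurrent.futures import ThreadPoolExecutor

import pytesseract

# Keep each tesseract process single-threaded; the environment variable is
# inherited by the tesseract processes that pytesseract launches.
os.environ["OMP_THREAD_LIMIT"] = "1"

def ocr(path):
    # pytesseract starts a separate tesseract process for every call,
    # so a thread pool is enough to keep all CPU cores busy.
    return path, pytesseract.image_to_string(path)

images = ["00060.jpg"]  # extend with as many images as you have

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    for path, text in pool.map(ocr, images):
        print(path, text)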

WaltPeter commented 6 years ago

Hi, I have a similar but slightly different problem here. I am using Python 3.7 with Tesseract 3.02, and I am new to Tesseract. I used the pytesseract.image_to_string function, and it took a very long time on the first run.

result for first run:

'Cuz my associate professor 
at college advises the club. 

Duration: 259.72785544395447

result for second run:

'Cuz my associate professor
at college advises the club.

Duration: 0.9130520820617676

Can anyone please explain why this happens? This is my second day using Tesseract. Thank you.


complete python code:

from PIL import Image
import time
import pytesseract

pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"

start_time = time.time()

img = Image.open("Pic5.png")
print("start")
result = pytesseract.image_to_string(img, config='--tessdata-dir "C:/Program Files (x86)/Tesseract-OCR/tessdata"')
print(result)

duration = time.time() - start_time
print("\nDuration:", duration)
stweil commented 6 years ago

@WaltPeter, you are obviously running on Windows, so anything can happen in the background and delay your test, for example antivirus scans, disk defragmentation or software updates. Try running your test many times to see how the times vary.

A Python program name.py is compiled on the first run into name.pyc, but that should not take more than a second. You can remove all *.pyc files to force a new compilation.
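
For example, a minimal sketch of such a repeated measurement (reusing the file and executable paths from the snippet above); a one-off startup cost then shows up as a single outlier instead of dominating the result:

from PIL import Image
import time
import pytesseract

# Same executable path as in the snippet above.
pytesseract.pytesseract.tesseract_cmd = "C:/Program Files (x86)/Tesseract-OCR/tesseract.exe"

img = Image.open("Pic5.png")
for run in range(5):
    start = time.perf_counter()
    pytesseract.image_to_string(img)
    print("run", run + 1, ":", round(time.perf_counter() - start, 3), "s")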

burinov commented 6 years ago

So, guys... how do we speed things up? Any practical ideas?

Wesley-Li commented 5 years ago

I get the same issue with Tesseract 4.0.0 beta on my CentOS 7.3 setup. It takes 0.91 seconds to detect one character. Any updates on this issue?

dagnelies commented 5 years ago

Just a detail, but I recommend setting OMP_THREAD_LIMIT=1 so that tesseract runs in single-threaded mode.

By default, tesseract runs multithreaded, but apparently this just burns CPU cycles without benefit. Here is an example on a 4-core machine:

root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=1
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12

real    0m34.300s
user    0m33.682s
sys     0m0.617s
root@ubuntu-16gb-nbg1-1:/fv# export OMP_THREAD_LIMIT=4
root@ubuntu-16gb-nbg1-1:/fv# time tesseract 2.tif 2.txt
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
...
Page 12

real    0m31.943s
user    1m19.374s
sys     0m1.346s

It consumes well over twice the CPU time while being not even 10% faster.

stweil commented 5 years ago

Yes, I can confirm that. For mass production I'd even build an executable without OpenMP support (configure --disable-openmp ...) to remove the remaining overhead.

tfmorris commented 5 years ago

Sounds like we should change the build defaults if OpenMP is providing no real benefit.

dagnelies commented 5 years ago

I have no idea how the multithreading takes place, but I have a feeling it is too low-level, resulting in more overhead than gain. If the document's pages were processed in parallel as whole pages, that would probably be a real boost!

zdenop commented 5 years ago

I turned off default OpenMP usage for CMake. A patch for autotools is welcome (I have no way to get to my Linux machine soon).

noyessie commented 5 years ago

I'm not speculating anything. The reality is that Tesseract takes more than 3 seconds to read the image that I initially attached (I use VS2010). When I use the console test application that comes with Tesseract, it takes about the same time (more than 3 seconds).

Anyone could speculate a lot in 3 seconds.

I have more than 20 years in machine vision and have used several OCR engines in the past. I actually have one, in-house, that reads the same image in less than 100 ms, but our engine is designed for reading a single line of text (i.e. it returns a single line of text).

The Tesseract database is not that large. Most of the techniques used by Tesseract are quite standard in the OCR area (page layout, line extraction, candidate character extraction, word forming, and then several phases of classification). However, Tesseract manages memory usage very badly. Why else would it take more than 3 seconds to read a typical text image?

Please, if you're not bringing any meaningful ideas to my posting, spare me your comment.

Hi @ychtioui,

I am in the same situation as you: I have many single-line text images, and I would like to know if you can suggest a fast and accurate OCR engine like the one you mention above.

Thanks in advance.

sirius0503 commented 4 years ago

Use multi-threading in your application. Initialize N instances of TessBaseAPI. N should be the number of CPU cores. Each instance should handle a different image.

Can you explain this, @amitdo?
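
For illustration, a minimal sketch of that idea, assuming the third-party tesserocr wrapper (which exposes TessBaseAPI to Python as PyTessBaseAPI) and placeholder image file names; each worker process holds its own engine instance and handles its own images:

from multiprocessing import Pool, cpu_count

from tesserocr import PyTessBaseAPI

api = None  # one TessBaseAPI instance per worker process

def init_worker():
    global api
    api = PyTessBaseAPI(lang="eng")

def ocr(image_path):
    # Each worker reuses its own engine instance for the images it receives.
    api.SetImageFile(image_path)
    return image_path, api.GetUTF8Text()

if __name__ == "__main__":
    images = ["page_%03d.png" % i for i in range(1, 13)]  # placeholder file names
    with Pool(processes=cpu_count(), initializer=init_worker) as pool:
        for path, text in pool.map(ocr, images):
            print(path, len(text), "characters")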

sirius0503 commented 4 years ago

Yes, I can confirm that. For mass production I'd even build an executable without OpenMP support (configure --disable-openmp ...) to remove the remaining overhead.

@stweil Apart from disabling OpenMP, would you suggest any other changes to increase speed?

stweil commented 4 years ago

We noticed some time ago that the Linux kernel version can have a huge effect on OCR performance, namely more than 20 % slower in some versions because of the workarounds for Spectre and Meltdown. Those workarounds can be disabled using kernel parameters. I expect similar effects for other operating systems, too.

sirius0503 commented 4 years ago

@stweil: I'd like to ask whether this is viable in a cloud environment, since I'll be deploying to the cloud and don't know whether it can be done. Also, referencing Zdenko's comment on OpenCL for more speed: when running ./configure --help, I noticed "--enable-opencl  enable opencl build [default=no]". Do you think that would help too?

stweil commented 4 years ago

My personal experience is that Tesseract runs best on real hardware; virtual machines and cloud environments are often slower. There is initial, experimental support for OpenCL in the Tesseract code, but as it is only initial and experimental, I cannot recommend it unless you want to work on improving it. You won't see better performance with the current code.

sirius0503 commented 4 years ago

I didn't mention this one in the last comment, but there is also "--with-tensorflow  support TensorFlow [default=check]" as an optional package. I am guessing it has to do with the LSTM network, but is it for CUDA-based GPU usage?

stweil commented 4 years ago

Indeed that would be another way to get faster OCR, but it requires special traineddata model files for Tensorflow. As far as I know nobody has ever created such a file and used it with Tesseract + Tensorflow.

zdenop commented 4 years ago

I just ran some simple speed tests in Python with the current code (Intel Core i7-6600U CPU @ 2.60GHz, 2801 MHz, 2 cores, 4 threads; Windows 64-bit), and here are the results (durations in seconds):

Optimization                      tessdata_best  tessdata_fast  tessdata
None                                    48.9555         8.0645   13.3477
AVX, AVX2, FMA, SSE                     19.0863         3.3139    4.9020
Improvement None/AVX                       156%           143%      172%

Additional:
None + no_invert                        35.4278         5.0341   11.4808
AVX, AVX2, FMA, SSE + no_invert         13.8921         2.7461    3.6696
Improvement AVX/AVX no_invert               37%            21%       34%

UPDATE 2019-10-06: recent tesseract code allows using the option "-c tessedit_do_invert=0", which brings extra speed.

I used the image from this issue, eng language data, no OpenMP, and did not specify any other parameters (e.g. default oem, psm, ...); each duration is the arithmetic average of 5 runs of the test code below.

Interestingly, there is no big difference in OCR quality between the tessdata_fast, tessdata and tessdata_best models (for this image).

import timeit
import os

# Paths (adjust to your setup): tesseract binary, test image and model directory.
tess_exe = r"f:\Project-Personal\tesseract\build.clang_no_avx\bin\tesseract.exe"
test_image = r"f:\Project-Personal\tesseract.test\i263_speed.jpg"
os.environ['TESSDATA_PREFIX'] = r"f:\Project-Personal\tessdata_best\tessdata"

code_to_test = """
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"{}"
pytesseract.pytesseract.image_to_string(r"{}", lang='eng')
"""

# Arithmetic average of 5 runs, in seconds.
elapsed_time = timeit.timeit(code_to_test.format(tess_exe, test_image), number=5) / 5
print("\nDuration:", elapsed_time)
stweil commented 4 years ago

The Linux kernel and kernel parameters also have a significant effect on the performance of Tesseract (both for recognition and training). Especially the first kernels which tried to fix Spectre and similar CPU bugs made it really slow. I recently noticed that Tesseract on Debian GNU/Linux (testing / bullseye) is faster when running in the Windows Subsystem for Linux: running on a Linux kernel with the default settings is slightly slower than running on the Windows kernel.

With the kernel parameters from https://make-linux-fast-again.com/, Tesseract gets faster by about 10 to 20 % and is then faster than in the Windows Subsystem for Linux.

PratapMehra commented 4 years ago

@zdenop How do I enable the AVX, AVX2, FMA or SSE optimizations?

stweil commented 4 years ago

They are used automatically if your CPU supports them.

stweil commented 4 years ago

For images without inverted text, significantly faster OCR is possible when tesseract is called with -c tessedit_do_invert=0; see the timing results above.

ViniciusLelis commented 4 years ago

Is it possible to set -c tessedit_do_invert=0 at runtime, or do we need to build Tesseract with this option?

amitdo commented 4 years ago

It's a runtime option:

tesseract in.png out -c tessedit_do_invert=0

ViniciusLelis commented 4 years ago

Are you aware of whether or not pytesseract has that option available?

amitdo commented 4 years ago

I'm not familiar with pytesseract.

stweil commented 4 years ago

Are you aware of whether or not pytesseract has that option available?

The answer is on the pytesseract homepage:

config String - Any additional custom configuration flags that are not available via the pytesseract function. For example: config='--psm 6'
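
For example, a minimal sketch passing the variable through pytesseract's config string (file name is a placeholder):

import pytesseract
from PIL import Image

# Any extra command-line flags go into the config string.
text = pytesseract.image_to_string(Image.open("in.png"),
                                   config="-c tessedit_do_invert=0")
print(text)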

skydev66 commented 3 years ago

Is there any way to use Tesseract with multithreading in an Android project?

ViniciusLelis commented 3 years ago

I managed to get faster results by upgrading Tesseract from 4.x to 5.x (I can't remember the exact versions). I also found out that our production servers were using the 32-bit build, so we installed the 64-bit version instead. Analysis time went from 20+ seconds to 7-10 seconds, which is perfectly acceptable since we also added 2 more servers.

amitdo commented 2 years ago

Tesseract 5.0.0 should be faster than 4.1.x.

@zdenop, can you update your benchmarks above?

For the tessdata model, you can add two tests that use just one of the OCR engines: test 1 with --oem 0 (legacy engine only), test 2 with --oem 1 (LSTM engine only).
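
For reference, a minimal sketch of selecting a single engine at run time via pytesseract (image path reused from the benchmark above as a placeholder; --oem 0 needs a model that still contains the legacy engine):

import pytesseract

# --oem 0: legacy engine only, --oem 1: LSTM engine only.
legacy_text = pytesseract.image_to_string("i263_speed.jpg", config="--oem 0")
lstm_text = pytesseract.image_to_string("i263_speed.jpg", config="--oem 1")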

stweil commented 2 years ago

Timing test with lstm_squashed_test on Debian bullseye, AMD EPYC 7413, Tesseract Git main, -O2:

# clang, default kernel options, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (22778 ms)
[       OK ] LSTMTrainerTest.TestSquashed (22764 ms)

# g++, default kernel options, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (23722 ms)
[       OK ] LSTMTrainerTest.TestSquashed (23739 ms)

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (22984 ms)
[       OK ] LSTMTrainerTest.TestSquashed (23062 ms)

# g++, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (23834 ms)
[       OK ] LSTMTrainerTest.TestSquashed (23708 ms)

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared
[       OK ] LSTMTrainerTest.TestSquashed (22844 ms)
[       OK ] LSTMTrainerTest.TestSquashed (22963 ms)

So with a recent Linux kernel, "optimized" kernel options no longer seem to have an effect on the performance. Nor does OpenMP make that training test faster; it even has a huge negative effect because it consumes much more CPU time:

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared --disable-openmp
time ./lstm_squashed_test
[...]
real    0m23.114s
user    0m23.049s
sys 0m0.064s

# clang, kernel options https://make-linux-fast-again.com/, configure --disable-shared
time ./lstm_squashed_test
[...]
real    0m22.972s
user    1m31.495s
sys 0m0.308s

Using -O3 has no effect in my test, but adding -ffast-math increases the performance further:

# clang, configure --disable-shared --disable-openmp
[       OK ] LSTMTrainerTest.TestSquashed (21793 ms)
amitdo commented 2 years ago

For OpenMP, you can try to limit the number of threads it uses to n_cpu_cores - 1.

Edit: With your CPU, you can try limiting it to a small number of threads, say 3, and then increase or decrease the number of threads from there.

stweil commented 2 years ago

The test was running on a CPU with 24 cores. Using more than one core always produces a huge waste of CPU time.

# OMP_THREAD_LIMIT=1
real    0m25.105s
user    0m25.048s
sys 0m0.056s

# OMP_THREAD_LIMIT=2
real    0m25.637s
user    0m51.032s
sys 0m0.188s

# OMP_THREAD_LIMIT=3
real    0m23.279s
user    1m9.493s
sys 0m0.288s

# OMP_THREAD_LIMIT=4 or larger
real    0m23.008s
user    1m31.521s
sys 0m0.348s
wollmers commented 2 years ago

Using more than 1 CPU in the same address space always has coordination overhead, and more than ~4 is a complete waste. Boxes with 24 CPUs are made more for running VMs: something like 2 x 6C/6T serving 24 VMs and 400 websites works (with disk I/O as the bottleneck).

For tasks at 100% CPU I would first profile them to find hotspots or low-hanging fruit. Maybe change to the much faster TensorFlow. Are there benchmarks showing how much faster TensorFlow is?

Tuning the code itself is more time-consuming, and in the case of well-crafted code you might gain something in the range of 10%.

zdenop commented 2 years ago

@amitdo: what about creating a wiki page related to speed? IMO it would be more appropriate than discussing and updating a 5-year-old thread...

amitdo commented 2 years ago

@zdenop,

Wiki page or a page in tessdoc?

Benchmarks? Performance comparison?

zdenop commented 2 years ago

I started https://github.com/tesseract-ocr/tessdoc/blob/main/Benchmarks.md

Still missing: several tests (4.1.3 with AVX, -c tessedit_do_invert=0, maybe different OEMs, OCR quality, ...).

amitdo commented 2 years ago

Thanks Zdenko.

amitdo commented 2 years ago

Conclusions:

Freredaran commented 1 year ago

@stweil

If many pages have to be processed, it is better to use single threaded Tesseract and run several Tesseract processes in parallel.

Same here. After updating to Ubuntu 22.04, gImageReader became incredibly slow for me. Dev manisandro was very helpful and led me to a quick and dirty CLI solution for running on a single thread. It works wonderfully for me. In a terminal, type:

export OMP_THREAD_LIMIT=1

If you want to check that the variable is actually set, type:

echo $OMP_THREAD_LIMIT

Then run gImageReader:

gimagereader-gtk

Et voilà :o)