tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

good accuracy but too slow, how to improve Tesseract speed #263

Closed ychtioui closed 2 years ago

ychtioui commented 8 years ago

I integrated Tesseract C/C++, version 3.x, to read English OCR on images.

It’s working pretty well, but it's very slow. It takes close to 1000ms (1 second) to read the attached image (00060.jpg) on my quad-core laptop.

I’m not using the Cube engine, and I’m feeding only binary images to the OCR reader.

Is there any way to make it faster? Any ideas on how to make Tesseract read faster? Thanks. (attached: 00060.jpg)

stweil commented 8 years ago

You can already run 4 parallel instances of Tesseract on your quad-core machine; then it will read 4 images in about the same time. Introducing multi-threading would not help to reduce the time needed for OCR of many images. I am working on a project where OCR with Tesseract would take nearly 7 years on a single core, but luckily I can try to get many computers and use their cores, so the time can be reduced to a few days. Using compiler settings which are optimized for your CPU helps to gain a few percent, but I am afraid that for a larger gain different algorithms in Tesseract and its libraries would be needed.

ychtioui commented 8 years ago

Besides the OCR, we have other things that need to run on the other cores. I believe the main issue that's slowing down Tesseract is the way memory is managed. Too many memory allocations (new) and releases (delete or delete[]) slow down the reader.

In the past, I used a different OCR engine, and it allocated large buffers up front to store all the needed data (a large buffer of blobs, a large buffer of lines, a large buffer of words and their corresponding data); the buffers were simply indexed as the data was read from an image. The large buffers were allocated only once upon OCR engine initialization and released only once upon OCR engine shutdown. This memory management scheme was very efficient in terms of computation time.

Are there any settings for Tesseract that are known to be computationally intensive? Any tricks to speed up Tesseract?
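
This is not Tesseract code — just a minimal sketch of the preallocation scheme described above, with a hypothetical Blob record and pool. The idea is a single allocation at engine start-up, with indices handed out instead of per-object new/delete:

    #include <cstddef>
    #include <vector>

    // Hypothetical blob record; stands in for whatever per-blob data the engine keeps.
    struct Blob {
        int left, top, right, bottom;
        int area;
    };

    // Fixed-capacity pool: one allocation up front, reused for every image.
    class BlobPool {
    public:
        explicit BlobPool(std::size_t capacity) : storage_(capacity), next_(0) {}

        // Hands out the next free index, or -1 if the pool is exhausted.
        int Acquire() {
            if (next_ >= storage_.size()) return -1;
            return static_cast<int>(next_++);
        }

        Blob& At(int index) { return storage_[static_cast<std::size_t>(index)]; }

        // Reuse the pool for the next image without freeing memory.
        void Reset() { next_ = 0; }

    private:
        std::vector<Blob> storage_;  // released once, when the pool is destroyed
        std::size_t next_;
    };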

tfmorris commented 8 years ago

What evidence is your memory management speculation based on?

ychtioui commented 8 years ago

I'm not speculating anything. The reality is that Tesseract takes more than 3 seconds to read the image I initially attached above (I use VS2010). When I use the console test application that comes with Tesseract, it takes about the same time (more than 3 seconds).

Anyone could speculate a lot in 3 seconds.

I have more than 20 years in machine vision. I used several OCR engines in the past. Actually I have one -in house- that reads the same image in less than 100ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text).

The Tesseract database is not that large. Most of the techniques used by Tesseract are quite standard in the OCR area (page layout, line extraction, possible character extraction, word forming, and then several phases of classification). However, Tesseract manages memory usage very badly. Why else would it take more than 3 seconds to read a typical text image?

Please, if you're not bringing any meaningful ideas to my posting, just spare me your comments.

stweil commented 8 years ago

@ychtioui, as you have spent many years in machine vision, you know quite well that there are lots of reasons why programs can be slow. Memory management is just one of them. Even with a lot of experience, I'd start by running performance analyzers to investigate performance issues. Of course I can guess what might be possible reasons and try to improve the software based on those guesses, but improvements based on evidence (like the results of a performance analysis) are more efficient. Don't you think so, too? Do you have a chance to run a performance analysis?

zdenop commented 8 years ago

You can try to use the 3.02 version if you only need English. AFAIR it was significantly faster on my (old) computer.


ychtioui commented 8 years ago

I'm running version 3.02. I'm going through the different sections of the reader and checking which section takes the most time.

Is it typical to read images (such as mine attached above) in a few seconds?

Thanks for your comments.

amitdo commented 8 years ago

... 3.02 version ... AFAIR it was significantly faster on my (old) computer.

3.02.02 is compiled with '-O3' by default. https://github.com/tesseract-ocr/tesseract/blob/3.02.02/configure.ac#L161

3.03 and 3.04 are compiled with '-O2' by default. https://github.com/tesseract-ocr/tesseract/blob/3.03-rc1/configure.ac#L201 https://github.com/tesseract-ocr/tesseract/blob/3.04.01/configure.ac#L300

2.04 and 3.01 are compiled with '-O2' by default. https://github.com/tesseract-ocr/tesseract/blob/2.04/configure.ac https://github.com/tesseract-ocr/tesseract/blob/3.01/configure.ac The 'configure.ac' script in these versions does not explicitly set the '-O' level, so autotools will use '-O2' as the default.

ychtioui commented 8 years ago

Thanks amitdo. I'm using 3.02, but the C/C++ version of Tesseract. I couldn't find the -O3 setting in the source files. Where is it?

amitdo commented 8 years ago

What I linked to was actually 3.02.02

I think this is 3.02: https://github.com/tesseract-ocr/tesseract/blob/d581ab7e12a2fac4a73ac0af4ce7ec522b8f3e42/configure.ac

You are right, it does not contain any '-On' flag, so if you are using autotools to build Tesseract, the compiler will default to '-O2'.

amitdo commented 8 years ago

I assume you are using Tesseract on Linux / FreeBSD / Mac. On Windows + MS Visual C++ the configure.ac file is irrelevant.

Shreeshrii commented 8 years ago

@ychtioui said in a post above "I use VS2010", so he is using Windows.

amitdo commented 8 years ago

Thanks Shree.

I don't know which optimization level is used for Visual C++.

ychtioui commented 8 years ago

I use VS2010 on a Windows 7 PC. Project settings or build options won't change the read speed much. Tesseract was designed in research labs; most of the key sections of the reader were not written with speed in mind.

I used some performance tools to analyze where most of the computation time is spent. In the page layout section, the blob analyzer does a lot of new/delete, which is very time consuming. The attached image above has more than 3600 blobs. Besides, a number of processing steps are run on each blob (distance transform, finding the enclosing rectangle, measuring blob parameters, etc.). The allocation (new) and release (delete) of all these blobs is very time consuming.

If we used a global array of blobs (specifically the BLOBNBOX object) allocated up front, then whenever we needed a blob we would just take the next index from the array. The array would be released once, when we shut down the engine. I used this concept in another single-line OCR reader and it's super fast.

zdenop commented 8 years ago

VS2010 uses the optimization flag /O2 (Maximize Speed); other flags are left at their defaults. In the past there were warnings in the forum against using aggressive compiler optimization flags, as they can also affect OCR results. This is the reason why the standard optimization flags are used (-O2 with autotools and /O2 in VS).

I tried to run the perf tool on Linux (perf record tesseract eurotext.tif eurotext) and got this report (perf report):

  39,77%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  13,98%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  13,09%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   4,22%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   2,66%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   1,48%  tesseract  libtesseract.so.3.0.4  [.] ELIST_ITERATOR::forward
   1,16%  tesseract  libc-2.19.so           [.] _int_malloc
   1,15%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   1,01%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
   0,87%  tesseract  liblept.so.5.0.0       [.] rasteropLow
   0,79%  tesseract  libm-2.19.so           [.] __mul
   0,72%  tesseract  libtesseract.so.3.0.4  [.] FPCUTPT::assign
   0,71%  tesseract  libc-2.19.so           [.] _int_free
   0,71%  tesseract  libtesseract.so.3.0.4  [.] ELIST::add_sorted_and_find
   0,61%  tesseract  libtesseract.so.3.0.4  [.] tesseract::AmbigSpec::compare_ambig_specs
   0,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   0,52%  tesseract  libc-2.19.so           [.] memset
   0,49%  tesseract  libc-2.19.so           [.] vfprintf
   0,45%  tesseract  libc-2.19.so           [.] malloc
   0,36%  tesseract  libtesseract.so.3.0.4  [.] SegmentLLSQ
   0,31%  tesseract  libm-2.19.so           [.] __ieee754_atan2_sse2
   0,31%  tesseract  libc-2.19.so           [.] malloc_consolidate
   0,30%  tesseract  libtesseract.so.3.0.4  [.] LLSQ::add
   0,29%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::operator+=
   0,29%  tesseract  libtesseract.so.3.0.4  [.] _ZN14ELIST_ITERATOR7forwardEv@plt
   0,28%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ComputeFeatures
   0,25%  tesseract  liblept.so.5.0.0       [.] pixScanForForeground
   0,24%  tesseract  libtesseract.so.3.0.4  [.] GenericVector<tesseract::ScoredFont>::reserve
   0,20%  tesseract  libtesseract.so.3.0.4  [.] C_OUTLINE::increment_step
   0,20%  tesseract  [kernel.kallsyms]      [k] clear_page

According to this report, the top 3 functions consumed 66% of the "time".

Then I tried a 4-page (A4) TIFF (G4 compressed):

  52,24%  tesseract  libtesseract.so.3.0.4  [.] tesseract::SquishedDawg::edge_char_of
  12,06%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
  10,06%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
   3,57%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   1,90%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
...

Then I tried a non-English image: perf record tesseract hebrew.png hebrew -l heb:

  27,79%  tesseract  libtesseract.so.3.0.4  [.] IntegerMatcher::UpdateTablesForFeature
  27,34%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeCharNormArrays
   4,40%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::PruneClasses
   3,98%  tesseract  libtesseract.so.3.0.4  [.] ScratchEvidence::UpdateSumOfProtoEvidences
   3,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ComputeNormMatch
   2,36%  tesseract  libtesseract.so.3.0.4  [.] tesseract::ShapeTable::MaxNumUnichars
   2,05%  tesseract  libtesseract.so.3.0.4  [.] tesseract::Classify::ExpandShapesAndApplyCorrections
...

zdenop commented 7 years ago

Just for the record, regarding possible improvements for this issue: there was interesting information posted in the scantailor project: OpenCL alone only brings a ~2x speed-up; another ~6x speed-up comes from multi-threaded processing.

anant-pathak commented 7 years ago

Hi @ychtioui, I am a newbie and saw in your first comment that you are able to get pretty accurate results from Tesseract. For your image itself I am not able to get any results; it tells me: Can't recognize image. Can you please provide the code snippet showing how you are processing the image? Thanks - Anant.

amitdo commented 7 years ago

@theraysmith What do you use in the internal Google build, -O2 or -O3?

paladini commented 7 years ago

I'm interested in the same answer, @amitdo. Can you answer the question, @theraysmith? It really could help us :)

stweil commented 7 years ago

Don't expect much difference between -O2 and -O3. I tried different optimizations, and they only have small effects on the time needed for OCR of a page. Higher optimization levels can even result in slower code because the code gets larger (because of loop unrolling), so CPU caches become less effective. It is much more important to write good code.

theraysmith commented 7 years ago

That is a surprisingly hard question to answer in the Google environment!

I use 'opt' mode, which, after some digging, I found maps to -O2. In addition, the following are explicitly added: -fopenmp, which will deliver a major improvement (3x faster) if you do not have it, plus a corresponding -lgomp for the linker; arch/dotproductavx.cpp is compiled with -mavx; arch/dotproductsse.cpp (and actually all the rest of the code) is compiled with -msse4.1.

I thought all this stuff was in the autotools files already, or are you looking to convert these to Windows?


stweil commented 7 years ago

The improvement from using -fopenmp is useful when you want "realtime" OCR – running OCR for a single page and waiting for the result. Then it is fast because it uses more than one CPU core for some time-consuming parts of the OCR process.

For mass OCR, it does not help. If many pages have to be processed, it is better to use single-threaded Tesseract and run several Tesseract processes in parallel.

amitdo commented 7 years ago

Stefan, what about using OpenMP for training?

stweil commented 7 years ago

Yes, for training a single new model OpenMP could perhaps speed up the training process. Up to now, OpenMP is only used in ccmain/ and in lstm/. I don't know how much that part is used during training, and I have never run a performance evaluation for the training process (in fact I have only run LSTM training once, for Fraktur, and as I already said, it was not really successful).

theraysmith commented 7 years ago

OpenMP speeds up training by about 3.5x, since it runs 4 threads (one for each part of the LSTM) and spends >90% of CPU time computing the LSTM forward/backward.


xlight commented 7 years ago

Can I set more than 4 threads for training the LSTM?

theraysmith commented 7 years ago

No, it doesn't help. The parallelism is limited by the implementation of the LSTM as 4 matrix-vector products. When I experimented with more threads for some of the other operations (e.g. the output softmax), it slowed down because the cache coherency was lost. I also experimented with breaking the matrix-vector products up further (e.g. splitting the input from the recurrent part), but OpenMP doesn't seem too good at allocating the threads in a way that keeps the cache coherency. Each thread needs to run the same part of the weights matrix for each timestep, and that is difficult to achieve with the recurrent nature of the LSTM.
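
The following is not Tesseract's actual code — just a rough OpenMP sketch of the idea Ray describes, with four gate matrix-vector products of an LSTM timestep running on four threads; all names are made up for illustration:

    #include <omp.h>
    #include <vector>

    // Hypothetical helper: y = W * x for a dense row-major matrix.
    static void MatVec(const std::vector<float>& W, const std::vector<float>& x,
                       std::vector<float>& y, int rows, int cols) {
        for (int r = 0; r < rows; ++r) {
            float sum = 0.0f;
            for (int c = 0; c < cols; ++c) sum += W[r * cols + c] * x[c];
            y[r] = sum;
        }
    }

    // Schematic of one LSTM timestep: the four gate products run on four threads.
    void LstmStepGates(const std::vector<float>& Wi, const std::vector<float>& Wf,
                       const std::vector<float>& Wg, const std::vector<float>& Wo,
                       const std::vector<float>& x, int rows, int cols,
                       std::vector<float>& i, std::vector<float>& f,
                       std::vector<float>& g, std::vector<float>& o) {
        #pragma omp parallel sections num_threads(4)
        {
            #pragma omp section
            MatVec(Wi, x, i, rows, cols);   // input gate
            #pragma omp section
            MatVec(Wf, x, f, rows, cols);   // forget gate
            #pragma omp section
            MatVec(Wg, x, g, rows, cols);   // cell candidate
            #pragma omp section
            MatVec(Wo, x, o, rows, cols);   // output gate
        }
        // Gate nonlinearities and the state update would follow, single-threaded.
    }

Because the four products are tied together again at every timestep, adding more threads mostly adds synchronization and cache traffic, which matches the limit described above.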


amitdo commented 7 years ago

What about machines that have only 2 cores? Shouldn't 'num_threads' be lowered to 2 in that case?

theraysmith commented 7 years ago

It still works. It just takes longer.


hanikh commented 7 years ago

@theraysmith I want to train Tesseract 4 for the Arabic language. Do you mean that there is no way to speed up the training process?

ShounakCy commented 6 years ago

"I have more than 20 years in machine vision. I used several OCR engines in the past. Actually I have one -in house- that reads the same image in less than 100ms, but our engine is designed more for reading a single line of text (i.e. it returns a single line of text)."

In the context of your message: I only have to read a single line and it still takes 1 second to process. How did you minimise the processing time?

Thanks

Shreeshrii commented 6 years ago

@zdenop Please label

Performance

ychtioui commented 6 years ago

ShounakCy Our in-house OCR reader is super fast at reading single lines in multiple fonts. It's proprietary (not open source). Tesseract 4.x is much more accurate than 3.x since it uses neural networks. I believe the key to improving Tesseract's speed is to use OpenCL.

ShounakCy commented 6 years ago

What is the cost of your Tesseract 4.x? I would like to integrate the same into our Python or C# code.

Thanks


MattyCi commented 6 years ago

Hi, sorry if this is the wrong place to ask, but how are some users achieving very fast speeds compared to what I am getting? It takes me close to 4 seconds to run the OP's image. This user seems to run a 6-page PDF through Tesseract in a matter of seconds, whereas it takes me minutes to run through that many pages of similar text. I have a Ryzen 3 1200 and 8 GB RAM. I have installed versions 3.02, 3.04, 3.05, and 4.00, all with the same results.

zdenop commented 6 years ago

Yes, this is the wrong place to post questions. As you can see, that user is using the version provided by his distribution, and his speed is related to:

AbdelsalamHaa commented 6 years ago

I'm using Tesseract 3.04 with ara.traineddata, and of course I also use the cube files. Initialization takes too much time: it takes me 15 minutes just to initialize. Any idea how to improve that?

I'm using Visual Studio 2013.

Shreeshrii commented 6 years ago

Please try the latest 4.0 beta with the tessdata_fast files.


AbdelsalamHaa commented 6 years ago

I have tested 4.0; it's very good and fast. The reason why I'm using 3.04 is that I have so many other libraries built with Visual Studio 2013, and Tesseract 4 is not supported in VS 2013. That means if I want to use 4.0 I have to rebuild all the libraries again.

If you have any suggestions, please let me know.

SandeepShaw2017 commented 6 years ago

I am also having a similar issue. I have more than 50K documents. I ran OCR and it took 12 hours to process only 1000 PDFs. How can I make Tesseract fast? Can using Hadoop make it fast?

raffopazzo commented 6 years ago

Have you tried OMP_NUM_THREADS=1 tesseract ... as described in https://github.com/tesseract-ocr/tesseract/issues/898 ?

SandeepShaw2017 commented 6 years ago

How do I use "OMP_NUM_THREADS=1 tesseract" in R?

amitdo commented 6 years ago

Have you tried OMP_NUM_THREADS=1 tesseract ... as described in #898 ?

OMP_NUM_THREADS=1 will have no impact.

https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-300549643

Something that DOES work: https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167

raffopazzo commented 6 years ago

@amitdo oops I copied from the wrong comment. Indeed OMP_THREAD_LIMIT=1 tesseract... is what worked for me
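
If you are driving Tesseract through the C++ API rather than the command line, a minimal sketch of the same idea could look like the following. This assumes a POSIX system (so setenv is available) and the standard TessBaseAPI header; the key point is that the variable has to be set before the first call that starts OpenMP threads:

    #include <cstdlib>

    #include <tesseract/baseapi.h>  // tesseract::TessBaseAPI

    int main() {
        // Must be set before the OpenMP runtime is initialized,
        // i.e. before the first Tesseract call that uses OpenMP.
        setenv("OMP_THREAD_LIMIT", "1", /*overwrite=*/1);

        tesseract::TessBaseAPI api;
        if (api.Init(nullptr, "eng") != 0) return 1;
        // ... api.SetImage(...); api.GetUTF8Text(); ...
        api.End();
        return 0;
    }

From a shell you would simply prefix the command, as in the comment above.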

SandeepShaw2017 commented 6 years ago

I am still not clear on how to improve the speed. My code is in R and I used the ocr function. Where should I use "OMP_THREAD_LIMIT=1 tesseract..."?

raffopazzo commented 6 years ago

@SandeepShaw2017 I'm not sure I can help. I don't know much about R, so I can only give some general advice. If you are calling Tesseract's functions directly from your R code, then you probably have to set the variable when running your own app, e.g. from the command line with OMP_THREAD_LIMIT=1 ./my-R-script, or via something like Sys.setenv(OMP_THREAD_LIMIT = 1). If you use tesseract as an application that your R code executes (e.g. via system() or similar), then you need to set the environment variable OMP_THREAD_LIMIT=1 for that process in whatever way R does it, or maybe via the same method as in the former case, if the child process inherits the environment variables. You should do your own googling; this seems to be an R-specific issue rather than Tesseract's.

SandeepShaw2017 commented 6 years ago

Setting Sys.setenv(OMP_THREAD_LIMIT = 1) still takes more than 20 seconds. Can processing with R Hadoop rmr2 help reduce the processing time?

amitdo commented 6 years ago

Use multi-threading in your application. Initialize N instances of TessBaseAPI. N should be the number of CPU cores. Each instance should handle a different image.
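
A minimal C++ sketch of this suggestion (assuming the standard TessBaseAPI and Leptonica calls; error handling and output collection are left out): each worker thread owns its own TessBaseAPI instance and OCRs a disjoint slice of the image list.

    #include <algorithm>
    #include <functional>
    #include <string>
    #include <thread>
    #include <vector>

    #include <leptonica/allheaders.h>   // pixRead, pixDestroy
    #include <tesseract/baseapi.h>      // tesseract::TessBaseAPI

    // One worker per CPU core; each worker owns its own TessBaseAPI instance
    // and processes every n-th image starting at `begin`.
    void OcrSlice(const std::vector<std::string>& files, size_t begin, size_t step) {
        tesseract::TessBaseAPI api;
        if (api.Init(nullptr, "eng") != 0) return;
        for (size_t i = begin; i < files.size(); i += step) {
            Pix* image = pixRead(files[i].c_str());
            if (!image) continue;
            api.SetImage(image);
            char* text = api.GetUTF8Text();
            // ... store or print `text` here ...
            delete[] text;
            pixDestroy(&image);
        }
        api.End();
    }

    int main() {
        std::vector<std::string> files = {/* image paths */};
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < n; ++t)
            workers.emplace_back(OcrSlice, std::cref(files), t, n);
        for (auto& w : workers) w.join();
        return 0;
    }

The same pattern applies to the R or Python bindings: start one OCR worker per core and give each its own engine instance.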

SandeepShaw2017 commented 6 years ago

Dear Amit, I have 4 cores, so does that mean I should be running the ocr tool in 4 consoles of RStudio?

amitdo commented 6 years ago

I don't know R. Just try and see.