tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

Windows Build 577% Slower than Linux Build #1307

Closed leemorton closed 5 years ago

leemorton commented 6 years ago

Environment

Current Behavior:

An identical machine spec with an identical workload and Tesseract configuration results in consistently 577% slower performance on Windows 10 x64 compared with Debian Stretch x64. Essentially, the job takes 18 seconds on average with the Linux build and 1 minute 44 seconds with the Windows build. This has been tested on other machines and on fresh installations.

Expected Behavior:

Significantly less than 577% difference in performance.

What could be causing the win build to experience that level of overhead...?
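
For reference, the arithmetic behind the headline figure can be checked with a short sketch. The two timings are the ones quoted in the report; the function names are illustrative only. Taking 104 s instead of 18 s means the Windows run takes about 5.78 times as long, which in the strict sense is roughly 478% slower.

```python
def runtime_ratio(baseline_s, other_s):
    """How many times longer the other run takes than the baseline."""
    return other_s / baseline_s


def percent_slower(baseline_s, other_s):
    """Extra time of the other run, as a percentage of the baseline."""
    return (other_s - baseline_s) / baseline_s * 100.0


linux_s = 18.0            # Debian Stretch timing from the report
windows_s = 1 * 60 + 44   # 1 minute 44 seconds = 104 s on Windows 10

print(f"ratio: {runtime_ratio(linux_s, windows_s):.2f}x")              # ratio: 5.78x
print(f"percent slower: {percent_slower(linux_s, windows_s):.0f}%")    # percent slower: 478%
```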

stweil commented 6 years ago

Could you please repeat your test with environment variable OMP_THREAD_LIMIT=1 (see https://github.com/tesseract-ocr/tesseract/issues/1081) and report the results?

I expect the difference will be much smaller then. Multithreading on Windows does not perform very well. For a single-threaded Tesseract there should be nearly no difference, because the code was generated by the same kind of compiler (gcc) in both cases.

Which version is UB-Mannheim/4.00.00alpha? The latest is tesseract-ocr-setup-4.0.0-alpha.20180109.exe, did you use that one?

leemorton commented 6 years ago

Gave it a try with OMP_THREAD_LIMIT=1, and then also added OMP_NUM_THREADS=1. The best it came up with was 2 minutes 48 seconds. Took the environment variables away again and got 1 minute 35 seconds. With or without those settings, Task Manager shows all the logical processors spiking; however, the spikes are much more erratic with OMP_THREAD_LIMIT=1, and usage is more consistently high, with no dips, without it.

tesseract-ocr-setup-4.0.0-alpha.20180109.exe is the version in use.

I also seem to get this error at the end of OCRing, with or without those environment variables. I doubt it's related, but: Detected 32 diacritics contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../ccutil/unicharset.h, line 513

stweil commented 6 years ago

For further investigation, more information is needed. Could you provide your test image somewhere? Which traineddata do you use? What does the command line look like?

leemorton commented 6 years ago

Here is the command: tesseract -l eng -c include_page_breaks=1 --psm 1 --oem 3 "in.multipage" "out" hocr tsv

Unfortunately I can't provide the exact images from this example as they contain personal details. But they are 12 PNGs, all 2480x3508 pixels @ 72ppi & 8 bit depth. They are PNGs generated by ImageMagick from a PDF. However, the performance issue is experienced on all PNGs (converted from any other format), not just this document.

I am using the original tessdata provided with tesseract-ocr-setup-4.0.0-alpha.20180109.exe
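
The comparison discussed so far (default threading versus OMP_THREAD_LIMIT=1) could be repeated with a small timing harness along these lines. This is only a sketch: the command line is the one quoted in the report, while the helper names, the number of runs, and the best-of-N approach are assumptions made for illustration. Note that the "default" case inherits whatever OMP_* variables are already set in the calling environment.

```python
import os
import subprocess
import time

# Command line as quoted in the report; input and output names are the reporter's.
TESSERACT_CMD = [
    "tesseract", "-l", "eng", "-c", "include_page_breaks=1",
    "--psm", "1", "--oem", "3", "in.multipage", "out", "hocr", "tsv",
]


def time_command(cmd, extra_env=None):
    """Run cmd once and return the wall-clock time in seconds.

    extra_env (hypothetical helper argument) is merged over the inherited
    environment, so it can add or override variables such as OMP_THREAD_LIMIT.
    """
    env = os.environ.copy()
    env.update(extra_env or {})
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)  # raises if the command fails
    return time.perf_counter() - start


def compare_thread_limits(cmd, runs=3):
    """Return the best-of-`runs` timing with and without OMP_THREAD_LIMIT=1."""
    configs = {
        "default": {},
        "OMP_THREAD_LIMIT=1": {"OMP_THREAD_LIMIT": "1"},
    }
    return {
        name: min(time_command(cmd, env) for _ in range(runs))
        for name, env in configs.items()
    }
```

Calling `compare_thread_limits(TESSERACT_CMD)` on both machines would give directly comparable numbers, provided the same Tesseract version and traineddata are installed on each.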

stweil commented 6 years ago

Thanks for that information. "Original tessdata" means that you are using eng.traineddata from https://github.com/tesseract-ocr/tessdata/. That model supports two different OCR engines (old and LSTM), and with --oem 3 you implicitly selected the LSTM engine. The tesseract-ocr package which is part of Debian Stretch would use the old engine (which is much faster).

Meanwhile, better models exist for Tesseract 4: get eng.traineddata from https://github.com/tesseract-ocr/tessdata_best for best results, or from https://github.com/tesseract-ocr/tessdata_fast for fast OCR with good results. Those new models only support LSTM, not the old OCR engine.
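
To compare like with like, the engine can be pinned explicitly instead of relying on `--oem 3`. The sketch below only builds the command lines. `--oem` (0 = legacy only, 1 = LSTM only, 2 = legacy + LSTM, 3 = default) and `--tessdata-dir` are Tesseract 4 command-line options; the helper name, the image and output names, and the directory path are hypothetical.

```python
# Engine modes accepted by Tesseract 4's --oem option.
OEM_MODES = {"legacy": "0", "lstm": "1", "legacy+lstm": "2", "default": "3"}


def build_tesseract_cmd(image, out_base, engine="default",
                        tessdata_dir=None, lang="eng"):
    """Build a tesseract argument list with an explicit engine mode.

    tessdata_dir, if given, points at a directory holding the traineddata
    files (for example a local checkout of tessdata_fast).
    """
    cmd = ["tesseract", image, out_base, "-l", lang, "--oem", OEM_MODES[engine]]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


# Legacy engine with the original tessdata (which supports both engines).
print(build_tesseract_cmd("page.png", "out", engine="legacy"))
# LSTM engine with a hypothetical local copy of the tessdata_fast models.
print(build_tesseract_cmd("page.png", "out", engine="lstm",
                          tessdata_dir="/path/to/tessdata_fast"))
```

Since the tessdata_best and tessdata_fast models only support LSTM, the "legacy" mode is only meaningful with the original tessdata.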

egorpugin commented 6 years ago

BTW, it is worth comparing with MSVC builds. I'm a bit sceptical about the MinGW-w64 builds from UB Mannheim, and about MinGW-w64 in general. It could add another layer of wrappers around WinAPI via Linux pseudo-syscalls.

stweil commented 6 years ago

MSVC has a good reputation regarding code quality and might have a better implementation of OpenMP than gcc for Windows. As I said before, MinGW-w64 (and therefore also the UB Mannheim executables) uses gcc, so that's the same binary code for central parts (like dot product) as the Linux code. Therefore there should be only a small difference for single threaded Tesseract.

Shreeshrii commented 6 years ago

@zdenop Please label

Performance

Shreeshrii commented 6 years ago

In order to add jp2 lib, I just built both leptonica and tesseract using cmake with default options.

I find that OCR with this build is much, much slower than with the version I had built with autotools/make.

This may have to do with the fact that with autotools, while running configure I had disabled openmp, opencl and graphics.

How can I disable these three when building with cmake?

Shreeshrii commented 6 years ago

@egorpugin How can I disable openmp, opencl and graphics while building tesseract with cmake for running on linux? Since I built leptonica with it, I have to use the same for tesseract (otherwise there are library name issues).

egorpugin commented 6 years ago

The best way for now is to remove those options from CMakeLists.txt. But I'm not very sure about the linux cmake builds. They are largely untested, sorry.

zdenop commented 5 years ago

I don't think removing options is a good idea.

Please keep in mind end users who don't want to or can't compile tesseract themselves. I think that if an option has no huge side effect or can easily be turned off, it should be compiled in.

stweil commented 5 years ago

@leemorton, could you please repeat your performance test with the latest 64 bit installer? I assume that you used 32 bit Tesseract on Windows and 64 bit Tesseract on Linux, so that might explain some performance differences.

stweil commented 5 years ago

contains_unichar_id(unichar_id):Error:Assert failed:in file ../../../../ccutil/unicharset.h, line 513

That bug was recently fixed.

stweil commented 5 years ago

I close this issue as there was no recent activity and recent code does not show large differences for the performance on Linux and Windows.