tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

With multiple languages, tessedit_do_invert=0 and other parameters only work for the first one #3037

Open cyh1220 opened 4 years ago

cyh1220 commented 4 years ago

Hi everyone,

I'm not sure whether this is the default behavior, but I found that with multiple languages tessedit_do_invert=0 only takes effect for the first one.

Environment

tesseract 4.1.1
leptonica-1.78.0
libgif 5.2.1 : libjpeg 6b (libjpeg-turbo 1.2.1) : libpng 1.2.49 : libtiff 3.9.4 : zlib 1.2.3
Found AVX
Found SSE
Linux xian 2.6.32-754.el6.x86_64 #1 SMP Tue Jun 19 21:26:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

Here is the testing image: original

If I run time tesseract original.png stdout -l eng --oem 1 --psm 1 -c tessedit_do_invert=1

India and China went to war in 1962 over the same Himalayan region where at least 20 soldiers were killed Monday night in a bloody confrontation between the two sides. A little under six decades ago, one month of combat resulted in a Chinese military victory, with Beijing declaring a cease-fire after securing de facto control of Aksai Chin, an area claimed by both countries. The month-long battle claimed the lives of around 700 Chinese troops and approximately double that on the Indian side. But the militaries that face off in the Himalayas today are far different from those that fought 58 years ago.

tesseract original.png stdout -l eng --oem 1 --psm 1 -c tessedit_do_invert=1 6.41s user 0.78s system 98% cpu 7.281 total

Then I run time tesseract original.png stdout -l eng --oem 1 --psm 1 -c tessedit_do_invert=0

India and China went to war in 1962 over the same Himalayan region where at least 20 soldiers were killed Monday night in a bloody confrontation between the two sides. A little under six decades ago, one month of combat resulted in a Chinese military victory, with Beijing declaring a cease-fire after securing de facto control of Aksai Chin, an area claimed by both countries. The month-long battle claimed the lives of around 700 Chinese troops and approximately double that on the Indian side. But the militaries that face off in the Himalayas today are far different from those that fought 58 years ago.

tesseract original.png stdout -l eng --oem 1 --psm 1 -c tessedit_do_invert=0 4.55s user 0.53s system 98% cpu 5.132 total

Clearly tesseract runs faster. Then I tested an inverted image (white text on a dark background): inverted1

I run time tesseract inverted1.png stdout -l eng --oem 1 --psm 1 -c tessedit_do_invert=0

[Te IEE: Tale Nea Te ERY oI Co RV RL We ALP A oI d TRL LW a [Ta EE ENE ON (Te Ce RY LT (SR A ETN PIO Ro) (o [TT RNVIT CW CI (To MYT a Ye FAVA aT Ted oY Ta IF: We] [oYoTe \YA Wo) oY go) q) k=Yulo) oW o TI ANVLT=T sds TRAV} Se SEIN [Nad COTY [TR Ve [=Yor=Yo [ES Te {ol oT a TN ga YoT a1 ds Wo) lle) gl oF: A {01 L (To RTS N= @ gg FoI) TIT =TaVAY/ oi do] VAT do Wa T=T [Tg Velo [Tol EY Vo fe Wels {1 =X | = YT el0 [a= Re [=B £=Tor do Melo) glu fo] Wo} § A CET Oa MET WET (= Rol ET T=To Mo) YA oY) d o Wolo Uo i gT-ET Ne V=Wa Ye) aN da Bi [oa Vo ql ol ud [SH Egg TTe Rdg I<) (N16) i100] ole IA 0H ® 113 TIS -Ri fo ToT o 1: oo R=To] ef fo (10g EY (1 \Ve [o10] o] [Rd oY A oY a Wd a TW [SV [ETA BS To [oH INR NI ETT a El A Elo Re i aR [WR [TE EVE IR Cee ENACT (RE To [1 i {<1 ¢=Ta 1 di gel g ad a Le Y=) LEY (oe NH AVIE 1S To (0

tesseract inverted1.png stdout -l eng --oem 1 --psm 1 -c tessedit_do_invert=0 4.50s user 0.53s system 98% cpu 5.096 total

As expected, tesseract cannot recognize the inverted image with tessedit_do_invert=0.

But when I run with multiple languages like time tesseract inverted1.png stdout -l chi_tra+eng --oem 1 --psm 1 -c tessedit_do_invert=0

India and China went to war in 1962 over the same Himalayan region where at least 20 soldiers were killed Monday night in a bloody confrontation between the two sides. A little under six decades ago, one month of combat resulted in a Chinese military victory, with Beijing declaring a cease-fire after securing de facto control of Aksai Chin, an area claimed by both countries. The month-long battle claimed the lives of around 700 Chinese troops and approximately double that on the Indian side. But the militaries that face off in the Himalayas today are far different from those that fought 58 years ago.

tesseract inverted1.png stdout -l chi_tra+eng --oem 1 --psm 1 -c 10.08s user 1.34s system 98% cpu 11.545 total

The result is correct!!! Then I test time tesseract inverted1.png stdout -l eng+chi_tra --oem 1 --psm 1 -c tessedit_do_invert=0

India and China wentto war in 1962 overthe same Himalayan region where at least 20 soldiers were killed Monday nightinabloody confrontation between thetwo sides. Alittle under six decades ago, one month of combat resulted inaChinese military victory with Beijing declaring a cease-fire after securing de facto control of PE 二 Ca 加 三 ELEE 三 上 lives of around 700 Chinese troops and approximately double that on the Indian side. Butthe militaries that face offin the Himalayas today are far different from those that fought 58 years ago.

tesseract inverted1.png stdout -l eng+chi_tra --oem 1 --psm 1 -c 22.35s user 3.22s system 96% cpu 26.538 total

So it seems that tessedit_do_invert=0 only works for the first language? If so, I can't get the "full benefit" of tessedit_do_invert for images that contain multiple languages. For example: mix

If I run tesseract inverted1.png stdout -l eng+chi_tra --oem 1 --psm 1 -c tessedit_do_invert=0, only the English model skips the check for inverted text.

Expected Behavior:

If I set tessedit_do_invert=0, it means that I'm sure the image has no inverted text, so none of the languages should check for inverted text.

stweil commented 2 years ago

I am afraid that none of the parameters in class Tesseract is currently set for the second (and following) language, so this issue affects more than just tessedit_do_invert (see the full list in tesseractclass.h).

For each language there is a separate Tesseract object. Simply copying the parameter values from the first object to the second object does not work because parameters are not only set from the command line.

stweil commented 2 years ago

A simple workaround is to write all parameters into a parameter file and pass that file instead of setting the parameters on the command line. Create a file named noinvert containing the single line tessedit_do_invert 0, and pass noinvert to tesseract instead of -c tessedit_do_invert=0. To confirm that the parameter file is parsed for each language, you can add a second line with some invalid parameter (mytest 0): tesseract then prints one warning per language.
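A minimal shell sketch of this workaround, assuming the parameter file is created in the current working directory and reusing the image name and language list from the report. The guard around the tesseract call is an addition for illustration only: it skips the call when tesseract or inverted1.png is not available.

```shell
# Write the one-line parameter file (name/value separated by a space, no "=").
printf 'tessedit_do_invert 0\n' > noinvert

# Pass the file name as a trailing argument in place of -c tessedit_do_invert=0.
# Assumption: tesseract and inverted1.png are available; otherwise skip the call.
if command -v tesseract >/dev/null 2>&1 && [ -f inverted1.png ]; then
  tesseract inverted1.png stdout -l eng+chi_tra --oem 1 --psm 1 noinvert
fi
```

To run the check for per-language parsing, append the invalid line with printf 'mytest 0\n' >> noinvert and rerun the same command.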