tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

multilingual ocr ara+eng #2626

Open roozgar opened 5 years ago

roozgar commented 5 years ago

I'm testing Tesseract for Arabic. The text in my image is Arabic and English. When I use eng+ara on my image, it shows a lot of invalid English characters. When I use ara+eng it works correctly, but with very low accuracy.

Is this normal, or must I do something before starting the scan?

Environment

chrys87 commented 5 years ago

I can confirm this. I see similar behaviour with German.

stweil commented 5 years ago

@chrys87, which models did you use? Can you add an example image, so it is possible to reproduce the issue?

Shreeshrii commented 5 years ago

Another issue regarding multi-language recognition is reported in forum at https://groups.google.com/d/msgid/tesseract-ocr/66e7ba26da873cc265cf82f0c65fbe69%40posteo.net

chrys87 commented 5 years ago

> @chrys87, which models did you use? Can you add an example image, so it is possible to reproduce the issue?

I don't have a special image. I created a tool that takes a screenshot of the current window and runs OCR on it: https://github.com/chrys87/ocrdesktop

I use the languages -l deu+eng. To reproduce it, just take a screenshot and run tesseract -l deu+eng screenshot.png

It recognizes special German characters like ÄÖÜäöüß badly.

I attached a simple example screenshot (made with LO Writer). Here is my output:

13:26 [chrys@blackbeast Bilder] :) $ tesseract Screenshot_tesseract.png test -l eng+deu
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
13:26 [chrys@blackbeast Bilder] :) $ cat test.txt 
Das ist ein Test OSA46

o1
später
Spaß

13:27 [chrys@blackbeast Bilder] :) $ tesseract Screenshot_tesseract.png test -l deu+eng
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
13:27 [chrys@blackbeast Bilder] :) $ cat test.txt 
Das ist ein Test Ö5Ääß

Öl
später
Spaß

Correct would be:

Das ist ein Test ÖöÄäß

Öl
später
Spaß

As the reporter wrote, it doesn't work at all for eng+deu, and it is inaccurate for deu+eng (IMO, given that there is only a handful of words).
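To put a number on "inaccurate", the outputs above can be scored against the expected text with a character error rate. The following is a minimal sketch, not something used in this thread: the function names are made up for illustration, and it uses plain Levenshtein distance on the first line of each result quoted above.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    # Normalise so that an umlaut is one code point regardless of input form.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def char_error_rate(expected: str, actual: str) -> float:
    """Edit distance divided by the length of the expected text."""
    expected = unicodedata.normalize("NFC", expected)
    if not expected:
        raise ValueError("expected text must not be empty")
    return edit_distance(expected, actual) / len(expected)


expected = "Das ist ein Test ÖöÄäß"
eng_deu = "Das ist ein Test OSA46"   # first line of the eng+deu output above
deu_eng = "Das ist ein Test Ö5Ääß"   # first line of the deu+eng output above

print(edit_distance(expected, eng_deu))  # 5
print(edit_distance(expected, deu_eng))  # 1
```

On this one line, eng+deu gets all five special characters wrong, while deu+eng misses only one (ö read as 5).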

chrys87 commented 5 years ago

Here is the screenshot:

[image: Screenshot_tesseract]

zdenop commented 5 years ago

@chrys87: Can you reply to Stefan's question? Did you try the instructions provided on the wiki?

chrys87 commented 5 years ago

> @chrys87: Can you reply to Stefan's question? Did you try the instructions provided on the wiki?

Who is Stefan? Did I miss a question? No, I didn't try them, as there is no issue with scanning or similar (it's a screenshot). Alpha is removed from OCRdesktop.

with version 3.X it works perfectly in those simple situations.

It is of course logical to me that a screenshot creates noise. But this noise is also helpful to blind users, as it can indicate an arrow (as in a menu) or symbols for check boxes. The screenshot above doesn't contain anything like that, though.

amitdo commented 5 years ago

See also #1579

amitdo commented 5 years ago

... and #683

Shreeshrii commented 5 years ago

Also see https://groups.google.com/d/msgid/tesseract-ocr/CAOYxz4rt61etBF%2BXgdzqRLDFs72h_KJ4mj1yYCt5dSbOrGusCw%40mail.gmail.com regarding eng+urd

Shreeshrii commented 5 years ago

Copying Ray's comment from https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375027879

> I did have an idea for a better multi-language implementation that would cleanly use models from multiple languages at once, but that depends on getting rid of the old code, and moving the multi-language functionality into the beam search. Until the old code is gone, that would be very messy.

@stweil @noahmetzger @bertsky Is this something that can be done to improve multi language recognition?

amitdo commented 5 years ago

> with version 3.X it works perfectly in those simple situations.

Versions >=4.0 still support the old OCR engine.

You can use it by using --oem 0 with an old traineddata.
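For comparing the two engines across both language orders, the command lines can be generated instead of typed by hand. This is only a sketch: the helper name `tesseract_cmd`, the image name and the legacy tessdata path are placeholders, and it assumes a traineddata directory that still contains the legacy engine components. The flags `-l`, `--oem` and `--tessdata-dir` are regular Tesseract command-line options.

```python
import shlex
from typing import List, Optional


def tesseract_cmd(image: str, langs: List[str], oem: Optional[int] = None,
                  tessdata_dir: Optional[str] = None) -> List[str]:
    """Build a tesseract command line that writes the recognised text to stdout."""
    cmd = ["tesseract", image, "-"]  # "-" as output base means stdout
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += ["-l", "+".join(langs)]   # the order of langs is preserved
    if oem is not None:
        cmd += ["--oem", str(oem)]   # 0 selects the legacy engine
    return cmd


# Hypothetical directory with traineddata files that include the legacy engine.
legacy_dir = "/path/to/legacy/tessdata"

for langs in (["deu", "eng"], ["eng", "deu"]):
    print(shlex.join(tesseract_cmd("screenshot.png", langs)))
    print(shlex.join(tesseract_cmd("screenshot.png", langs, oem=0,
                                   tessdata_dir=legacy_dir)))
```

Each returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` to collect the output for a side-by-side comparison.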

chrys87 commented 5 years ago

> > with version 3.X it works perfectly in those simple situations.
>
> Versions >=4.0 still support the old OCR engine.
>
> You can use it by using --oem 0 with an old traineddata.

I will give it a shot and reply :).

bertsky commented 5 years ago

@Shreeshrii, I have a rough idea what is meant by that and yes, this is something worthwhile doing. But please keep in mind that the existing multi-model/language code does work very well with LSTM models already, even with many at once!

chrys87 commented 5 years ago

Using --oem 0 seems to be a lot more accurate here for "umlauts" like äÄöÖüÜß. Edit: I just played around with it a little more, and yes, it's a lot more accurate than without --oem 0.

bertsky commented 5 years ago

We should definitely try to find the error in the existing code first, before we write new multi-language implementation within the beam search itself.

bertsky commented 5 years ago

At a glance, it seems this problem is somewhat restricted to combinations which have dissimilar (although overlapping) unicharsets. Can anyone confirm that? E.g. replacing eng with Latin when combining with deu, does the umlaut problem go away? Or replacing eng with Arabic when combining with ara, do the invalid characters disappear?
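One way to check how dissimilar two unicharsets are is to unpack the models and compare the character lists directly. The sketch below assumes the plain-text unicharset layout: a count on the first line, then one entry per line whose first whitespace-separated field is the unichar. The function names are invented for this example, and the file names in the usage note depend on the prefix you choose when unpacking.

```python
from typing import Dict, Set


def parse_unicharset(text: str) -> Set[str]:
    """Collect the first field of every entry line (the first line is the entry count)."""
    units = set()
    for line in text.splitlines()[1:]:
        fields = line.split()
        if fields:
            units.add(fields[0])
    return units


def load_unicharset(path: str) -> Set[str]:
    """Read an unpacked unicharset file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_unicharset(handle.read())


def compare_unicharsets(first: Set[str], second: Set[str]) -> Dict[str, object]:
    """Summarise the overlap between two unicharsets."""
    shared = first & second
    union = first | second
    return {
        "shared": len(shared),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }
```

The model components can be unpacked with `combine_tessdata -u deu.traineddata deu.`, which for an LSTM model should produce a file such as `deu.lstm-unicharset`. Something like `compare_unicharsets(load_unicharset("deu.lstm-unicharset"), load_unicharset("eng.lstm-unicharset"))` then lists the characters that only one of the two models knows.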

roozgar commented 5 years ago

@chrys87 I used --oem 0 and got "Failed loading language". Which language data did you use? To get better accuracy, did you compare with https://github.com/tesseract-ocr/tessdata_best ?

chrys87 commented 5 years ago

> @chrys87 i used --orm 0 and got "Failed loading language" what language data you used , to get better accuracy did you compared with https://github.com/tesseract-ocr/tessdata_best ?

I used --oem 0, not --orm 0, just to be sure :). My bad, I do not know what is shipped by default in my distro. It's Arch Linux.

chrys87 commented 5 years ago

> At a glance, it seems this problem is somewhat restricted to combinations which have dissimilar (although overlapping) unicharsets. Can anyone confirm that? E.g. replacing eng with Latin when combining with deu, does the umlaut problem go away? Or replacing eng with Arabic when combining with ara, do the invalid characters disappear?

Are you talking about tesseract-data-lat? I tried this. deu+lat is still as bad as deu+eng. Using -l deu (without +eng) improves the situation slightly, but as noted above, with --oem 0 it is a lot more accurate than without.

chrys87 commented 5 years ago

by the way some system information (:

16:37 [chrys@blackbeast ocrdesktop] master :( $ uname -a
Linux blackbeast 5.2.9-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 16 11:29:43 UTC 2019 x86_64 GNU/Linux

16:37 [chrys@blackbeast ocrdesktop] master :) $ cat /etc/lsb-release 
LSB_VERSION=1.4
DISTRIB_ID=Arch
DISTRIB_RELEASE=rolling
DISTRIB_DESCRIPTION="Arch Linux"

16:36 [chrys@blackbeast ocrdesktop] master :( $ tesseract -v
tesseract 4.1.0
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3
 Found AVX
 Found SSE
 Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.1 libzstd/1.4.0

16:40 [chrys@blackbeast ocrdesktop] master :( $ tesseract --list-langs
List of available languages (4):
deu
eng
lat
osd

bertsky commented 5 years ago

> Are you talking about tesseract-data-lat? I tried this. deu+lat is still as bad as deu+eng. Using -l deu (without +eng) improves the situation slightly, but as noted above, with --oem 0 it is a lot more accurate than without.

No, not at all. This would be the Latin language, but I was referring to the Latin script model Latin.traineddata, which (on Debian/Ubuntu) is in the pkg tesseract-ocr-script-latn.

zdenop commented 5 years ago

@chrys87 : stefan is @stweil ;-) I am sorry I did not make it clear. He was asking what model you used...

By referring to the wiki I mean that you should focus on image preprocessing. BTW: there is a similar tool for Windows, Capture2Text, which provides the following result for German:

[image]

It seems to use 3.x tessdata (from 2015?) with 4.00alpha. I did not investigate it deeply. It is Qt-based, so I expect that with some adaptation it should be possible to run it on Linux...

Shreeshrii commented 5 years ago

ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l eng
Das ist ein Test OSA4G

o1
spiter
Spas

ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l deu
Das ist ein Test Ö5Ääß

Öl
später
Spaß

ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l eng+deu
Das ist ein Test Ö5Ääß

Öl
später
Spaß

ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l deu+eng
Das ist ein Test Ö5Ääß

Öl
später
Spaß

ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l script/Latin
Das ist ein Test ÖöÄäß

Öl
später
Spaß

Shreeshrii commented 5 years ago

> I used --oem 0 and got "Failed loading language"

@roozgar That is the case for RTL and Indic languages, since their legacy models were dependent on cube related code which has been removed from tesseract 4.

Shreeshrii commented 5 years ago

ara-hin-eng

ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l eng

S as TFT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKata); total, aggregate, product.

H &.) UIT aigun [S. FT-HTOT], s.m. Unskilfulness, stupidity, &c.=augun,
q.v.

A Lliyal gyal, uyyal, sam. Stag; deer, hart; wild goat.
SM) TdT ela, s.f. Cardamoms. (See ilacr.)
H ay) SATH lam, s.m. Auction, public sale (=lilam, nilam, q.q.v.).

P  _=Ll elér, s.m. Ambassador, envoy, delegate, agent:—el¢t karna, To

ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l eng+ara


5 ‏ايكيه‎ TFT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKata); total, aggregate, product.

H ‏ايكن‎ UIT aigun [S. FT-HTOT], s.m. Unskilfulness, stupidity, &c.=augun,
q.v.

A ‏زوم ايل‎ gyal, uyyal, sam. Stag; deer, hart; wild goat.
5 ‏ايلا‎ TdT ela, s.f. Cardamoms. (See ilacr.)
H ‏ايلام‎ SATH lam, s.m. Auction, public sale ) 1712771, nilam, q.q.v.).

P ‏ماه ايلجى‎ s.m. Ambassador, envoy, delegate, agent:—el¢t karna, To

ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l script/Arabic


S ‏ایکیه‎ UAT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKatda); total, aggregate, product.

H ‏این‎ WI] aigunı [S. 3TH], s.m. Unskilfulness, stupidity, &c.=augun,
q.V.

A Jel Fy]. yal, uyyal, s.m. Stag; deer, hart; wild goat.
S YY TT ela, s.f. Cardamoms. (See ilac.)
H ‏ايام‎ ŠTX lam, s.m. Auction, public sale (=IHldam, nilam, q.q.v.).

P zlyl elér, s.m. Ambassador, envoy, delegate, agent:—elcr karnd, To

ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l eng+hin


S as TFT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKata); total, aggregate, product.

H 0६ ऐसुण aigun [S. अव+्गुण], s.m. Unskilfulness, stupidity, &c.=augun,
q.v.

A (3 कण, gyal, uyyal, sam. Stag; deer, hart; wild goat.
S अ.। TAT ela, s.f. Cardamoms. (See ilaci.)
H ay) SATH lam, s.m. Auction, public sale (=lilam, nilam, q.q.v.).

P  _=Ll टी, s.m. Ambassador, envoy, delegate, agent:—el¢7 karna, To

ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l script/Devanagari


S Sal ऐक्य aikya, S.m. Oneness, unity, singleness, identity, sameness,
harmony (=eRata); total, aggregate, product.

पत [6 ऐगुण aig [S. अव+गुण], ऽ.m. Unskilfulness, stupidity, &c.=augun,
q.v.

A Jl Fa] ayal, uyyal, s.m. Stag; deer, hart; wild goat.
S X| एला ९la, s.f. Cardamoms. (See ildct.)
H ~| इलाम am, s.m. Auction, public sale (=lildm, nildm, q.q.v.).

P LI elect, s.m. Ambassador, envoy, delegate, agent:—elct Karna, To

ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l script/Devanagari+ara


5 ‏ايكيه‎ ऐक्य aikya, S.m. Oneness, unity, singleness, identity, sameness,
harmony (=eRata); total, aggregate, product.

11 ‏ايكن‎ ऐगुण ‏سروه‎ ]5. अव+गुण], ऽ.m. Unskilfulness, stupidity, &c.=augun,
q.v.

A ‏ايل‎ Fa] ayal, uyyal, s.m. Stag; deer, hart; wild goat.
5 ‏ايلا‎ एला ‏,م1‎ 5.1. Cardamoms. (See ildct.)
11 ‏ايلام‎ इलाम am, s.m. Auction, public sale (=lildm, nildm, q.q.v.).

P ‏ايلجى‎ elect, s.m. Ambassador, envoy, delegate, agent:—elct Karna, 10 -

Shreeshrii commented 5 years ago

@bertsky Result is better for script/Latin than eng+deu in the example screenshot of German text with umlauts.

roozgar commented 5 years ago

@Shreeshrii yes, I used it for Arabic. There are three sets of language data available on GitHub, and I don't know which is better. Their results differ in some cases, and I cannot say which one is better. For example, ara from (1) detects ',' correctly, but (2), which should be the best, detects ',' as '«'.

The files I tested are these:
1) https://github.com/tesseract-ocr/tessdata
2) https://github.com/tesseract-ocr/tessdata_best
3) https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

bertsky commented 5 years ago

@Shreeshrii, so your tests (just like my own) do not confirm the observations by @chrys87 that results depend on the order of sub-languages/models (as in eng+deu vs deu+eng). But maybe this is a matter of what repository the models are from (tessdata, tessdata_fast, tessdata_best)?

Also, I cannot see any invalid characters anywhere, as was claimed by the OP. It's just misrecognized characters, and only from the languages that were actually loaded. In the German case, it's only a minor error (ö does look somewhat similar to 5) in a random, highly idiomatic string. Even if the old model happens to do better here, I don't think this could be called a regression at all. (IIRC, the old models are better in a statistically significant way under some conditions, but are generally outperformed by the new ones.)

And of course, users can always add dictionaries or query alternative symbols (the latter only via API).

Shreeshrii commented 5 years ago

@bertsky new problem report at https://github.com/tesseract-ocr/tesseract/issues/2639

UsernameIsAlreadyTaken6 commented 4 years ago

So... what's the status of this issue?

wrznr commented 4 years ago

As far as I can see, the problem could not be reproduced.

amitdo commented 3 years ago

Related issues:

#633, #683, #1222, #1547, #1548, #1579, #1599, #2639, #3287.

amitdo commented 3 years ago

I closed all the other related issues that were still open.

chaintng commented 3 years ago

Hi, the problem can still easily be reproduced.

Please check this simple sample picture: [image]

It contains Thai and English text. All of the words are meaningful.

With the command below:

tesseract tha_eng.png tha_eng -l tha+eng

the English is not recognized correctly. This is the output:

สวัสดีครับ ทดสอบภาษาไทย กับ โ@รร๕6ล6%

I also played with the order (eng+tha), but it still doesn't work and gives this output:

ส ว ั ส ด ี ค ร ั บ ท ด ส อ บ ภา ษา ไท ย ก ั บ โ @ ร ร ๕ 6 ล 6%

This is the correct output that I expected:

สวัสดีครับ ทดสอบภาษาไทย กับ Tesseract

bertsky commented 3 years ago

@amitdo, could you please elaborate on why you've closed all these related issues? I cannot see any indication that they have been solved. AFAICT we cannot even be entirely sure these have exactly the same cause. There seems to be the aspect of different unicharsets, but also dictionaries and segmentation have been discussed.

(IMHO the correct procedure would be to wait for a solution, prove it on all the different reported scenarios / images and then close them, linking the commit/PR.)

wrznr commented 3 years ago

@chaintng Pls. check the following comparison:

kmw@lgg119:/tmp$ tesseract tha_eng.png - -l tha+eng
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 405
สวัสดีครับ ทดสอบภาษาไทย กับ โ@รร๕๑กล6%

kmw@lgg119:/tmp$ tesseract tha_eng.png - -l tha+Latin
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 405
สวัสดีครับ ทดสอบภาษาไทย กับ โ@รร๕๑กล6%

kmw@lgg119:/tmp$ tesseract tha_eng.png - -l Thai+Latin
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 405
ส ว ั ส ด ี ค ร ั บ ท ด ส อ บ ภา ษา ไท ย ก ั บ Tesseract

Tesseract's stock models following the ISO 639 naming scheme typically contain a word list. Since the word Tesseract is not in the eng dictionary, it receives a very low confidence score. If you use the dictionary-free script models (as in the third test), you end up with a much better result.
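Whether a word really gets a low confidence can be checked from Tesseract's TSV output, which carries a per-word `conf` column. The following is a rough sketch under that assumption: the function names and the threshold of 60 are arbitrary choices for illustration, not values taken from Tesseract or from this thread.

```python
import csv
import io
from typing import List, Tuple


def word_confidences(tsv_text: str) -> List[Tuple[str, float]]:
    """Return (word, confidence) pairs from Tesseract TSV output."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = row.get("conf")
        if not text or conf is None:
            continue  # structural rows (page, block, line) carry no text
        value = float(conf)
        if value >= 0:  # -1 marks rows without a recognised word
            pairs.append((text, value))
    return pairs


def low_confidence(tsv_text: str, threshold: float = 60.0) -> List[Tuple[str, float]]:
    """Words whose confidence falls below the (arbitrary) threshold."""
    return [(word, conf) for word, conf in word_confidences(tsv_text)
            if conf < threshold]
```

The TSV text can be produced with a command such as `tesseract tha_eng.png - -l tha+eng tsv` and passed to `low_confidence` as a string. Running it for each model combination shows whether the confidence of the misread word actually differs between them.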

amitdo commented 3 years ago

Hi @bertsky,

I had a feeling that someone will complain about closing these issues...

Note that some of them were closed already.

I did what I thought was the right thing to do. I think they are all highly related to each other and most probably a proper solution will solve them all.

If the maintainer(s) think(s) that these issues should be kept open, they can reverse my actions by reopening them.

bertsky commented 3 years ago

Hi @amitdo

> I had a feeling that someone will complain about closing these issues...

Oh, but that's my favourite! ... Seriously, just asking :smiley:

Maybe it does help to focus attention. But I do think we should revisit the other problem descriptions once we think we have the answer here.

Maybe someone could comb through all these direct observations w.r.t. the following aspects:

As I said, I don't think we absolutely need to find a single plausible explanation. Perhaps there are different causes. But it may help debugging to group them into possibly distinct sets of problems.

(And we might even see contradictory observations and still have a common cause: We must be wary that Tesseract is quite complex, so there could be compensatory mechanisms at work in any individual example. For example, we know that segmentation may sometimes separate lines horizontally – in turn also separating the sequence for the LSTM to beam-decode at once.)

@wrznr I would not jump to that conclusion TBH. Whether a word is in a language model (dict/dawg) does not change its score by that much. The issue seems to be more about different ranges of scores between models, or more likely, different ranges of scores between unicharsets (since as I pointed out earlier, the problem seems to be limited to characters from the non-first unicharset). In your concrete example, the most striking difference is that script/Thai does contain all Roman characters, while tha does not (so both models can compete on Tesseract in the last example, but not in the others).

amitdo commented 3 years ago

I labeled all these issues as 'multilingual ocr'.