Open roozgar opened 5 years ago
I can confirm this. I see similar behavior with German.
@chrys87, which models did you use? Can you add an example image, so it is possible to reproduce the issue?
Another issue regarding multi-language recognition is reported in forum at https://groups.google.com/d/msgid/tesseract-ocr/66e7ba26da873cc265cf82f0c65fbe69%40posteo.net
I don't have a special image. I created a tool that takes a screenshot of the current window and runs OCR on it: https://github.com/chrys87/ocrdesktop
I use the languages -l deu+eng. To reproduce it, just take a screenshot and run tesseract -l deu+eng screenshot.png
It recognizes German special characters like ÄÖÜäöüß badly.
I attached a simple example screenshot (made with LO Writer). Here is my output:
13:26 [chrys@blackbeast Bilder] :) $ tesseract Screenshot_tesseract.png test -l eng+deu
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
13:26 [chrys@blackbeast Bilder] :) $ cat test.txt
Das ist ein Test OSA46
o1
später
Spaß
13:27 [chrys@blackbeast Bilder] :) $ tesseract Screenshot_tesseract.png test -l deu+eng
Tesseract Open Source OCR Engine v4.1.0 with Leptonica
13:27 [chrys@blackbeast Bilder] :) $ cat test.txt
Das ist ein Test Ö5Ääß
Öl
später
Spaß
Correct would be:
Das ist ein Test ÖöÄäß
Öl
später
Spaß
As the reporter wrote, it doesn't work at all for eng+deu, and it is inaccurate for deu+eng (IMO, given that there are only a handful of words).
Here is the screenshot:
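To put a number on "inaccurate", the outputs quoted above can be scored against the expected text with a plain character-level edit distance. This is only a minimal sketch: the helper names are made up for illustration, and the strings are copied from the transcript above.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # Normalize so that precomposed and decomposed umlauts compare equal.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Expected text and the two outputs quoted in the comment above.
EXPECTED = "Das ist ein Test ÖöÄäß\nÖl\nspäter\nSpaß"
OUTPUTS = {
    "eng+deu": "Das ist ein Test OSA46\no1\nspäter\nSpaß",
    "deu+eng": "Das ist ein Test Ö5Ääß\nÖl\nspäter\nSpaß",
}


def score_outputs(expected, outputs):
    """Map each language spec to its number of character errors."""
    return {spec: edit_distance(expected, text) for spec, text in outputs.items()}


if __name__ == "__main__":
    for spec, errors in score_outputs(EXPECTED, OUTPUTS).items():
        print(f"{spec}: {errors} character errors")
```

With only a handful of words the numbers are not statistically meaningful, but they make it easier to compare runs across models.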
@chrys87: Can you reply to Stefan's question? Did you try the instructions provided on the wiki?
Who is Stefan? Did I miss a question? No, I didn't try them, as there is no issue with scanning or similar (it's a screenshot). Alpha is removed in OCRdesktop.
With version 3.x it works perfectly in those simple situations.
Of course it is logical to me that a screenshot creates noise. But this noise is also helpful to blind users, as it can indicate an arrow (like a menu) or symbols for check boxes. The screenshot above doesn't contain anything like that, though.
See also #1579
... and #683
Copying Ray's comment from https://github.com/tesseract-ocr/langdata/issues/83#issuecomment-375027879
I did have an idea for a better multi-language implementation that would cleanly use models from multiple languages at once, but that depends on getting rid of the old code, and moving the multi-language functionality into the beam search. Until the old code is gone, that would be very messy.
@stweil @noahmetzger @bertsky Is this something that can be done to improve multi language recognition?
with version 3.X it works perfectly in those simple situations.
Versions >=4.0 still support the old OCR engine. You can use it by passing --oem 0 with an old traineddata.
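For anyone scripting this, here is a minimal sketch of how such a call could be assembled. It assumes a traineddata file that still contains the legacy model is installed; the function name, image name and directory are placeholders.

```python
import shutil


def legacy_command(image, languages, tessdata_dir=None):
    """Build a tesseract command line that selects the legacy engine (--oem 0)."""
    cmd = ["tesseract", image, "-", "-l", "+".join(languages), "--oem", "0"]
    if tessdata_dir is not None:
        # Point tesseract at a directory holding traineddata with legacy models.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


if __name__ == "__main__":
    cmd = legacy_command("screenshot.png", ["deu", "eng"])
    print(" ".join(cmd))
    if shutil.which("tesseract") is None:
        print("tesseract is not on PATH; the command was only printed")
```

If the installed traineddata has no legacy component, loading fails, so pointing --tessdata-dir at a separate directory with older data is one way to test both engines side by side.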
I will give it a shot and reply :).
@Shreeshrii, I have a rough idea what is meant by that and yes, this is something worthwhile doing. But please keep in mind that the existing multi-model/language code does work very well with LSTM models already, even with many at once!
Using --oem 0 seems to be a lot more accurate here for umlauts like äÄöÖüÜß. Edit: I just played around with that a little more; yes, it's a lot more accurate than without --oem 0.
We should definitely try to find the error in the existing code first, before we write new multi-language implementation within the beam search itself.
At a glance, it seems this problem is somewhat restricted to combinations which have dissimilar (although overlapping) unicharsets. Can anyone confirm that? E.g. replacing eng with Latin when combining with deu, does the umlaut problem go away? Or replacing eng with Arabic when combining with ara, do the invalid characters disappear?
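One way to check the unicharset hypothesis is to compare the character sets of the two models directly. The sketch below assumes the unicharsets have already been unpacked to text files (for example with combine_tessdata -u) and that the first line of each file holds the entry count while every following line starts with the character itself. The function names and sample data are made up for illustration.

```python
def parse_unicharset(text):
    """Return the set of characters listed in a unicharset text file."""
    entries = set()
    # Skip the first line (entry count); take the first field of each entry.
    for line in text.splitlines()[1:]:
        token = line.split(" ", 1)[0]
        if token:
            entries.add(token)
    return entries


def compare_unicharsets(first, second):
    """Split two character sets into shared and model-specific parts."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


if __name__ == "__main__":
    # Tiny synthetic samples standing in for the real unpacked files.
    deu_like = "4\nNULL 0\na 3\nä 3\nß 3\n"
    eng_like = "3\nNULL 0\na 3\nb 3\n"
    report = compare_unicharsets(parse_unicharset(deu_like),
                                 parse_unicharset(eng_like))
    for key, chars in report.items():
        print(key, sorted(chars))
```

Characters that show up only in the second model's set would be the ones to watch in the misrecognized output.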
@chrys87 I used --oem 0 and got "Failed loading language". Which language data did you use? To get better accuracy, did you compare with https://github.com/tesseract-ocr/tessdata_best ?
@chrys87 I used --orm 0 and got "Failed loading language". Which language data did you use? To get better accuracy, did you compare with https://github.com/tesseract-ocr/tessdata_best ?
I used --oem 0, not --orm 0, just to be sure :). My bad, I do not know what is shipped by default in my distro; it's Arch Linux.
Are you talking about tesseract-data-lat? I tried this: deu+lat is just as bad as deu+eng. Using -l deu (without +eng) improves the situation slightly but, as noted above, it is a lot more accurate with --oem 0 than without.
By the way, some system information (:
16:37 [chrys@blackbeast ocrdesktop] master :( $ uname -a
Linux blackbeast 5.2.9-arch1-1-ARCH #1 SMP PREEMPT Fri Aug 16 11:29:43 UTC 2019 x86_64 GNU/Linux
16:37 [chrys@blackbeast ocrdesktop] master :) $ cat /etc/lsb-release
LSB_VERSION=1.4
DISTRIB_ID=Arch
DISTRIB_RELEASE=rolling
DISTRIB_DESCRIPTION="Arch Linux"
16:36 [chrys@blackbeast ocrdesktop] master :( $ tesseract -v
tesseract 4.1.0
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.2) : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3
Found AVX
Found SSE
Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.1 libzstd/1.4.0
16:40 [chrys@blackbeast ocrdesktop] master :( $ tesseract --list-langs
List of available languages (4):
deu
eng
lat
osd
No, not at all. tesseract-data-lat would be the Latin language, but I was referring to the Latin script model Latin.traineddata, which (on Debian/Ubuntu) is in the package tesseract-ocr-script-latn.
@chrys87: Stefan is @stweil ;-) I am sorry I did not make that clear. He was asking which model you used...
By referring to the wiki, I meant that you should focus on image preprocessing. BTW: there is a similar tool for Windows, Capture2Text, which provides the following result for German:
It seems to use 3.x tessdata (from 2015?) with 4.00alpha; I did not investigate it deeply. It is Qt-based, so I expect that with some adaptation it should be possible to run it on Linux...
ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l eng
Das ist ein Test OSA4G
o1
spiter
Spas
ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l deu
Das ist ein Test Ö5Ääß
Öl
später
Spaß
ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l eng+deu
Das ist ein Test Ö5Ääß
Öl
später
Spaß
ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l deu+eng
Das ist ein Test Ö5Ääß
Öl
später
Spaß
ubuntu@tesseract-ocr:~/TEST$ tesseract deu.png - -l script/Latin
Das ist ein Test ÖöÄäß
Öl
später
Spaß
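A comparison like the one above is easy to repeat on other images if the language specs are generated rather than typed by hand. The following is a rough sketch only: it lists every model on its own plus every ordering of the combination, and builds the matching command lines. The helper names are illustrative, and script/Latin is appended by hand, as in the runs above.

```python
import shutil
import subprocess
from itertools import permutations


def language_specs(models):
    """Return each model alone, then every ordering of the full combination."""
    singles = list(models)
    combos = ["+".join(p) for p in permutations(models)] if len(models) > 1 else []
    return singles + combos


def build_command(image, spec):
    """Command line that prints the recognized text to stdout."""
    return ["tesseract", image, "-", "-l", spec]


def run_all(image, specs):
    """Run tesseract once per spec and collect the text output."""
    results = {}
    for spec in specs:
        done = subprocess.run(build_command(image, spec),
                              capture_output=True, text=True)
        results[spec] = done.stdout
    return results


if __name__ == "__main__":
    specs = language_specs(["eng", "deu"]) + ["script/Latin"]
    for spec in specs:
        print(" ".join(build_command("deu.png", spec)))
    if shutil.which("tesseract") is None:
        print("tesseract is not on PATH; commands were only printed")
```

Running both orderings on the same image is what shows whether the order of sub-languages matters for a given set of models.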
i used --oem 0 and got "Failed loading language"
@roozgar That is the case for RTL and Indic languages, since their legacy models were dependent on cube related code which has been removed from tesseract 4.
ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l eng
S as TFT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKata); total, aggregate, product.
H &.) UIT aigun [S. FT-HTOT], s.m. Unskilfulness, stupidity, &c.=augun,
q.v.
A Lliyal gyal, uyyal, sam. Stag; deer, hart; wild goat.
SM) TdT ela, s.f. Cardamoms. (See ilacr.)
H ay) SATH lam, s.m. Auction, public sale (=lilam, nilam, q.q.v.).
P _=Ll elér, s.m. Ambassador, envoy, delegate, agent:—el¢t karna, To
ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l eng+ara
5 ايكيه TFT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKata); total, aggregate, product.
H ايكن UIT aigun [S. FT-HTOT], s.m. Unskilfulness, stupidity, &c.=augun,
q.v.
A زوم ايل gyal, uyyal, sam. Stag; deer, hart; wild goat.
5 ايلا TdT ela, s.f. Cardamoms. (See ilacr.)
H ايلام SATH lam, s.m. Auction, public sale ) 1712771, nilam, q.q.v.).
P ماه ايلجى s.m. Ambassador, envoy, delegate, agent:—el¢t karna, To
ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l script/Arabic
S ایکیه UAT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKatda); total, aggregate, product.
H این WI] aigunı [S. 3TH], s.m. Unskilfulness, stupidity, &c.=augun,
q.V.
A Jel Fy]. yal, uyyal, s.m. Stag; deer, hart; wild goat.
S YY TT ela, s.f. Cardamoms. (See ilac.)
H ايام ŠTX lam, s.m. Auction, public sale (=IHldam, nilam, q.q.v.).
P zlyl elér, s.m. Ambassador, envoy, delegate, agent:—elcr karnd, To
ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l eng+hin
S as TFT aikya, s.m. Oneness, unity, singleness, identity, sameness,
harmony (=eKata); total, aggregate, product.
H 0६ ऐसुण aigun [S. अव+्गुण], s.m. Unskilfulness, stupidity, &c.=augun,
q.v.
A (3 कण, gyal, uyyal, sam. Stag; deer, hart; wild goat.
S अ.। TAT ela, s.f. Cardamoms. (See ilaci.)
H ay) SATH lam, s.m. Auction, public sale (=lilam, nilam, q.q.v.).
P _=Ll टी, s.m. Ambassador, envoy, delegate, agent:—el¢7 karna, To
ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l script/Devanagari
S Sal ऐक्य aikya, S.m. Oneness, unity, singleness, identity, sameness,
harmony (=eRata); total, aggregate, product.
पत [6 ऐगुण aig [S. अव+गुण], ऽ.m. Unskilfulness, stupidity, &c.=augun,
q.v.
A Jl Fa] ayal, uyyal, s.m. Stag; deer, hart; wild goat.
S X| एला ९la, s.f. Cardamoms. (See ildct.)
H ~| इलाम am, s.m. Auction, public sale (=lildm, nildm, q.q.v.).
P LI elect, s.m. Ambassador, envoy, delegate, agent:—elct Karna, To
ubuntu@tesseract-ocr:~/TEST$ tesseract mult.png - -l script/Devanagari+ara
5 ايكيه ऐक्य aikya, S.m. Oneness, unity, singleness, identity, sameness,
harmony (=eRata); total, aggregate, product.
11 ايكن ऐगुण سروه ]5. अव+गुण], ऽ.m. Unskilfulness, stupidity, &c.=augun,
q.v.
A ايل Fa] ayal, uyyal, s.m. Stag; deer, hart; wild goat.
5 ايلا एला ,م1 5.1. Cardamoms. (See ildct.)
11 ايلام इलाम am, s.m. Auction, public sale (=lildm, nildm, q.q.v.).
P ايلجى elect, s.m. Ambassador, envoy, delegate, agent:—elct Karna, 10 -
@bertsky Result is better for script/Latin than eng+deu in the example screenshot of German text with umlauts.
@Shreeshrii Yes, I used it for Arabic. There are three sets of language data available on GitHub, and I don't know which is better. Their results differ in some cases and I can't say which one is better. For example, ara from (1) detects ',' correctly, but (2), which should be the best, detects ',' as '«'.
The files I tested are these: 1) https://github.com/tesseract-ocr/tessdata 2) https://github.com/tesseract-ocr/tessdata_best 3) https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
@Shreeshrii, so your tests (just like my own) do not confirm the observations by @chrys87 that results depend on the order of sub-languages/models (as in eng+deu vs deu+eng). But maybe this is a matter of what repository the models are from (tessdata, tessdata_fast, tessdata_best)?
Also, I cannot see any invalid characters anywhere, as was claimed by the OP. It's just misrecognized characters, and only from the languages that were actually loaded. In the German case, it's only a minor error (ö does look somewhat similar to 5) in a random, highly idiomatic string. Even if the old model happens to do better here, I don't think this could be called a regression at all. (IIRC, the old models are better in a statistically significant way under some conditions, but are generally outperformed by the new ones.)
And of course, users can always add dictionaries or query alternative symbols (the latter only via API).
@bertsky new problem report at https://github.com/tesseract-ocr/tesseract/issues/2639
So... what's the status of this issue?
As far as I can see, the problem could not be reproduced.
Related issues:
I closed all the other related issues that were still open.
Hi, the problem is still easy to reproduce.
Please check this simple sample picture.
It contains Thai and English text. All of it consists of meaningful words.
I ran the command below:
tesseract tha_eng.png tha_eng -l tha+eng
It doesn't convert the English correctly. This is the output:
สวัสดีครับ ทดสอบภาษาไทย กับ โ@รร๕6ล6%
I also played with the order (eng+tha),
but it still doesn't work. This is the output:
ส ว ั ส ด ี ค ร ั บ ท ด ส อ บ ภา ษา ไท ย ก ั บ โ @ ร ร ๕ 6 ล 6%
This is the correct output that I expected:
สวัสดีครับ ทดสอบภาษาไทย กับ Tesseract
@amitdo, could you please elaborate on why you've closed all these related issues? I cannot see any indication that they have been solved. AFAICT we cannot even be entirely sure these have exactly the same cause. There seems to be the aspect of different unicharsets, but also dictionaries and segmentation have been discussed.
(IMHO the correct procedure would be to wait for a solution, prove it on all the different reported scenarios / images and then close them, linking the commit/PR.)
@chaintng Pls. check the following comparison:
kmw@lgg119:/tmp$ tesseract tha_eng.png - -l tha+eng
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 405
สวัสดีครับ ทดสอบภาษาไทย กับ โ@รร๕๑กล6%
kmw@lgg119:/tmp$ tesseract tha_eng.png - -l tha+Latin
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 405
สวัสดีครับ ทดสอบภาษาไทย กับ โ@รร๕๑กล6%
kmw@lgg119:/tmp$ tesseract tha_eng.png - -l Thai+Latin
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 405
ส ว ั ส ด ี ค ร ั บ ท ด ส อ บ ภา ษา ไท ย ก ั บ Tesseract
Tesseract's stock models following the ISO 639 naming scheme typically contain a word list. Since the word Tesseract is not in the eng dictionary, it receives a very low confidence score. If you use the dictionary-free script models (as in the third test), you end up with a much better result.
Hi @bertsky,
I had a feeling that someone will complain about closing these issues...
Note that some of them were closed already.
I did what I thought was the right thing to do. I think they are all highly related to each other and most probably a proper solution will solve them all.
If the maintainer(s) think(s) that these issues should be kept open, they can reverse my actions by reopening them.
Hi @amitdo
I had a feeling that someone will complain about closing these issues...
Oh, but that's my favourite! ... Seriously, just asking :smiley:
Maybe it does help to focus attention. But I do think we should revisit the other problem descriptions once we think we have the answer here.
Maybe someone could comb through all these direct observations w.r.t. the following aspects:
As I said, I don't think we absolutely need to find a single plausible explanation. Perhaps there are different causes. But it may help debugging to group them into possibly distinct sets of problems.
(And we might even see contradictory observations and still have a common cause: We must be wary that Tesseract is quite complex, so there could be compensatory mechanisms at work in any individual example. For example, we know that segmentation may sometimes separate lines horizontally – in turn also separating the sequence for the LSTM to beam-decode at once.)
@wrznr I would not jump to that conclusion TBH. Whether a word is in a language model (dict/dawg) does not change its score by that much. The issue seems to be more about different ranges of scores between models, or more likely, different ranges of scores between unicharsets (since as I pointed out earlier, the problem seems to be limited to characters from the non-first unicharset). In your concrete example, the most striking difference is that script/Thai does contain all Roman characters, while tha does not (so both models can compete on Tesseract in the last example, but not in the others).
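The dictionary question can be separated from the unicharset question by rerunning the same model combination with the word lists switched off. A minimal sketch, assuming the load_system_dawg and load_freq_dawg config variables are honoured when passed with -c on the command line; the helper name is illustrative.

```python
import shutil


def dictionary_free_command(image, spec):
    """Build a tesseract call with the system and frequent-word dawgs disabled."""
    return [
        "tesseract", image, "-", "-l", spec,
        "-c", "load_system_dawg=0",  # do not load the main word list
        "-c", "load_freq_dawg=0",    # do not load the frequent-word list
    ]


if __name__ == "__main__":
    print(" ".join(dictionary_free_command("tha_eng.png", "tha+eng")))
    if shutil.which("tesseract") is None:
        print("tesseract is not on PATH; the command was only printed")
```

If the output for tha+eng stays the same with the dictionaries disabled, that would point away from the word list and towards the unicharsets.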
I labeled all these issues as 'multilingual ocr'.
I'm testing Tesseract for Arabic. The text in my image is Arabic and English. When I use eng+ara on my image, it shows a lot of invalid English characters, but when I use ara+eng it works correctly, though with very low accuracy.
Is this normal, or must I do something before starting the scan?
Environment