tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

German - Characters added to result multiple times (aä / AÄ) #1060

Open TheSeiko opened 7 years ago

TheSeiko commented 7 years ago

tesseract 4.00.00alpha
 leptonica-1.74.1
  libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
Win10 64bit - built Uni Mannheim

deu.traineddata - Repeating of characters:

Current Behavior:

ÄGYPTEN -> ÄAGYPTEN
Grand-Prix -> Gräand-Prix
AUSTRALIEN -> AUSTRAÄLIEN
GROSSBRITANNIEN -> GROSSBRITAÄANNIEN

Expected Behavior:

ÄGYPTEN -> ÄGYPTEN
Grand-Prix -> Grand-Prix
AUSTRALIEN -> AUSTRALIEN
GROSSBRITANNIEN -> GROSSBRITANNIEN

Suggested Fix: 1 blob / 1 box should only be 1 outcome / 1 result

Additional Info: Example images are available for posting

TheSeiko commented 7 years ago

Additional example (wW) VW-Werk -> VwW-Werk

Shreeshrii commented 7 years ago

Please test with the latest models available in https://github.com/tesseract-ocr/tessdata/tree/master/best


amitdo commented 7 years ago

Suggested Fix: 1 blob / 1 box should only be 1 outcome / 1 result

1) It won't work with ligatures.
2) With the legacy OCR engine, there is a character segmentation step, and the OCR is done on individual char blobs. With the new LSTM engine, the OCR is done by the neural network on sequences of pixels in text lines, not on pre-segmented blobs.

amitdo commented 7 years ago

The fix for most problems with the LSTM engine is more / better training.

DAS2016 Slides, 6. Modernization Efforts, page 17: Encyclopedia -> EE-n-c-yy-c-l-o-p-e-d-i-a -> Encyclopedia

I think that for 'in dictionary' words this kind of duplication would be eliminated.
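
The collapse step in the slide example can be sketched as follows. This is only an illustration of CTC-style greedy decoding, assuming the '-' in the slide stands for the blank label; it is not Tesseract's actual decoder, and the function name is hypothetical.

```python
from itertools import groupby

BLANK = "-"  # assumption: the slide's '-' plays the role of the CTC blank label


def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    """Merge runs of identical per-frame labels, then drop the blanks."""
    merged = (label for label, _ in groupby(frames))
    return "".join(label for label in merged if label != blank)


print(ctc_collapse("EE-n-c-yy-c-l-o-p-e-d-i-a"))  # Encyclopedia
# Two *different* labels emitted for one glyph are not merged:
print(ctc_collapse("\u00c4A-G-Y"))
```

Only identical neighbouring labels are merged, so if the network emits both Ä and A for a single glyph, both survive the collapse. That would be consistent with outputs such as ÄAGYPTEN, though it is a guess at the mechanism rather than a diagnosis.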

amitdo commented 7 years ago

Similar issues: #884 #1011

TheSeiko commented 7 years ago

@Shreeshrii

https://github.com/tesseract-ocr/tessdata/tree/master/best is not working at all.

deu.traineddata: 19.721 KB
best deu.traineddata: 8.427 KB

The best traineddata only delivers empty results.

Shreeshrii commented 7 years ago

Are you using --oem 1?

You can see the contents of the traineddata by unpacking it:

combine_tessdata -u deu.traineddata deu.

These are probably only lstm models and do not have the legacy engine which is used via --oem 0

Shreeshrii commented 7 years ago

@stweil Have you tested the deu model?

TheSeiko commented 7 years ago

Yes, I'm using --oem 1

I'm just switching deu.traineddata in tessdata. The old one works without problems; the new one gives empty output. I checked the download: the downloaded file has the same size as in the repository.

Shreeshrii commented 7 years ago

@stweil - you may need to update the windows binaries on Uni Mannheim site with the latest updates from Ray.

@TheSeiko I haven't personally tested the deu model. Will check and post the result. Wondering whether your Windows binary is old...

Shreeshrii commented 7 years ago

Looks like you need both deu and frk models

wget -O ./tess4data-save/deubest.traineddata https://github.com/tesseract-ocr/tessdata/blob/master/best/deu.traineddata?raw=true

sudo cp ./tess4data-save/*.traineddata /usr/share/tesseract-ocr/4.00/tessdata

time tesseract ./tif/phototest.tif stdout --oem 1 -l deu
time tesseract ./tif/phototest.tif stdout --oem 1 -l deubest

Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
real    0m1.633s
user    0m2.032s
sys 0m0.492s
Error opening data file /usr/share/tesseract-ocr/4.00/frk.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'frk'
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
Jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
real    0m2.045s
user    0m2.744s
sys     0m0.612s

Shreeshrii commented 7 years ago

works on linux - looks for frk traineddata, probably listed in deu.config

stweil commented 7 years ago

Have you tested the deu model?

The new best one? No, I have not tested it yet. I am currently focused on Fraktur where the new results clearly beat the old ones.

[...] update the windows binaries on Uni Mannheim site [...]

I noticed on Linux that "old" Tesseract executables crash with the new traineddata, so I expect that my current Windows binaries would crash, too. Building new ones is on my list.

TheSeiko commented 7 years ago

thank you

stweil commented 7 years ago

[...] update the windows binaries on Uni Mannheim site [...]

The new binaries are now available. I now use semantic versioning, so this is my 4.0.0-alpha.20170804.

Shreeshrii commented 7 years ago

Thanks!

I now use semantic versioning, so this is my 4.0.0-alpha.20170804.

:-)

TheSeiko commented 7 years ago

Thank you for the new binaries.

There are still similar errors:

hitzefrei -> 1 x hitzefreii / 1 x hitzefreil

Suggestion: The results are a lot better with the 4.0 LSTM engine than with 3.05.01, but training seems to be difficult. Maybe it would be a good idea to offer a web page where people could upload example image files and matching text files to include them in the training process.

stweil commented 7 years ago

looks for frk traineddata, probably listed in deu.config

@theraysmith, best/deu.traineddata includes a deu.config with tessedit_load_sublangs frk. Why was this dependency added? It is confusing for end users who want to use -l deu that they need frk.traineddata, too.

@TheSeiko, maybe you'd get better results for Antiqua text without that frk dependency (which might be good for texts which also include Fraktur). You can use combine_tessdata to extract the components of best/deu.traineddata, remove deu.config and combine the remaining components again in a new file.
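
The extract / remove / recombine steps described above could be scripted roughly like this. It is a sketch under assumptions: combine_tessdata is on the PATH, the paths best/deu.traineddata and tmp are placeholders, and the helper name rebuild_without_config is hypothetical. By default it only returns the commands (dry run) so they can be inspected first.

```python
from pathlib import Path
import subprocess


def rebuild_without_config(traineddata: Path, workdir: Path,
                           lang: str = "deu", dry_run: bool = True) -> list:
    """Unpack a traineddata file, drop <lang>.config, and recombine the rest."""
    # Trailing dot: the unpacked components are named <lang>.<component>.
    prefix = str(workdir / f"{lang}.")
    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]
    repack = ["combine_tessdata", prefix]  # writes <workdir>/<lang>.traineddata
    if not dry_run:
        # Check the input ourselves instead of relying on the tool to complain.
        if not traineddata.is_file():
            raise FileNotFoundError(traineddata)
        workdir.mkdir(parents=True, exist_ok=True)
        subprocess.run(unpack, check=True)
        (workdir / f"{lang}.config").unlink(missing_ok=True)
        subprocess.run(repack, check=True)
    return [unpack, repack]


# Dry run: print the commands that would be executed (placeholder paths).
for cmd in rebuild_without_config(Path("best/deu.traineddata"), Path("tmp")):
    print(" ".join(cmd))
```

The recombined file ends up next to the unpacked components and would still need to be copied into the tessdata directory, ideally under a different name so the original stays available.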

TheSeiko commented 7 years ago

Thank you for the tip. Much appreciated!

TheSeiko commented 7 years ago

@stweil Am I doing something wrong?

There's only a version file included in the deu.traineddata when using the binaries from 04.08:

E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu.
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

E:\Tesseract-OCR4.0a2>combine_tessdata -d deu.traineddata
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

theraysmith commented 7 years ago

Looks like I introduced a bug. If the traineddata file doesn't exist, it makes an empty one with a version string in it, instead of complaining about the non-existent file.


Shreeshrii commented 7 years ago

Ray, there have been a number of reports of people not being able to run the English tutorial training (missing eng.config etc.), posted in the tesseract-ocr forum.


theraysmith commented 7 years ago

I just made 3 commits that address some of these issues:
1) Error message for lack of --traineddata arg, referring to the wiki.
2) Emphasis that the lack of a config file is just a warning.
3) Detection of a non-existent traineddata file in combine_tessdata.

It seems the majority of the problems are lack of sync of code/data. There are dependencies between code and data that have changed due to moving the unicharset from the LSTM model to the traineddata file.


Shreeshrii commented 7 years ago

It seems the majority of the problems are lack of sync of code/data. There are dependencies between code and data that have changed due to moving the unicharset from the LSTM model to the traineddata file.

Yes. That is the problem.

One possible solution that I have been asking for for a while is the tagging of "important" commits. Then it would be easy to say: use tesseract, tessdata, langdata as of 4.0.0alpha-20170807.

TheSeiko commented 7 years ago

@stweil thank you, removing deu.config helped a lot


Regarding the best deu traineddata without deu.config:

After ~50k test images: great recognition rate.

Only problem so far: sometimes i is not recognised properly:

sıch - sich
Parıs - Paris

I'm adding a regex to replace ı with i
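
A minimal sketch of such a replacement, assuming the input is German text where the dotless ı (U+0131) never occurs legitimately (it would corrupt Turkish or Azerbaijani text). The function name is made up for illustration.

```python
import re

# U+0131 LATIN SMALL LETTER DOTLESS I, written as an escape so it stays visible.
DOTLESS_I = re.compile("\u0131")


def fix_dotless_i(text: str) -> str:
    """Replace every dotless i with a regular i."""
    return DOTLESS_I.sub("i", text)


print(fix_dotless_i("s\u0131ch"))   # sich
print(fix_dotless_i("Par\u0131s"))  # Paris
```

For a single character, `text.replace("\u0131", "i")` does the same job; a compiled pattern mainly pays off once the rule lives alongside other regex-based corrections.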

TheSeiko commented 7 years ago

and j -> J

OCR Result <-> Text in image
Jungen - jungen
Jury - jury
Juries - juries

TheSeiko commented 7 years ago

$$-Jährige <-> $$-jährige
SPO - SPÖ

theraysmith commented 7 years ago

Interesting. I remember from learning German at school that all nouns begin with a capital, so why do yours not? I would assume from the errors that you describe that the network has learned that all nouns begin with a capital, so it hallucinates one even when it is not there. If you have a lot of non-capital nouns for some reason, it might do better in 'Latin' than 'deu'


TheSeiko commented 7 years ago

@theraysmith jährige/Jährige can be either a noun (capital letter) or an adjective (lowercase):
a 42 year old man - ein 42-jähriger Mann
a 42 year old - ein 42-Jähriger

Latin works better for this problem; I ran it yesterday on ~100k frames. Latin has some problems with mutated vowels (umlauts), e.g.:

+------------+---------+-------------------+-------------------+---------------------+
| languageId | ranking | replaceTo         | replaceRegex      | inputDate           |
+------------+---------+-------------------+-------------------+---------------------+
| 10         | 10      | Österreich        | Osterreich        | 2017-08-03 14:45:05 | - DEU without FRAK
| 10         | 10      | Paris             | Parıs             | 2017-08-08 10:52:04 | - DEU without FRAK
| 10         | 10      | i                 | ı                 | 2017-08-08 13:04:50 | - DEU without FRAK
| 10         | 10      | ÖFB-Goalie        | OFB-Goalie        | 2017-08-08 14:20:24 | - DEU without FRAK
| 10         | 10      | Volkspartei       | Volksparte!l      | 2017-08-09 09:14:11 | - LATIN
| 10         | 10      | Eurofighter-Übung | Eurofighter-Ubung | 2017-08-09 09:34:09 | - LATIN
| 10         | 10      | Überlebende       | Uberlebende       | 2017-08-10 08:08:04 | - LATIN
| 10         | 10      | Eine              | Fine              | 2017-08-10 09:04:31 | - LATIN
| 10         | 10      | Oberwölz          | Oberwòölz         | 2017-08-10 10:31:30 | - LATIN
| 10         | 10      | Wörter            | Wõōrter           | 2017-08-10 14:25:23 | - LATIN
| 10         | 10      | Wörter            | Wōörter           | 2017-08-10 14:25:51 | - LATIN
| 10         | 10      | Wörter            | Wōrter            | 2017-08-10 14:26:45 | - LATIN
| 10         | 10      | Männer            | Māänner           | 2017-08-10 15:04:25 | - LATIN
+------------+---------+-------------------+-------------------+---------------------+
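
A table like this can be applied as an ordered list of regex rules after OCR. The sketch below copies three rows from the table; the word-boundary anchors, the rule order and the function name apply_rules are my own assumptions, not how the original pipeline necessarily works.

```python
import re

# (replaceRegex, replaceTo) pairs taken from the table; the \b anchors are an
# added assumption so that a rule only fires on whole words.
RULES = [
    (r"\bOsterreich\b", "Österreich"),
    (r"\bUberlebende\b", "Überlebende"),
    (r"\bOFB-Goalie\b", "ÖFB-Goalie"),
]

COMPILED = [(re.compile(pattern), repl) for pattern, repl in RULES]


def apply_rules(text: str) -> str:
    """Apply every correction rule in list order."""
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return text


print(apply_rules("Jeder dritte Unfall in Osterreich"))
```

Rules such as Fine -> Eine are riskier than the umlaut fixes because the misread form can also be a legitimate word, so they probably deserve extra context in the pattern.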

I've collected some example images and I'll try to do the "Fine Tuning Training"

TheSeiko commented 7 years ago

| 10 | 10 | Arzl-Ost | Arzl-0Ost | 2017-08-11 09:34:41 | - LATIN
| 10 | 10 | Ein | Fin | 2017-08-11 09:35:26 | - LATIN
| 10 | 10 | Während | Wāährend | 2017-08-11 09:37:20 | - LATIN
| 10 | 10 | Oscarprämierter | Oscarprāmierter | 2017-08-11 10:02:07 | - LATIN

TheSeiko commented 7 years ago

| 10 | 1502045216726 | Oberwölz | Oberwõlz
| 10 | 1502047625611 | Militärbasis | Militārbasis
| 10 | 1502057831054 | www.uncut.at | WWWw.uncut.at
| 10 | 1502099066258 | Wörter | Wõörter
| 10 | 1502269194006 | Donaupark zum Gratis | Donaupark zum 6Gratis

stweil commented 3 years ago

@TheSeiko, do you have example images which still show this issue? We need them to test a bug fix which was suggested in #3144.

TheSeiko commented 3 years ago

@stweil I'll post some example images asap. It got a lot better but still happens.

TheSeiko commented 3 years ago

One thing I've found out is that sometimes the dots of an umlaut (e.g. Ö) are assigned to the previous line. So the Ö is recognised as an O and the previous line has dots added. Sometimes the reason for this is that the previous line has a different character size than the following paragraph. But this is only one case.

amitdo commented 3 years ago

One thing I've found out is that sometimes the points, i.e. Ö are used with the previous line. So the Ö is recognised as an O and the previous line has points added. Sometimes the reason for this is that the previous line has a different character size than the following paragraph. But this is only one case.

This looks like a different issue from the original one.

TheSeiko commented 3 years ago

C:\Tesseract-OCR20200328>tesseract --version
tesseract v5.0.0-alpha.20200328
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
 Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
 Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0


C:\Tesseract-OCR20190314>tesseract --version
tesseract v4.0.0.20190314
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
 Found AVX2
 Found AVX
 Found SSE

TheSeiko commented 3 years ago

pathTesseract: C://Tesseract-OCR20190314/tesseract
FrameType: CENTER_SMALL_COLOR
ForDeleting: false
FrameColor: BLUE_023_060_211

Österreich Jeder dritte Fuf&gaánger-Unfall in Osterreich

passiert auf den Schutzwegen.

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: CENTER_SMALL_COLOR
ForDeleting: false
FrameColor: BLUE_023_060_211

Österreich Jeder dritte Fußgänger-Unfall in Osterreich

passiert auf den Schutzwegen.

20201125160529375_1598603314737_bottom

TheSeiko commented 3 years ago

@stweil - grófste

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: RIGHT_0970_COLOR
ForDeleting: false
FrameColor: CYAN_008_107_102

Fjorde

Der Kangertittivaq ist das grófste Fjord- system der Welt.

Der längste Fjord erstreckt sich über fast 350 Kilometer an Grönlands Ostküste.

20201125170041216_1577519184094_main

TheSeiko commented 3 years ago

@stweil ÓFB-Legionaàr

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: CENTER_WHITE
ForDeleting: false
FrameColor: WHITE

Premier League Die ,Reds" haben weiter eine makellose Bilanz und sind klarer Tabellenführer.

Bei Watford feiert ÓFB-Legionaàr Pródl beim 0:0 gegen Sheffield den ersten Liga-Einsatz seit einem Jahr. Watford bleibt weiter

20201125180010431_1570300709269_main

TheSeiko commented 3 years ago

@stweil A^4

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: CENTER_SMALL_COLOR
ForDeleting: false
FrameColor: BLUE_023_060_211

Ungarn/Üsterreich Lebenslang für die vier Hauptangeklagten

nach dem A^4-Flüchtlingsdrama.

20201125185627625_1561026929372_bottom

TheSeiko commented 3 years ago

@stweil 4 comes from nowhere

pathTesseract: C://Tesseract-OCR20200328/tesseract
FrameType: LEFT_1080_COLOR
ForDeleting: false
FrameColor: RED_215_023_020

Kinoabende Architektur und Urbanismus in Zeiten des Klimawandels.

4

„Die Zukunft reparieren‘ ist das Thema dieses Architekturfilmfestivals.

20201125190031442_1566254197158_main