Open TheSeiko opened 7 years ago
Additional example (wW) VW-Werk -> VwW-Werk
Please test with the latest models available in https://github.com/tesseract-ocr/tessdata/tree/master/best
ShreeDevi
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tesseract-ocr/tesseract/issues/1060#issuecomment-319419707, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_o9N-1P4rP3fa2mYMcL4zS0Z8LMEYks5sT08WgaJpZM4Op9eR .
Suggested Fix: 1 blob / 1 box should only be 1 outcome / 1 result
1) It won't work with ligatures.
2) With the legacy OCR engine, there is a character segmentation step, and the OCR is done on individual char blobs.
With the new LSTM engine, the OCR is done by the neural network on a sequence of pixels in each text line, not on pre-segmented blobs.
The fix for most problems with the LSTM engine is more / better training.
DAS2016 Slides, 6. Modernization Efforts, page 17: Encyclopedia -> EE-n-c-yy-c-l-o-p-e-d-i-a -> Encyclopedia
I think that for 'in dictionary' words this kind of duplication would be eliminated.
Similar issues: #884 #1011
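To illustrate the slide example above, here is a minimal sketch of the collapse step. It assumes a simplified greedy CTC-style decoding in which '-' stands for the blank symbol; the function name ctc_collapse is made up for this illustration, and this is not Tesseract's actual decoder.

```python
from itertools import groupby


def ctc_collapse(frames: str, blank: str = "-") -> str:
    """Collapse a per-frame label string into the final text.

    Step 1: merge runs of identical consecutive symbols ("EE" -> "E").
    Step 2: drop the blank symbol (assumed here to be "-").
    """
    merged = (symbol for symbol, _run in groupby(frames))
    return "".join(symbol for symbol in merged if symbol != blank)


print(ctc_collapse("EE-n-c-yy-c-l-o-p-e-d-i-a"))
```

Under this rule a blank between two identical symbols keeps both of them, so ctc_collapse("f-r-e-i-i") gives "freii" while ctc_collapse("f-r-e-ii") gives "frei". That may be one way duplicated characters like the ones reported here can end up in the output.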
@Shreeshrii
https://github.com/tesseract-ocr/tessdata/tree/master/best is not working at all
deu.traineddata: 19.721 KB
best/deu.traineddata: 8.427 KB
The best traineddata only delivers empty results.
Are you using --oem 1?
You can see the contents of the traineddata with
combine_tessdata -u deu.traineddata
These are probably LSTM-only models and do not have the legacy engine, which is used via --oem 0
@stweil Have you tested the deu model?
Yes, I'm using --oem 1
I'm just switching deu.traineddata in tessdata. The old one works without problems; the new one gives empty output. I checked the download: the downloaded file has the same size as in the repository.
@stweil - you may need to update the Windows binaries on the Uni Mannheim site with the latest updates from Ray.
@TheSeiko I haven't personally tested the deu model. Will check and post the result. Wondering whether your Windows binary is old....
Looks like you need both deu and frk models
wget -O ./tess4data-save/deubest.traineddata https://github.com/tesseract-ocr/tessdata/blob/master/best/deu.traineddata?raw=true
sudo cp ./tess4data-save/*.traineddata /usr/share/tesseract-ocr/4.00/tessdata
time tesseract ./tif/phototest.tif stdout --oem 1 -l deu
time tesseract ./tif/phototest.tif stdout --oem 1 -l deubest
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
real 0m1.633s
user 0m2.032s
sys 0m0.492s
Error opening data file /usr/share/tesseract-ocr/4.00/frk.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'frk'
Page 1
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.
The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
Jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
real 0m2.045s
user 0m2.744s
sys 0m0.612s
works on linux - looks for frk traineddata, probably listed in deu.config
Have you tested the deu model?
The new best one? No, I have not tested it yet. I am currently focused on Fraktur where the new results clearly beat the old ones.
[...] update the windows binaries on Uni Mannheim site [...]
I noticed on Linux that "old" Tesseract executables crash with the new traineddata, so I expect that my current Windows binaries would crash, too. Building new ones is on my list.
thank you
[...] update the windows binaries on Uni Mannheim site [...]
The new binaries are now available. I now use semantic versioning, so this is my 4.0.0-alpha.20170804.
Thanks!
I now use semantic versioning, so this is my 4.0.0-alpha.20170804.
:-)
Thank you for the new binaries.
There are still similar errors:
hitzefrei -> 1 x hitzefreii / 1 x hitzefreil
Suggestion: The results are a lot better with the 4.0 LSTM engine than with 3.05.01, but training seems to be difficult. Maybe it would be a good idea to offer a web page where people could upload example image files and matching text files to include them in the training process.
looks for frk traineddata, probably listed in deu.config
@theraysmith, best/deu.traineddata includes a deu.config with tessedit_load_sublangs frk. Why was this dependency added? It is confusing for end users who want to use -l deu that they need frk.traineddata, too.
@TheSeiko, maybe you'd get better results for Antiqua text without that frk dependency (which might be good for texts which also include Fraktur). You can use combine_tessdata to extract the components of best/deu.traineddata, remove deu.config and combine the remaining components again in a new file.
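The extract / remove / recombine steps described above could be scripted roughly like this. It is only a sketch: the helper names, the tmp working directory and the "deu" default are assumptions for illustration, and it relies on the usual combine_tessdata usage (-u with an output prefix to unpack, a bare prefix to repack).

```python
import shutil
import subprocess
from pathlib import Path


def strip_config_commands(traineddata: Path, workdir: Path, lang: str = "deu"):
    """Build the two combine_tessdata calls (unpack, then repack)."""
    prefix = str(workdir / f"{lang}.")  # components are written as <prefix><name>
    unpack = ["combine_tessdata", "-u", str(traineddata), prefix]
    repack = ["combine_tessdata", prefix]
    return unpack, repack


def strip_config(traineddata: Path, workdir: Path, lang: str = "deu") -> Path:
    """Unpack a traineddata file, drop <lang>.config and combine the rest again."""
    if shutil.which("combine_tessdata") is None:
        raise RuntimeError("combine_tessdata not found on PATH")
    if not traineddata.is_file():
        raise FileNotFoundError(traineddata)
    workdir.mkdir(parents=True, exist_ok=True)
    unpack, repack = strip_config_commands(traineddata, workdir, lang)
    subprocess.run(unpack, check=True)
    # Removing the config drops the tessedit_load_sublangs setting.
    (workdir / f"{lang}.config").unlink(missing_ok=True)
    subprocess.run(repack, check=True)
    return workdir / f"{lang}.traineddata"
```

A call such as strip_config(Path("best/deu.traineddata"), Path("tmp")) should then leave the new file in tmp/deu.traineddata, which can be copied into the tessdata directory under a name of your choice.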
Thank you for the tip. Much appreciated!
@stweil Am I doing something wrong?
There's only a version file included in deu.traineddata when using the binaries from 04.08

E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

E:\Tesseract-OCR4.0a2>combine_tessdata -u deu.traineddata tmp/deu.
Extracting tessdata components from deu.traineddata
Wrote tmp/deu.version
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192

E:\Tesseract-OCR4.0a2>combine_tessdata -d deu.traineddata
Version string:4.0.0-alpha.20170804
23:version:size=20, offset=192
Looks like I introduced a bug. If the traineddata file doesn't exist, it makes an empty one with a version string in it, instead of complaining about the non-existent file.
-- Ray.
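Until a fixed binary is available, a wrapper script could detect this situation from the listing itself. The sketch below parses component lines in the format shown in the output above ("23:version:size=20, offset=192"); the function names are hypothetical, and other versions of combine_tessdata may print a different format.

```python
import re

# Matches lines such as "23:version:size=20, offset=192".
COMPONENT_LINE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_components(listing: str) -> dict:
    """Map component name -> size for each component line in the listing."""
    components = {}
    for line in listing.splitlines():
        match = COMPONENT_LINE.match(line.strip())
        if match:
            components[match.group(2)] = int(match.group(3))
    return components


def looks_empty(listing: str) -> bool:
    """True if the listing contains nothing besides a version component."""
    return set(parse_components(listing)) <= {"version"}


sample = "Version string:4.0.0-alpha.20170804\n23:version:size=20, offset=192"
print(parse_components(sample), looks_empty(sample))
```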
Ray, there have been a number of reports of people not being able to run the English tutorial training (missing eng.config etc.). They were posted in the tesseract-ocr forum.
I just made 3 commits that address some of these issues:
1) An error message for a missing --traineddata arg, referring to the wiki.
2) Emphasis that the lack of a config file is just a warning.
3) Detection of a non-existent traineddata file in combine_tessdata.
It seems the majority of the problems are lack of sync of code/data. There are dependencies between code and data that have changed due to moving the unicharset from the LSTM model to the traineddata file.
It seems the majority of the problems are lack of sync of code/data. There are dependencies between code and data that have changed due to moving the unicharset from the LSTM model to the traineddata file.
Yes. That is the problem.
One possible solution that I have been requesting for a while is tagging "important" commits. Then it would be easy to say: use tesseract, tessdata and langdata as of 4.0.0alpha-20170807.
@stweil thank you, removing deu.config helped a lot
Regarding the best deu traineddata without deu.config:
After ~50k test images: great recognition rate.
Only problem so far: sometimes i is not recognised properly:
sıch - sich
Parıs - Paris
I'm adding a regex to replace ı with i
and j -> J
OCR Result <-> Text in image
Jungen - jungen
Jury - jury
Juries - juries
$$-Jährige <-> $$-jährige
SPO - SPÖ
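The ı -> i replacement mentioned above could look like the following. This is a sketch that assumes the OCR output is German text, where the dotless ı (U+0131) never occurs legitimately; for Turkish text this would be wrong. The function name is made up for illustration.

```python
import re

# U+0131 LATIN SMALL LETTER DOTLESS I
DOTLESS_I = re.compile("\u0131")


def fix_dotless_i(text: str) -> str:
    """Replace every dotless i in the OCR output with a regular i."""
    return DOTLESS_I.sub("i", text)


print(fix_dotless_i("s\u0131ch Par\u0131s"))
```

The j -> J errors cannot be fixed this way, since both spellings are valid (see jungen / Jungen), so they would need context rather than a blanket replacement.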
Interesting. I remember from learning German at school that all nouns begin with a capital, so why do yours not? I would assume from the errors that you describe that the network has learned that all nouns begin with a capital, so it hallucinates one even when it is not there. If you have a lot of non-capital nouns for some reason, it might do better in 'Latin' than 'deu'
@theraysmith jährige/Jährige can be both: a noun (capital letter) or an adjective (lowercase):
a 42 year old man - ein 42-jähriger Mann
a 42 year old - ein 42-Jähriger
Latin works better for this problem; I had it running yesterday for ~100k frames. Latin has some problems with umlauts (mutated vowels), e.g.:
+------------+---------+-------------------+-------------------+---------------------+
| languageId | ranking | replaceTo         | replaceRegex      | inputDate           |
+------------+---------+-------------------+-------------------+---------------------+
|         10 |      10 | Österreich        | Osterreich        | 2017-08-03 14:45:05 | - DEU without FRAK
|         10 |      10 | Paris             | Parıs             | 2017-08-08 10:52:04 | - DEU without FRAK
|         10 |      10 | i                 | ı                 | 2017-08-08 13:04:50 | - DEU without FRAK
|         10 |      10 | ÖFB-Goalie        | OFB-Goalie        | 2017-08-08 14:20:24 | - DEU without FRAK
|         10 |      10 | Volkspartei       | Volksparte!l      | 2017-08-09 09:14:11 | - LATIN
|         10 |      10 | Eurofighter-Übung | Eurofighter-Ubung | 2017-08-09 09:34:09 | - LATIN
|         10 |      10 | Überlebende       | Uberlebende       | 2017-08-10 08:08:04 | - LATIN
|         10 |      10 | Eine              | Fine              | 2017-08-10 09:04:31 | - LATIN
|         10 |      10 | Oberwölz          | Oberwòölz         | 2017-08-10 10:31:30 | - LATIN
|         10 |      10 | Wörter            | Wõōrter           | 2017-08-10 14:25:23 | - LATIN
|         10 |      10 | Wörter            | Wōörter           | 2017-08-10 14:25:51 | - LATIN
|         10 |      10 | Wörter            | Wōrter            | 2017-08-10 14:26:45 | - LATIN
|         10 |      10 | Männer            | Māänner           | 2017-08-10 15:04:25 | - LATIN
+------------+---------+-------------------+-------------------+---------------------+
I've collected some example images and I'll try to do the "Fine Tuning Training"
|         10 |      10 | Arzl-Ost          | Arzl-0Ost         | 2017-08-11 09:34:41 | - LATIN
|         10 |      10 | Ein               | Fin               | 2017-08-11 09:35:26 | - LATIN
|         10 |      10 | Während           | Wāährend          | 2017-08-11 09:37:20 | - LATIN
|         10 |      10 | Oscarprämierter   | Oscarprāmierter   | 2017-08-11 10:02:07 | - LATIN

| 10 | 1502045216726 | Oberwölz             | Oberwõlz              |
| 10 | 1502047625611 | Militärbasis         | Militārbasis          |
| 10 | 1502057831054 | www.uncut.at         | WWWw.uncut.at         |
| 10 | 1502099066258 | Wörter               | Wõörter               |
| 10 | 1502269194006 | Donaupark zum Gratis | Donaupark zum 6Gratis |
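A replacement table like the one above could be applied to the OCR output along these lines. This is a sketch under assumptions: the Rule type and apply_rules are hypothetical names, the replaceRegex values are treated as literal strings (hence re.escape), and a lower ranking is assumed to mean "apply earlier", which the thread does not state.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    ranking: int        # assumed: lower values are applied first
    replace_regex: str  # text as produced by the OCR
    replace_to: str     # corrected text


# A few rows taken from the table above.
RULES = [
    Rule(10, "Osterreich", "Österreich"),
    Rule(10, "Uberlebende", "Überlebende"),
    Rule(10, "Volksparte!l", "Volkspartei"),
    Rule(10, "\u0131", "i"),
]


def apply_rules(text: str, rules: List[Rule]) -> str:
    """Apply all rules in ranking order; ties keep their list order."""
    for rule in sorted(rules, key=lambda r: r.ranking):
        text = re.sub(re.escape(rule.replace_regex), rule.replace_to, text)
    return text


print(apply_rules("Uberlebende in Osterreich", RULES))
```

Word-specific rules like these only cover errors that have already been seen, so they are a stopgap next to better training data.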
@TheSeiko, do you have example images which still show this issue? We need them to test a bug fix which was suggested in #3144.
@stweil I'll post some example images asap. It got a lot better but still happens.
One thing I've found out is that sometimes the dots of an umlaut, e.g. in Ö, are assigned to the previous line. So the Ö is recognised as an O and the previous line has dots added. Sometimes the reason for this is that the previous line has a different character size than the following paragraph. But this is only one case.
This looks like a different issue from the original one.
C:\Tesseract-OCR20200328>tesseract --version
tesseract v5.0.0-alpha.20200328
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE
Found libarchive 3.3.2 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5
Found libcurl/7.59.0 OpenSSL/1.0.2o (WinSSL) zlib/1.2.11 WinIDN libssh2/1.7.0 nghttp2/1.31.0

C:\Tesseract-OCR20190314>tesseract --version
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE
Österreich Jeder dritte Fuf&gaánger-Unfall in Osterreich
Österreich Jeder dritte Fußgänger-Unfall in Osterreich
@stweil - grófste
Fjorde
Der Kangertittivaq ist das grófste Fjord- system der Welt.
@stweil ÓFB-Legionaàr
Premier League Die ,Reds" haben weiter eine makellose Bilanz und sind klarer Tabellenführer.
@stweil A^4
Ungarn/Üsterreich Lebenslang für die vier Hauptangeklagten
@stweil 4 comes from nowhere
Kinoabende Architektur und Urbanismus in Zeiten des Klimawandels.
4
tesseract 4.00.00alpha
leptonica-1.74.1
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
Win10 64bit - built Uni Mannheim
deu.traineddata - Repeating of characters:
Current Behavior:
ÄGYPTEN -> ÄAGYPTEN
Grand-Prix -> Gräand-Prix
AUSTRALIEN -> AUSTRAÄLIEN
GROSSBRITANNIEN -> GROSSBRITAÄANNIEN
Expected Behavior:
ÄGYPTEN -> ÄGYPTEN
Grand-Prix -> Grand-Prix
AUSTRALIEN -> AUSTRALIEN
GROSSBRITANNIEN -> GROSSBRITANNIEN
Suggested Fix: 1 blob / 1 box should only be 1 outcome / 1 result
Additional Info: Example images are available for posting