itwood / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

German Fraktur-OCR suboptimal recognition because of wrong dictionary #674

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Try to OCR a German-language pre-1900 Fraktur book. Pre-1900 books, and even most post-1900 Fraktur books, differ from non-Fraktur texts by using the letter 'ſ' (long s) in roughly half of the places where 's' appears. Use of 'ſ' ceased with the rise of Antiqua fonts, where it was replaced by 's'. Ref.: 
http://en.wikipedia.org/wiki/Long_s

What is the expected output? What do you see instead?
I expect 's' recognized as 's', 'ſ' as 'ſ', and 'f' as 'f'. However, 'ſ' is 
converted internally to 's', so I can never tell which of the 's'es was a 'ſ'.

What version of the product are you using? On what operating system?
3.01 on Mac and Linux

Please provide any additional information below.
1. This may affect other Fraktur OCR too. I have not looked into it.

2. Obviously, for simplicity, an Antiqua dictionary is used where a Fraktur 
dictionary is needed. But this leads to a higher error rate because unknown (to 
the dictionary) words with 'ſ' are 'corrected' to known words with 's', 
erroneously.

Original issue reported on code.google.com by gtrw...@gmail.com on 12 Apr 2012 at 2:08

GoogleCodeExporter commented 9 years ago
The Fraktur data files that I trained all use a regular s for the long s. 
That's Danish, Swedish and German. Basically, that's because I have no use for 
long s'es: if I had distinguished between the two types of s, I would have to 
run a replacement filter afterwards converting long s'es to regular s'es. Extra 
work for no gain. I also think that is the common usage scenario (that is, 
converting texts to their Antiqua counterparts, since that's all people read 
nowadays). So I won't be the one putting in the legwork to create training 
data with long s'es. You are, of course, welcome to do this yourself, if you 
are up to it.

If you do look into this yourself, you can start with the files I used, which 
are available at
https://github.com/paalberti/tesseract-dan-fraktur
The dictionary is just a small part of it, though. You will have to edit all 
the tif/box files. There is some guidance on editing this type of file in the 
wiki pages. I should also note that my files have been updated to work with 
Tesseract version 3.01, but not yet with version 3.02. That's the next thing 
I have planned for them, but there always seems to be something more important 
to do, so I won't get around to it right away.

Just out of curiosity, what do you need the long s for?

Original comment by dsl602...@vip.cybercity.dk on 12 Apr 2012 at 3:37

GoogleCodeExporter commented 9 years ago
There are repercussions of this ſ/s-simplification that decrease recognition 
accuracy, IMHO.

There is the issue of training, where you try to squeeze two very different 
patterns (ſ and s) into one slot, presumably reducing recognition accuracy. 
Examples where I think this happens are words that differ only in ſ/f, like 
Luſt/Luft. I suspect misrecognition happens in both directions here, and an 
empirical look at OCR'ed text is a first confirmation.

Much more important, however, are those words which differ only in f/s but for 
which no ſ-word exists. The most notorious example is the very frequent 
auf/aus error. If there were a ſ-dictionary, and Tesseract could see that 
'auf' exists but 'auſ' does not, it could decide much better for or against 
'auf' in the case of a dubious f/ſ. I am not a Tesseract developer, so I hope 
I'm not just handwaving here. I base the argument on the observation that the 
dictionary matters during recognition: with a bad image, dictionary words are 
often still recognized, so a ſ-dictionary would matter in this case.
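The dictionary argument above can be sketched in Python. This is a hypothetical illustration, not Tesseract internals: the function name and the tiny word set are invented for the example.

```python
def disambiguate(prefix, suffix, dictionary):
    """Given an uncertain f/ſ glyph between prefix and suffix, return the
    unique candidate word found in the dictionary, or None if ambiguous."""
    candidates = [prefix + c + suffix for c in ("f", "ſ")]
    hits = [w for w in candidates if w in dictionary]
    return hits[0] if len(hits) == 1 else None

# A dictionary that distinguishes ſ from s: 'auf' exists, 'auſ' does not.
fraktur_dict = {"auf", "aus", "Luſt", "Luft"}

print(disambiguate("au", "", fraktur_dict))   # 'auf': only one candidate exists
print(disambiguate("Lu", "t", fraktur_dict))  # None: Luft and Luſt both exist
```

With an s-only dictionary this check is impossible, since 'auſ' would have been collapsed to 'aus', which is a valid word.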

On the other hand, bulk-converting ſ to s afterwards is easily and quickly 
done. I would never trade a possible gain in recognition for such a 
programming one-liner.
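The bulk conversion mentioned really is a one-liner; a minimal Python sketch (the sample sentence is made up):

```python
# Collapse long s to round s after recognition, instead of losing
# the distinction inside the engine.
text = "Die Luſt iſt groß"
print(text.replace("ſ", "s"))  # Die Lust ist groß
```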

Original comment by gtrw...@gmail.com on 12 Apr 2012 at 5:08

GoogleCodeExporter commented 9 years ago
You may have a valid point with the aus/auſ/auf example. I am certainly not a 
Tesseract developer, so no comment from me on the other issue you mention.

But still, if you'd like to see this changed any time soon, your best chance is 
to change it yourself. There are so many things that ought to be tweaked there 
that this is not at the top of my list, and it's been some time since I've 
found the time and energy to look at any of them.

It is quite likely that you can make this work better than I could, so I hope 
you do try.

Original comment by dsl602...@vip.cybercity.dk on 14 Apr 2012 at 8:15

GoogleCodeExporter commented 9 years ago
I hear you. In case I can make this possible, I will add a third reason to 
differentiate the dictionary in the German Fraktur version:

Most Fraktur books in the German language were printed before 1901, when the 
second "orthographic conference" took place. I only have a German link for this:
http://de.wikipedia.org/wiki/Orthographische_Konferenz_von_1901

But even earlier, in 1876, there was another reform; see
http://de.wikipedia.org/wiki/I._Orthographische_Konferenz

These led to marked changes in the dictionary, like sz -> ß, some ss -> ß, 
many th -> t, and c -> z. Quite a few changes cannot be described by simple 
rules.

The package, as it is now, uses a modern dictionary, which works well with 
Antiqua fonts, on Fraktur texts printed before 1901, leading to two results: 
1. correct words are not found in the dictionary, so they must be recognized 
letter by letter each time, increasing the error rate; 2. with bad images, 
wrong dictionary words are picked out of the noise by comparison with the 
dictionary.

So what is needed additionally is a version with a pre-1901 dictionary.

This is well known, but no one bothered because ABBYY provided a workable 
solution with a previous version of their engine. Now that this has been 
discontinued, people face an expensive web service and begin lamenting. I'm 
glad I could at least articulate these problems here.

Original comment by gtrw...@gmail.com on 15 Apr 2012 at 5:59

GoogleCodeExporter commented 9 years ago
I have Windows XP, and I would like to make my own Fraktur training data 
file(s).

Are they tailored to a font on the computer, or are they trained on variations 
of Fraktur fonts from different texts?

Is it a simple process to train Fraktur fonts from different texts using 
Windows XP and Tesseract 3? If possible, could I have step-by-step 
instructions on how to make Fraktur training data files which give greater 
accuracy than the Fraktur training data kindly supplied?

Original comment by nine.ele...@gmail.com on 7 Jun 2012 at 9:37

GoogleCodeExporter commented 9 years ago
@dsl602230:
What about creating an alternative language pack with 's' recognized as 's', 
'ſ' as 'ſ', and 'f' as 'f'?

Original comment by zde...@gmail.com on 12 Jan 2014 at 2:52

GoogleCodeExporter commented 9 years ago
Thanks for the explanation of why it might be useful to retain the ſ/s 
distinction until later in the recognition process, or all the way to the 
output. These are valid points, but we won't be able to address this issue for 
3.04.
The main difficulty is the lack of wordlist training data for building 
dictionaries for older variants of German (and other languages).
With 3.04, the wordlists that we use to build the dictionaries will be opened 
up and provided in the git repository. If you have access to a wordlist for 
older German (perhaps you are an academic working in that field, for instance) 
that has suitable licensing, it would be easy to add a parallel language 
directory for older German and train using that data.
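As an illustration of how such a parallel wordlist might be bootstrapped, here is a rough Python sketch that restores the long s in a modern wordlist using the rule "round s only word-finally". This is only an approximation of historical usage: syllable-final round s inside compounds (e.g. Ausgang) is not handled, so real output would need manual review.

```python
def to_long_s(word):
    """Approximate Fraktur spelling: replace every non-final 's' with 'ſ'.
    Caveat: syllable-final round s inside compounds is NOT handled."""
    if len(word) <= 1:
        return word
    return word[:-1].replace("s", "ſ") + word[-1]

for w in ["Lust", "aus", "Wasser"]:
    print(w, "->", to_long_s(w))
# Lust -> Luſt, aus -> aus, Wasser -> Waſſer
```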

Original comment by theraysm...@gmail.com on 4 Nov 2014 at 6:46