Open stweil opened 7 years ago
The new best/Fraktur.traineddata contains a word list (dictionary) with 897964 entries. It can be extracted like this:
combine_tessdata -u /usr/local/share/tessdata/Fraktur.traineddata Fraktur.
dawg2wordlist Fraktur.lstm-unicharset Fraktur.lstm-word-dawg wordlist
A short (still incomplete) review of that list shows lots of issues:
- The paragraph sign § (maybe others, too) is missing in that list.
- Odd entries like °*°*° or ©.
- URL-like entries such as .youtube.com, Youtube.com, YouTube.com, YouTube.COM and YOUTUBE.COM. None of those entries is common in historic texts which typically use Fraktur.
- Modern words like Internet which typically don't occur in Fraktur texts.
- Misspelled umlauts, like Abhiingigkeit, Abhngigkeit or Abh/ngigkeit instead of Abhängigkeit.
- Confusion of ß and B. Example: blaB (wrong) instead of blaß (correct). See also previous commits like this one for langdata.
- 2017-09-11: ii instead of the correct ü (for example "fiir" instead of "für").
- Many words (but not all) occur twice, once in their normal case and once in upper case.
Same as in the old (and most likely new) eng.traineddata. Seems to be normal.
In this case "normal" leads to unwanted effects. Tesseract uses those entries to decide about OCR results, and I see many of those uppercase words in my real OCR results. In most cases they are completely wrong (see for example these historic texts with COMPUTER).
If there is a need for uppercase words in some rare cases, I'd expect that those words could be generated programmatically from the normal form. I see no need to fill the word list with them.
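To illustrate that the uppercase forms carry no extra information, here is a minimal sketch that removes them from a word list. The helper name `drop_uppercase_duplicates` is hypothetical, and the rule it applies (drop an all-uppercase entry only when it equals the uppercased form of some other entry) is an assumption about what "generated programmatically" could mean. It says nothing about how the list was actually built.

```python
def drop_uppercase_duplicates(words):
    """Drop all-uppercase entries that merely repeat another entry in upper case.

    Entries that only exist in upper case (for example abbreviations) are kept.
    """
    words = list(words)
    # Uppercased forms of every entry that is not itself fully upper case.
    derived = {w.upper() for w in words if w != w.upper()}
    return [w for w in words if not (w == w.upper() and w in derived)]
```

For example, from the entries Computer, COMPUTER, USA, für and FÜR this keeps Computer, USA and für. The dropped forms can be rebuilt on demand with `str.upper()`.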
List of important missing characters in Fraktur.lstm-unicharset: paragraph §, tilde ~.
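A check like this can be scripted. The sketch below assumes the usual text layout of a unicharset file: the first line holds the number of entries, and each following line starts with the character, followed by a space and its properties. The function name `missing_characters` is hypothetical.

```python
def missing_characters(unicharset_lines, required):
    """Return the characters from `required` that have no unicharset entry.

    Assumes line 1 is the entry count and every later line begins with the
    character itself, separated from its properties by a space.
    """
    present = {
        line.split(" ", 1)[0]
        for line in unicharset_lines[1:]
        if line.strip()
    }
    return [ch for ch in required if ch not in present]
```

With the file extracted above, it could be called as `missing_characters(open("Fraktur.lstm-unicharset", encoding="utf-8").read().splitlines(), ["§", "~"])`.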
From issue #62: