Best traineddata feedback - Fraktur

stweil commented 7 years ago

From issue #62:

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is no clear winner between the two new files: in some cases -l Fraktur gives better results, in others -l frk is better. Even a 3.05-based Fraktur model is still better for some words, but in general the new LSTM-based models win.

Ray, it would be interesting to know the training differences of the two new Fraktur traineddata files. Did they use different fonts / training material / dictionaries?

stweil commented 7 years ago

The new best/Fraktur.traineddata contains a word list (dictionary) with 897964 entries. It can be extracted like this:

combine_tessdata -u /usr/local/share/tessdata/Fraktur.traineddata Fraktur.
dawg2wordlist Fraktur.lstm-unicharset Fraktur.lstm-word-dawg wordlist

A short (still incomplete) review of that list shows lots of issues:

At least the important paragraph character § (maybe others, too) is missing in that list.
The list contains lots of strange "words", for example °*°*° or Â©.
Many words (but not all) occur twice, once in their normal case and once in upper case.
Many entries are root domains like youtube.com, Youtube.com, YouTube.com, YouTube.COM and YOUTUBE.COM. None of those entries is common in the historic texts that typically use Fraktur.
The list contains modern words like Internet which typically don't occur in Fraktur texts.
Many words seem to be Dutch, French and other languages, but not German.
The list contains words which are definitely wrong, for example Abhiingigkeit, Abhngigkeit or Abh/ngigkeit instead of Abhängigkeit.
Many words are wrong because they confuse ß and B. Example: blaB (wrong) instead of blaß (correct). See also previous commits like this one for langdata.
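
Several of the issues above can be counted mechanically once the word list has been extracted. The following is a minimal sketch, assuming the extracted list has one entry per line; the function names and the patterns (including the short list of top-level domains) are illustrative assumptions, not anything Tesseract itself uses.

```python
import re
from collections import Counter

# Hypothetical heuristic: entries that look like domain names.
# The TLD list is an illustrative assumption and far from complete.
DOMAIN_RE = re.compile(r"^[\w-]+(\.[\w-]+)*\.(com|org|net|de)$", re.IGNORECASE)


def classify(word):
    """Return a list of issue labels for one word-list entry."""
    issues = []
    if DOMAIN_RE.match(word):
        issues.append("domain")
    if not any(ch.isalpha() for ch in word):
        # Entries such as punctuation-only strings contain no letter at all.
        issues.append("no-letters")
    if re.search(r"[a-zäöü]B$", word):
        # A lowercase letter followed by a final capital B suggests
        # that ß was confused with B (for example "blaB").
        issues.append("eszett-as-B")
    if "/" in word:
        issues.append("slash")
    return issues


def audit(words):
    """Count how often each issue label occurs in the word list."""
    counts = Counter()
    for word in words:
        for issue in classify(word):
            counts[issue] += 1
    return counts
```

To run it on the file written by dawg2wordlist, something like `audit(open("wordlist", encoding="utf-8").read().split())` should work. The heuristics will produce false positives, so the counts are only a starting point for manual review.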

2017-09-11

The wordlist includes words with ii instead of the correct ü (for example "fiir" instead of "für").
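
One way to find such entries is to look for words containing "ii" whose umlaut variant is also in the list. This is a rough sketch, not a definitive check: the function name is hypothetical, and it only flags an entry when the corrected spelling is itself present in the list.

```python
def ii_suspects(words, umlauts=("ü", "ä")):
    """Map entries containing 'ii' to a listed umlaut spelling, if any."""
    known = set(words)
    suspects = {}
    for word in words:
        if "ii" not in word:
            continue
        for umlaut in umlauts:
            candidate = word.replace("ii", umlaut)
            if candidate in known:
                # The corrected spelling exists, so the 'ii' form is
                # probably an OCR error that leaked into the list.
                suspects[word] = candidate
                break
    return suspects
```

Legitimate words with "ii" (for example "Hawaii") are left alone because their umlaut variant is not in the list.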

amitdo commented 7 years ago

> Many words (but not all) occur twice, once in their normal case and once in upper case.

Same as in the old (and most likely new) eng.traineddata. Seems to be normal.

stweil commented 7 years ago

In this case "normal" leads to unwanted effects. Tesseract uses those entries when deciding on OCR results, and I see many of those uppercase words in my real OCR results. In most cases they are completely wrong (see for example these historic texts with COMPUTER).

If there is a need for uppercase words in some rare cases, I'd expect that those words could be generated programmatically from the normal form. I see no need to fill the word list with them.
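
As a rough illustration of that idea, the sketch below removes every all-uppercase entry that can be regenerated from another entry in the list. The function name is hypothetical, and the filter works on the extracted text word list; it is not something Tesseract does itself.

```python
def drop_redundant_uppercase(words):
    """Remove all-uppercase entries that another entry would produce via upper()."""
    # Uppercase forms that can be generated from a mixed- or lower-case entry.
    derivable = {w.upper() for w in words if w != w.upper()}
    return [w for w in words if not (w == w.upper() and w in derivable)]
```

Uppercase entries without a normal-case counterpart (abbreviations, for example) are kept. Note that Python's `str.upper()` turns ß into SS, so "BLASS" is treated as derivable from "blaß".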

stweil commented 7 years ago

List of important missing characters in Fraktur.lstm-unicharset: paragraph §, tilde ~.
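
A check like this can be scripted against the extracted unicharset file. The sketch below assumes the usual text layout of a unicharset (first line is the number of entries, then one entry per line starting with the character followed by a space and its properties); that layout and the function names are assumptions for illustration, so verify them against your own file.

```python
def unicharset_symbols(text):
    """Collect the symbols listed in a unicharset file's text content."""
    lines = text.splitlines()
    # Assumption: the first line holds the number of entries.
    count = int(lines[0].split()[0])
    symbols = set()
    for line in lines[1:1 + count]:
        if line.strip():
            # Assumption: the symbol is the first space-separated field.
            symbols.add(line.split(" ", 1)[0])
    return symbols


def missing_symbols(text, required=("§", "~")):
    """Return the required symbols that are absent from the unicharset."""
    present = unicharset_symbols(text)
    return [ch for ch in required if ch not in present]
```

For example, `missing_symbols(open("Fraktur.lstm-unicharset", encoding="utf-8").read())` would report which of the two characters are absent from the file produced by combine_tessdata.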

tesseract-ocr / tessdata

Best traineddata feedback - Fraktur #65