tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models
Apache License 2.0

Best traineddata feedback - Fraktur #65

Open stweil opened 7 years ago

stweil commented 7 years ago

From issue #62:

The new files include two files for German Fraktur: best/Fraktur.traineddata and best/frk.traineddata. According to my first tests, both are better than the old deu_frak.traineddata and much better than the old frk.traineddata. There is no clear winner between the two new files: in some cases -l Fraktur gives better results, in others -l frk is better. Even a 3.05-based Fraktur model is still better for some words, but in general the new LSTM-based models win.

Ray, it would be interesting to know how the training of the two new Fraktur traineddata files differed. Did they use different fonts, training material or dictionaries?

stweil commented 7 years ago

The new best/Fraktur.traineddata contains a word list (dictionary) with 897964 entries. It can be extracted like this:

combine_tessdata -u /usr/local/share/tessdata/Fraktur.traineddata Fraktur.
dawg2wordlist Fraktur.lstm-unicharset Fraktur.lstm-word-dawg wordlist

A short (still incomplete) review of that list shows lots of issues:

2017-09-11

amitdo commented 7 years ago

Many words (but not all) occur twice, once in their normal case and once in upper case.

The same is true of the old (and most likely the new) eng.traineddata, so this seems to be normal.

stweil commented 7 years ago

In this case "normal" leads to unwanted effects. Tesseract uses those entries when deciding on OCR results, and I see many of those uppercase words in my real OCR results. In most cases they are completely wrong (see, for example, these historic texts with COMPUTER).

If there is a need for uppercase words in some rare cases, I'd expect that those words could be generated programmatically from the normal form. I see no need to fill the word list with them.
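
As a rough illustration of that idea, the sketch below (Python, not part of Tesseract) finds all-uppercase entries whose normal form is already in the list and drops them. The function names are hypothetical, and it assumes the word list is plain UTF-8 text with one entry per line, as written by the dawg2wordlist call above.

```python
from typing import Iterable, List, Set


def redundant_uppercase(words: Iterable[str]) -> Set[str]:
    """Return uppercase entries that can be derived from another entry.

    An entry counts as redundant if it equals its own uppercase form and
    some differently cased entry in the list uppercases to it.
    """
    entries = set(words)
    # Uppercase forms that could be generated from the non-uppercase entries.
    # Python's str.upper() maps "ß" to "SS", so "Straße" yields "STRASSE".
    generated = {w.upper() for w in entries if w != w.upper()}
    return {w for w in entries if w == w.upper() and w in generated}


def prune_wordlist(words: List[str]) -> List[str]:
    """Drop redundant uppercase entries while keeping the original order."""
    redundant = redundant_uppercase(words)
    return [w for w in words if w not in redundant]
```

Applied to the lines of the extracted wordlist file, prune_wordlist would keep entries such as abbreviations that exist only in upper case and remove only those that merely repeat a normal-case word. Whether a pruned list actually improves recognition would still have to be tested.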

stweil commented 7 years ago

Important characters missing from Fraktur.lstm-unicharset: section sign §, tilde ~.
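
A check like this can be scripted. The sketch below is only an assumption about the file layout: it treats the extracted unicharset as UTF-8 text whose first line holds the entry count and whose remaining lines each start with the character, followed by a space and its properties. The function names are hypothetical.

```python
from typing import Iterable, List, Set


def unicharset_symbols(text: str) -> Set[str]:
    """Collect the first space-separated field of every entry line."""
    symbols = set()
    for line in text.splitlines()[1:]:  # skip the assumed count line
        first = line.split(" ", 1)[0]
        if first:
            symbols.add(first)
    return symbols


def missing_characters(text: str, required: Iterable[str]) -> List[str]:
    """Return the required characters that have no entry of their own."""
    present = unicharset_symbols(text)
    return [ch for ch in required if ch not in present]
```

Reading Fraktur.lstm-unicharset into a string and calling missing_characters(text, ["§", "~"]) would then list whichever of the two characters has no entry.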