tesseract-ocr / tessdata

Trained models with fast variant of the "best" LSTM models + legacy models
Apache License 2.0

Duplicate and incomplete data for German fraktur #49

Open stweil opened 7 years ago

stweil commented 7 years ago

Both deu_frak.traineddata and frk.traineddata try to support German fraktur.

deu_frak is not part of the official tesseract-ocr/langdata, but comes from paalberti/tesseract-dan-fraktur. It does not support the new LSTM recognizer introduced by Tesseract 4, but currently gives better results for fraktur texts than frk (which supports LSTM).

frk can be improved a lot by adding missing characters (primarily the long s, but also the paragraph and dollar signs, and maybe more) and by basing it on the latest corrections for langdata. With an improved frk, deu_frak would no longer be needed.
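One way to check which characters a model lacks is to inspect its unicharset. The following is a minimal sketch, not part of the original report. It assumes the unicharset has already been unpacked to a text file (for example with `combine_tessdata -u`) and that it uses the plain-text layout where the first line holds the entry count and each later line starts with the symbol. The `REQUIRED_CHARS` list and both function names are hypothetical and only illustrate the idea.

```python
# Characters one might expect in German Fraktur text.
# Illustrative assumption only, not an official or exhaustive list.
REQUIRED_CHARS = ["ſ", "§", "$", "ß", "ä", "ö", "ü"]


def parse_unicharset(text):
    """Return the set of symbols listed in a unicharset dump.

    Assumes the first line is the entry count and every following
    non-empty line starts with the symbol, then a space, then properties.
    """
    symbols = set()
    for line in text.splitlines()[1:]:
        if not line.strip():
            continue
        # The symbol is everything before the first space.
        symbols.add(line.split(" ", 1)[0])
    return symbols


def missing_chars(unicharset_text, required=REQUIRED_CHARS):
    """List the required characters that the unicharset does not contain."""
    present = parse_unicharset(unicharset_text)
    return [c for c in required if c not in present]
```

Reading the unpacked file with `open(path, encoding="utf-8").read()` and passing the result to `missing_chars` would list the gaps. The exact name of the unpacked file depends on the Tesseract version and on the model type.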

It is unclear who invented the name frk for Frankish. Maybe it should be renamed.

amitdo commented 7 years ago

> It is unclear who invented the name frk for Frankish. Maybe it should be renamed.

frk is the ISO 639-3 code for Frankish.

amitdo commented 7 years ago

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

FYI, the source of the 'Language' column in the tables is the old Google Code download page. Ray uploaded the official traineddata files to that old page, and Zdenko added a few third-party files.

stweil commented 7 years ago

I should have explained my question better. Why is German fraktur called Frankish? Neither the characters, nor the words, nor the fonts used belong to the Frankish language. Without hints from others I would never have thought of using frk for German fraktur.

amitdo commented 7 years ago

It seems frk was trained using a modern German corpus and a small number of fonts.

amitdo commented 7 years ago

@stweil, maybe you want to close this issue?

stweil commented 7 years ago

Do you think that frk is the right name? Or should it be renamed, maybe deu_old or deu_frak (as people are used to that name)? "Frankish" is definitely the wrong description for the current frk.

amitdo commented 7 years ago

Is 'frk' only for German Fraktur?

stweil commented 7 years ago

I expect that the frk LSTM model will work quite well with Fraktur text in other languages, too. But the word list of frk is mainly based on German words (I estimate more than 95 % of the 473228 words are German). The list also includes a few words from English, Spanish, French, Latin, Russian and other languages. Many of them would not be expected in Fraktur text (jQuery, motherboard, ...). The German words show many of the known problems: ß/B, ii/ü and other confusions, lower case nouns (which should always be capitalized in German), upper case adjectives (which should normally be lower case), random words in all upper case, lots of web sites (also not typical for Fraktur) and so on.
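Some of the problems listed above can be counted automatically. The sketch below is a rough illustration, not the method used to build or check the real word list. It assumes the word list is available as plain text with one word per line. The category names, the `URL_LIKE` pattern and both function names are hypothetical. Confusions such as ß/B or ii/ü, and wrong capitalization of nouns and adjectives, need a dictionary and are not covered here.

```python
import re
from collections import Counter

# Heuristic for web addresses; the list of top-level domains is an
# illustrative assumption and far from complete.
URL_LIKE = re.compile(r"^(https?://|www\.)|\.(com|org|net|de)$", re.IGNORECASE)


def classify(word):
    """Assign a word to one coarse problem category, or "ok"."""
    if URL_LIKE.search(word):
        return "web"
    if not word.isalpha():
        return "non_alpha"      # digits, punctuation, and similar
    if len(word) > 1 and word.isupper():
        return "all_caps"       # e.g. HAUS
    if any(c.isupper() for c in word[1:]):
        return "mixed_case"     # e.g. jQuery
    return "ok"


def audit(words):
    """Count how many words fall into each category."""
    return Counter(classify(w.strip()) for w in words if w.strip())
```

Calling `audit` on the lines of a word list file gives a quick overview of how much of the list looks like web content rather than text one would find in Fraktur prints.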

@theraysmith, it would be really interesting to know more about the process that produced this and the other word lists. They look like extracts from random web sites. I don't think that good word lists for Fraktur can be produced like that.