Open stweil opened 7 years ago
It is unclear who invented the name frk for Frankish. Maybe it should be renamed.
frk is the ISO 639-3 code for Frankish.
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
FYI, the source of the 'Language' column in the tables is the old Google Code download page. Ray uploaded the official traineddata files to that old page, and Zdenko added a few third-party files.
I should have explained my question better. Why is German fraktur called Frankish? Neither the characters, nor the words, nor the fonts used belong to the Frankish language. And without hints from others I'd never have thought of using frk for German fraktur.
It seems frk is trained using a modern German corpus and a small number of fonts.
@stweil, maybe you want to close this issue?
Do you think that frk is the right name? Or should it be renamed, maybe deu_old or deu_frak (as people are used to that name)? "Frankish" is definitely the wrong description for the current frk.
Is 'frk' only for German Fraktur?
I expect that the frk LSTM model will work quite well with Fraktur text in other languages, too. But the word list of frk is mainly based on German words (I estimate more than 95 % of the 473228 words are German). The list also includes a few words from English, Spanish, French, Latin, Russian and other languages. Many of them would not be expected in Fraktur text (jQuery, motherboard, ...). The German words show many of the known problems: ß/B, ii/ü and other confusions, lower case substantives (which should always be upper case in German), upper case adjectives (which should normally be lower case), random words in all upper case, lots of web sites (also not typical for Fraktur) and so on.
@theraysmith, it would be really interesting to know more details of the process which leads to that and also the other word lists. They look like extracts from random web sites. I don't think that good word lists for Fraktur can be produced like that.
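Some of the word-list problems described above can be flagged mechanically. The following is only a rough sketch in Python, not part of Tesseract or langdata: the function name `flag_entry`, the regular expressions and the small `MODERN_TERMS` set are all hypothetical choices made for illustration. It does not try to detect wrong noun or adjective capitalization, which would need part-of-speech information, and the ii/ü and ß/B checks will produce false positives on legitimate words.

```python
import re

# Hypothetical heuristics for illustration; not part of Tesseract or langdata.
URL_LIKE = re.compile(r"^(https?://|www\.)|\.(com|org|net|de)$", re.IGNORECASE)
MODERN_TERMS = {"jquery", "motherboard"}  # the two examples named in the comment

def flag_entry(word):
    """Return the reasons why a word-list entry looks suspicious for Fraktur text."""
    reasons = []
    if URL_LIKE.search(word):
        reasons.append("web address")
    if word.lower() in MODERN_TERMS:
        reasons.append("modern term")
    # Longer purely alphabetic entries written entirely in upper case.
    if len(word) > 3 and word.isalpha() and word.isupper():
        reasons.append("all upper case")
    # A capital B directly after a lower case letter often stands for a misread ß.
    if re.search(r"[a-zäöü]B($|[a-zäöü])", word):
        reasons.append("possible ß/B confusion")
    # A double i is rare in German and often stands for a misread ü.
    if "ii" in word.lower():
        reasons.append("possible ii/ü confusion")
    return reasons

if __name__ == "__main__":
    for entry in ["Haus", "StraBe", "fiir", "jQuery", "www.example.com"]:
        print(entry, flag_entry(entry))
```

Entries that come back with an empty list are not necessarily good; the sketch only narrows down what a human would have to review.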
Both deu_frak.traineddata and frk.traineddata try to support German fraktur.
deu_frak is not part of the official tesseract-ocr/langdata, but comes from paalberti/tesseract-dan-fraktur. It does not support the new LSTM recognizer introduced by Tesseract 4, but currently gives better results for fraktur texts than frk (which supports LSTM). frk can be improved a lot by adding missing characters (primarily the long s, but also paragraph and dollar sign and maybe more) and based on latest corrections for langdata. With an improved frk, deu_frak would no longer be needed.