tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

German Fraktur #59

Open amitdo opened 7 years ago

amitdo commented 7 years ago

From https://github.com/tesseract-ocr/tesseract/issues/40

@stweil commented

Are there also new data files planned for old German (deu_frak)? I was surprised that the default English model with LSTM could recognize some words.

@theraysmith commented

I don't think I generated the original deu_frak. I have the fonts to do so with LSTM, but I don't know if I have a decent amount of corpus data to hand. With English at least, the language was different in the days of Fraktur (Ye Olde Shoppe). I know German continued to be written in Fraktur until the 1940s, so that might be easier. Or is there an old German that is analogous to Ye Olde Shoppe for English?

stweil commented

Fraktur was used for an important German newspaper (Reichsanzeiger) until 1945. I'd like to try some pages from that newspaper with Tesseract LSTM. Surprisingly even with the English data Tesseract was able to recognize at least some words written in Fraktur.

There is an Old High German (similar to Old English), but the German translation of the New Testament by Martin Luther (1521) was one of the first major printed books in German, and it basically started the modern German language (High German), which is still in use today.

@jbaiter commented

I have a decent amount of corpus data for Fraktur from scanned books at hand, about 500k lines in hOCR files (~50GB with TIF images). I've yet to publish it, but if you have somewhere I could send/upload it, I'd be glad to.

theraysmith commented

The md file documents the training process in tutorial detail, but line boxes and transcriptions sound perfect!

500k lines should make it work really well. I would be happy to take it and help you, but we would have to get into licenses, copyright and all that first. For now it might be best to hang on for the instructions.
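
For readers unfamiliar with hOCR: the line boxes and transcriptions discussed here can be read straight out of the markup. The following is only a minimal sketch using the Python standard library, not the tooling anyone in this thread used. It assumes each line is an element whose `class` includes `ocr_line` and whose `title` carries `bbox x0 y0 x1 y1`, as in the hOCR format; the names `HocrLineParser` and `extract_lines` are made up for illustration.

```python
import re
from html.parser import HTMLParser

# hOCR stores the box in the title attribute, e.g. "bbox 36 92 580 122; baseline 0 -5"
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrLineParser(HTMLParser):
    """Collect (bbox, text) for every element whose class includes 'ocr_line'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []     # finished (bbox, text) pairs
        self._depth = 0     # > 0 while inside an ocr_line element
        self._tag = None    # tag name of the current line element
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            # Already inside a line: only track nesting of the same tag name.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if "ocr_line" in classes and match:
            self._tag = tag
            self._depth = 1
            self._bbox = tuple(int(v) for v in match.groups())
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace between the word-level elements.
                text = " ".join("".join(self._chunks).split())
                self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_lines(hocr_text):
    """Return a list of ((x0, y0, x1, y1), transcription) pairs."""
    parser = HocrLineParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines
```

Calling `extract_lines(open(path, encoding="utf-8").read())` on one hOCR file gives the box and text for each line; cropping the matching TIF to each box would then yield line image and transcription pairs.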

jbaiter commented

The text is CC0 and the images are CC-BY-NC, so that shouldn't be an issue :-) They're going to be public anyway once I've prepped the dataset for publication.

Related: https://github.com/tesseract-ocr/tessdata/issues/49