paalberti / tesseract-dan-fraktur

Tesseract OCR training data for Danish written in Fraktur script and a few other languages

Size of corpus? #5

Closed AndBM closed 6 years ago

AndBM commented 6 years ago

Do you know the approximate size of the corpus you have trained on?

paalberti commented 6 years ago

The images are in the repository. If you meant the wordlists, the Danish one is based on texts totalling about 800,000 words and the German one on texts totalling about 400,000 words. The Swedish wordlist is larger, but as it isn't mine, I don't know the size of the original texts. No real training is done on the wordlists, though.