tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

European 18th century texts #60

Open stweil opened 7 years ago

stweil commented 7 years ago

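Many European texts from the 18th century use modern types (https://en.wikipedia.org/wiki/History_of_Western_typography) with some special properties (http://www.orbitals.com/self/ligature/ligature.htm). Tesseract currently supports OCR for such texts only partially, notably through enm, frm, ita_old and spa_old (see the wiki: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files), which are the only models that include the long s (ſ).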

Support is missing for Latin texts (Latin was used very often at that time) and for German texts, and possibly for other languages, too.
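
To illustrate the characters at stake, here is a minimal Python sketch that counts the long s and the Unicode Latin ligatures (U+FB00 to U+FB06) in a text sample, for example to check whether a candidate training text exercises them at all. It is not part of Tesseract or langdata; the function names and the sample string are made up for illustration.

```python
import unicodedata
from collections import Counter

# Characters of interest (an assumption for this sketch): the long s plus
# the Unicode Latin ligatures U+FB00..U+FB06, which include the long s-t ligature.
HISTORIC_CHARS = {"\u017f"} | {chr(cp) for cp in range(0xFB00, 0xFB07)}


def count_historic_chars(text):
    """Return a Counter of the historic characters that occur in text."""
    return Counter(ch for ch in text if ch in HISTORIC_CHARS)


def describe(counts):
    """Map each counted character to its official Unicode name."""
    return {unicodedata.name(ch): n for ch, n in counts.items()}


# Hypothetical sample line in 18th century English spelling.
sample = "Congreſs aſſembled, the firſt ﬁne ſpecimen"
found = count_historic_chars(sample)

for name, n in describe(found).items():
    print(f"{name}: {n}")
```

A text for which this reports no long s at all would probably not help a model learn that character, whatever language it is in.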

Shreeshrii commented 7 years ago

Can training data from the eMOP project be used to supplement the training for European languages?

http://emop.tamu.edu/TesseractTraining

stweil commented 7 years ago

Thank you for that link, it might help.

Shreeshrii commented 7 years ago

I downloaded the training files. They contain images of both blackletter (Fraktur) and regular fonts. From a cursory look, they may be more appropriate for the legacy engine, as many images seem to show random letters rather than proper text.

I am also going to experiment with the Franken+ tool used by eMOP.

http://emop.tamu.edu/outcomes/Franken-Plus

http://www.primaresearch.org/tools/Aletheia/Editions