Open stweil opened 7 years ago
Can training data from the eMOP project be used to supplement the training for European languages...
http://emop.tamu.edu/TesseractTraining
On 14-Mar-2017 2:08 PM, "Stefan Weil" notifications@github.com wrote:
Many European texts from the 18th century use modern types https://en.wikipedia.org/wiki/History_of_Western_typography with some special properties http://www.orbitals.com/self/ligature/ligature.htm. OCR for such texts is currently only partially supported by Tesseract, notably by enm, frm, ita_old and spa_old (see wiki https://github.com/tesseract-ocr/tesseract/wiki/Data-Files) which are the only models including the long s.
Support is missing for Latin texts (used very often at that time) or German texts, maybe others, too.
Thank you for that link, it might help.
I downloaded the training files. They contain both blackletter (Fraktur) and regular font images. From a cursory look, they may be more appropriate for the legacy engine, as many images seem to be random letters rather than proper text.
I am also going to experiment with the Franken+ tool used by eMOP.
Many European texts from the 18th century use modern types with some special properties. OCR for such texts is currently only partially supported by Tesseract, notably by enm, frm, ita_old and spa_old (see wiki), which are the only models including the long s.
Support is missing for Latin texts (used very often at that time) or German texts, maybe others, too.