ryanfb / latinocr-lat

'lat' repository, forked from https://github.com/ryanfb/ancientgreekocr-grc. The final training process for lat.traineddata
https://ryanfb.github.io/latinocr/
Apache License 2.0

Add/test training against more ligature forms #1

Open ryanfb opened 9 years ago

ryanfb commented 9 years ago

Passing --ligatures to text2image gives us st ligatures in Cardo and EB Garamond, as well as ffi and ffl (and possibly more) in EB Garamond. If we could find a way to conditionally turn on the hlig (historical ligatures) OpenType font feature in EB Garamond, we should be able to get ct ligatures out of it (and maybe out of Cardo too).
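As a rough sketch of the invocation described above, the helper below only assembles a text2image argument list with --ligatures switched on. The helper name and the file names in the usage note are hypothetical; the real training scripts may pass further options.

```python
def text2image_args(text_path, output_base, font, ligatures=True):
    """Build an argv list for Tesseract's text2image (nothing is executed here)."""
    args = [
        "text2image",
        f"--text={text_path}",          # training text to render
        f"--outputbase={output_base}",  # prefix for the generated .tif/.box pair
        f"--font={font}",               # e.g. "Cardo" or "EB Garamond"
    ]
    if ligatures:
        # Ask text2image to render ligature forms where the font provides them.
        args.append("--ligatures")
    return args
```

For example, `text2image_args("lat.training_text", "lat.Cardo.exp0", "Cardo")` (hypothetical file names) yields a list that could be handed to `subprocess.run`.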

Wyld has ligatures mapped onto accented Latin-1 (non-ASCII) characters: ÌËÊÉÈÇÅÄÃÂÁÀ. So we'd need to run text2image against these characters, then substitute them back to the characters we want before running training.
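A minimal sketch of that round trip, assuming the substitution is done on the training text before rendering and on the generated box file afterwards. The thread does not say which Wyld character stands for which ligature, so the two pairs in the mapping are made up for illustration, and both function names are hypothetical.

```python
# Hypothetical mapping: the real Wyld assignments are not given in the thread.
PLACEHOLDER_TO_LIGATURE = {"À": "ct", "Á": "st"}

def to_placeholders(text, mapping):
    """Swap ligature letter sequences for the font's placeholder characters."""
    # Replace longer sequences first so that e.g. "ffi" would win over "ff".
    for placeholder, seq in sorted(mapping.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(seq, placeholder)
    return text

def restore_box_glyphs(box_text, mapping):
    """Rewrite the glyph field of each box line back to the intended characters.

    Each box line has the form "<glyph> <left> <bottom> <right> <top> <page>".
    """
    restored = []
    for line in box_text.splitlines():
        glyph, sep, rest = line.partition(" ")
        restored.append(mapping.get(glyph, glyph) + sep + rest)
    return "\n".join(restored)
```

The text passed to text2image would go through `to_placeholders`, and the resulting box file through `restore_box_glyphs`, so that training sees the intended characters rather than the accented placeholders.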

ryanfb commented 9 years ago

For hlig support, it seems like we would need to modify Tesseract's text2image to accept a new argument (e.g. --opentype_features) that calls pango_ot_ruleset_add_feature with the corresponding PangoOTTag values (like Pango's syriac-fc.c does).