tesseract-ocr / langdata_lstm

Data used for LSTM model training
Apache License 2.0

θ in Greek book font rendered as swash form #56

Open nisbet-hubbard opened 7 months ago

nisbet-hubbard commented 7 months ago

IMG_1991

OCR result: ϑεοὶ γὰρ οὔποτ᾽,

This is an ordinary book font used in editions of classical texts. Because of the design of its theta, however, this letter is frequently OCR’ed as the swash form ϑ (U+03D1) rather than the ordinary θ (U+03B8). It then requires manual correction, because it stands out from the rest of the text when rendered in other (esp. sans) Greek fonts.
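Until the model is retrained, the manual correction can be scripted as a post-processing step. The following is only a minimal sketch of a workaround, not a fix for the recognition itself. The function name is hypothetical, and including the rho symbol ϱ (U+03F1) is an assumption that it is misrecognized in the same way as theta.

```python
# Post-processing workaround: replace Greek "symbol" variants emitted by OCR
# with the ordinary letters. This does not change what Tesseract recognizes.
SYMBOL_TO_LETTER = str.maketrans({
    "\u03d1": "\u03b8",  # GREEK THETA SYMBOL -> GREEK SMALL LETTER THETA
    "\u03f1": "\u03c1",  # GREEK RHO SYMBOL -> GREEK SMALL LETTER RHO (assumed to be affected too)
})


def normalize_greek_variants(text: str) -> str:
    """Return text with the swash/symbol forms mapped to ordinary letters."""
    return text.translate(SYMBOL_TO_LETTER)
```

Applied to the OCR result above, this changes only the first character and leaves accents and breathings untouched. It would also rewrite any ϑ that is intended in the source (for example in mathematical text), so it is only suitable for running prose.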

stweil commented 7 months ago

So improved training is necessary. Do you know a freely available computer font which emulates that design? Or is there a ground truth data set which can be used to train recognition of that font?

nisbet-hubbard commented 7 months ago

Yes, there is! There are two fonts with this sort of theta and rho, both under the SIL Open Font License.

GFS Heraklit: the text in the image probably used the italic of this. Scroll down for the download.

GFS Artemisia: in a slightly different style.
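With one of these fonts installed locally, synthetic training pages could be rendered with Tesseract's `text2image` tool. The sketch below only assembles and runs the command. The helper names, the file paths and the font name string are all assumptions: the name passed to `--font` has to match what `text2image --list_available_fonts --fonts_dir=<dir>` reports on your system.

```python
import subprocess


def build_text2image_command(text_file, output_base, font_name, fonts_dir):
    """Assemble a text2image invocation as an argument list (no shell quoting needed)."""
    return [
        "text2image",
        f"--text={text_file}",          # UTF-8 training text, e.g. polytonic Greek lines
        f"--outputbase={output_base}",  # prefix for the generated image and box files
        f"--font={font_name}",          # must match the name fontconfig reports (assumption)
        f"--fonts_dir={fonts_dir}",     # directory containing the downloaded font files
    ]


def render_training_page(text_file, output_base, font_name, fonts_dir):
    """Run text2image; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(
        build_text2image_command(text_file, output_base, font_name, fonts_dir),
        check=True,
    )
```

For example, `render_training_page("grc.txt", "out/grc.heraklit.exp0", "GFS Heraklit Italic", "fonts")` would be one plausible call, with every argument here being a placeholder.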