tesseract-ocr / langdata

Source training data for Tesseract for lots of languages
Apache License 2.0

Add Wynn, Eth, and Ash to Middle English script so it can also be used for Old English (Latin) #298

Open grantbarrett opened 1 year ago

grantbarrett commented 1 year ago

One of the gaps in Tesseract's ability to do quality OCR on historical texts is that it lacks just three characters needed to reasonably handle Old English texts written in Latin characters.

If you compare the characters available in the Tesseract trained data for Middle English with the character set of Old English using Latin, you'll see the omissions "Æ æ" (ash), "Ð ð" (eth), and "Ƿ ƿ" (wynn).

https://github.com/tesseract-ocr/langdata_lstm/blob/main/enm/enm.unicharset https://en.wikipedia.org/wiki/Old_English_Latin_alphabet

https://en.wikipedia.org/wiki/Eth https://en.wikipedia.org/wiki/Wynn
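For anyone who wants to repeat the comparison, here is a minimal sketch. It assumes the usual unicharset layout (the first line is the entry count, and each later line starts with the unichar followed by space-separated properties) and a locally downloaded copy of enm.unicharset. The function names are illustrative only and are not part of Tesseract.

```python
import sys
from pathlib import Path

# Old English letters named in this issue (upper and lower case).
OLD_ENGLISH_LETTERS = {"ash": "Ææ", "eth": "Ðð", "wynn": "Ƿƿ"}


def unicharset_symbols(text):
    """Return the set of unichar strings listed in unicharset file text.

    Assumes the first line is the entry count and every later line
    begins with the unichar, followed by space-separated properties.
    """
    lines = text.splitlines()
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_letters(symbols):
    """Map each letter name to the forms that are absent from symbols."""
    missing = {}
    for name, chars in OLD_ENGLISH_LETTERS.items():
        absent = [c for c in chars if c not in symbols]
        if absent:
            missing[name] = absent
    return missing


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_unicharset.py path/to/enm.unicharset
    content = Path(sys.argv[1]).read_text(encoding="utf-8")
    print(missing_letters(unicharset_symbols(content)))
```

The same check can be pointed at the unicharset of any other language to see whether it already contains some of these letters.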

Admittedly these three will require quite a bit of training to distinguish them from an AE digraph, D d, and P p, respectively, but of course, that's what we do here!
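When preparing training text or ground truth, it helps to be sure the right code points are used rather than look-alikes. A small sketch using Python's standard `unicodedata` module lists each letter next to the character it is most likely to be confused with. The pairing table simply restates the confusions mentioned above and is not taken from any Tesseract resource.

```python
import unicodedata

# (Old English letter, character or digraph it may be confused with)
CONFUSABLE_PAIRS = [
    ("Æ", "AE"), ("æ", "ae"),
    ("Ð", "D"), ("ð", "d"),
    ("Ƿ", "P"), ("ƿ", "p"),
]


def codepoint(char):
    """Format a single character as U+XXXX."""
    return f"U+{ord(char):04X}"


def describe(char):
    """Return the code point and Unicode name of a single character."""
    return f"{codepoint(char)} {unicodedata.name(char)}"


for letter, lookalike in CONFUSABLE_PAIRS:
    print(f"{letter}: {describe(letter)}  (may be confused with '{lookalike}')")
```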

As you can see on their respective Wikipedia pages, we may already have trained data for ash and eth in other languages (Danish and Norwegian, and Icelandic, Faroese, and Khmer, respectively), but there are other letter forms that may need to be accounted for, especially for wynn.

If we were able to make these changes, we could rename the Middle English trained data to show that it can be used for both Middle and Old English (Latin). That would differentiate it clearly from the "enm" three-letter code, especially for those who associate "Old English" primarily with blackletter script, which this trained data would not be suitable to handle. (Blackletter OCR can be handled by the tools at https://emop.tamu.edu/, although they are for older versions of Tesseract.)

stweil commented 1 year ago

It is possible to enhance the existing model with those additional glyphs. The original training was done with artificial training data, but I think that you will get better results with transcribed scans from historic books.