manisandro / gImageReader

A Gtk/Qt front-end to tesseract-ocr.
GNU General Public License v3.0

'Long s problem' in 18th century French texts #551

Closed by fremont444 2 years ago

fremont444 commented 2 years ago

Has anyone found a solution to the 'long s' problem when OCR-ing early French texts? That is, the 'long s' (ſ) comes out as an 'f'.

If you copy and paste text from PDFs in Okular, this problem disappears. Does anyone know why?

manisandro commented 2 years ago

This is related to tesseract, or more precisely to the tessdata files. gImageReader is just a front-end and does not do any recognition itself.
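
Since the recognition happens in tesseract, one possible workaround is to post-correct the OCR output text afterwards. The following is only a minimal sketch of that idea, not a feature of tesseract or gImageReader: the function name `restore_long_s` and the `lexicon` word set are hypothetical, and it assumes you can supply a reasonably complete list of lowercase French words. A word that is not in the lexicon has its lowercase 'f' characters swapped for 's' only when the result is a known word.

```python
import itertools
import re


def restore_long_s(text, lexicon, max_f=6):
    """Swap 'f' for 's' in words where doing so yields a lexicon word.

    `lexicon` is a hypothetical set of lowercase known words supplied by
    the caller. Words already in the lexicon are left untouched.
    """

    def fix_word(match):
        word = match.group(0)
        if word.lower() in lexicon:
            return word
        # The long s is a lowercase letter, so only lowercase 'f' is a candidate.
        positions = [i for i, ch in enumerate(word) if ch == "f"]
        if not positions or len(positions) > max_f:
            return word
        # Try the fewest substitutions first and stop at the first known word.
        for k in range(1, len(positions) + 1):
            for combo in itertools.combinations(positions, k):
                chars = list(word)
                for i in combo:
                    chars[i] = "s"
                candidate = "".join(chars)
                if candidate.lower() in lexicon:
                    return candidate
        return word

    # Match runs of Unicode letters (no digits, no underscores).
    return re.sub(r"[^\W\d_]+", fix_word, text)


if __name__ == "__main__":
    words = {"la", "raison", "est", "une", "chose", "fort", "utile"}
    print(restore_long_s("La raifon eft une chofe fort utile", words))
```

A heuristic like this cannot resolve cases where both spellings are real words (for example "fort" and "sort"), because a word already in the lexicon is kept as-is.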