ucd-library / csus-sp-2018-app

MIT License

Add whitelist functionality to OCR configuration #24

Closed DerekMaggio closed 5 years ago

DerekMaggio commented 5 years ago

Per Quinn: This example page includes some French characters (accents). Initial tests indicate these could be whitelisted into the Tesseract character set. This could be very important, and it would be nice to see some examples of the OCR output with and without the whitelist. For the production version, this would mean looking at how we configure Tesseract for an OCR run, and whether that could be done dynamically.
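
As a rough illustration of what a dynamically built whitelist could look like, here is a minimal Python sketch. It relies on Tesseract's `tessedit_char_whitelist` config variable. The helper name `build_whitelist_config` and the `FRENCH_ACCENTED` character set are assumptions made up for this example; they are not from the app's code, and the set is not exhaustive. Punctuation is deliberately left out, so a real configuration would likely need more characters.

```python
import string

# Illustrative (not exhaustive) set of accented characters used in French.
# This constant is an assumption for the example, not part of the project.
FRENCH_ACCENTED = "àâæçéèêëîïôœùûüÿÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ"


def build_whitelist_config(extra_chars: str = FRENCH_ACCENTED) -> str:
    """Build a Tesseract config string that restricts recognized output
    to ASCII letters, digits, and the given extra characters."""
    base = string.ascii_letters + string.digits
    # dict.fromkeys removes duplicates while keeping first-seen order.
    chars = "".join(dict.fromkeys(base + extra_chars))
    # Whitespace would split the option when the config string is tokenized.
    if any(c.isspace() for c in chars):
        raise ValueError("whitelist must not contain whitespace")
    return f"-c tessedit_char_whitelist={chars}"
```

If the app calls Tesseract through a wrapper such as pytesseract (an assumption; the thread does not say which binding is used), the resulting string could be passed as the `config` argument of `pytesseract.image_to_string`, and the same page could be run once with and once without it to compare the output.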

ctlevinsky commented 5 years ago

From the research I've done, all we need to do is add the corresponding .traineddata file to the $TESSDATA_PREFIX/tessdata folder.

The file can be found at: https://github.com/tesseract-ocr/tessdata/blob/master/fra.traineddata
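
A small check along these lines could confirm the file is where Tesseract will look for it. This is only a sketch: the function name `find_traineddata` is hypothetical, and it checks both `$TESSDATA_PREFIX/tessdata/` and `$TESSDATA_PREFIX/` itself, on the assumption that the expected layout may differ between Tesseract versions.

```python
import os
from pathlib import Path
from typing import Optional


def find_traineddata(lang: str, tessdata_prefix: Optional[str] = None) -> Optional[Path]:
    """Return the path to <lang>.traineddata if it exists under the prefix,
    otherwise None. Falls back to the TESSDATA_PREFIX environment variable."""
    prefix = tessdata_prefix or os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    root = Path(prefix)
    # Check the prefix/tessdata layout first, then the prefix directory itself.
    candidates = (
        root / "tessdata" / f"{lang}.traineddata",
        root / f"{lang}.traineddata",
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

Calling `find_traineddata("fra")` before running OCR would give an early, readable failure if the French data has not been installed yet.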