silas-hw / NRP-2022-OCR

OCR software project created for a Nuffield Research Placement
MIT License

Determine Quality of Text with Natural Language Processing #5

Open silas-hw opened 2 years ago

silas-hw commented 2 years ago

Determine the quality of the text output by pytesseract using natural language processing. This would give a measure of how good the output text is, so the program can decide whether to continue on to TTS or inform the user that the scanned image was too poor in quality.
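A minimal sketch of one way such a quality measure could work, assuming quality is approximated by the fraction of tokens that are recognised words. The names `KNOWN_WORDS`, `text_quality`, `should_continue_to_tts` and the 0.7 threshold are hypothetical and not part of the project; a real run would load a full English word list instead of the tiny stand-in set.

```python
import re

# Hypothetical stand-in vocabulary; a real run would load a full word list.
KNOWN_WORDS = frozenset(
    {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}
)


def text_quality(text, vocabulary=KNOWN_WORDS):
    """Return the fraction of alphabetic tokens found in the vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        # No readable tokens at all is treated as the worst possible score.
        return 0.0
    recognised = sum(1 for token in tokens if token in vocabulary)
    return recognised / len(tokens)


def should_continue_to_tts(text, threshold=0.7, vocabulary=KNOWN_WORDS):
    """Decide whether the OCR text is good enough to pass on to TTS.

    The threshold is an assumed value and would need tuning on real scans.
    """
    return text_quality(text, vocabulary) >= threshold
```

Membership tests against a `frozenset` take roughly constant time per token, so the score stays cheap even with a large vocabulary.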

One possible package for this is NLTK. An introductory tutorial can be found here

silas-hw commented 2 years ago

The text now gets autocorrected using the autocorrect package.

silas-hw commented 2 years ago

Linear searches using NLTK take far too long, although its ability to compute a Jaccard index could come in handy.
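For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below computes it in plain Python over character bigrams, which is an assumed choice of feature; NLTK offers the related `jaccard_distance` in `nltk.metrics.distance`, which is one minus this value. The helper names are hypothetical, and checking set membership first is only one possible way to avoid scanning the whole vocabulary for every word.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    word = word.lower()
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_index(a, b):
    """Return |a & b| / |a | b| for two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match_score(word, vocabulary):
    """Score how closely a word resembles anything in the vocabulary.

    Exact matches are answered with a constant-time set lookup; only
    unknown words fall back to comparing against every candidate.
    """
    word = word.lower()
    if word in vocabulary:
        return 1.0
    target = char_bigrams(word)
    return max(
        (jaccard_index(target, char_bigrams(candidate)) for candidate in vocabulary),
        default=0.0,
    )
```

With this split, clean OCR output never reaches the slow path, and the fallback scan runs only for the words that are already suspect.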