madmaze / pytesseract

A Python wrapper for Google Tesseract
Apache License 2.0
5.82k stars 721 forks

I think you need to improve character recognition by using and implementing ChatGPT in OCR #525

Closed me-suzy closed 11 months ago

me-suzy commented 11 months ago

This is a screenshot of a page from the Internet Archive. I tested Mathpix, and it recognizes about 55% of the characters correctly. That's OK, but I think you should incorporate an AI such as ChatGPT into the character recognition process to correct words whose characters were not recognized correctly. ChatGPT, Bard, and other AIs understand language and can recognize and correct misspelled words. In essence, if a word is missing a few letters, or if certain letters are hard to distinguish, ChatGPT can reconstruct the word and, at the same time, add the correct diacritics.

image

stefan6419846 commented 11 months ago

pytesseract is a simple wrapper around Tesseract, which itself already uses a trainable neural network (and thus ML, as you propose), so you should probably direct your enhancement proposal there. Personally, I do not think LLMs will be incorporated directly as you propose; instead, they should be part of your own implementation built on top of Tesseract.

With some custom training, you can already improve the overall recognition quality for specific use cases, including better handling of ligatures etc. Even without training, Tesseract and pytesseract already give you "low-level" access to the OCR results, for example TSV data including confidences, or hOCR output, which you can feed into any post-processing step you like.
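
As a rough sketch of that kind of post-processing (not something pytesseract ships): `pytesseract.image_to_data` with `output_type=Output.DICT` returns parallel lists such as `text` and `conf`, and words below some confidence can be picked out for a later correction step. The helper names `low_confidence_words` and `ocr_data`, the threshold of 60, and the sample dictionary are assumptions made up for illustration; the sample is hand-written and is not real OCR output.

```python
from typing import Dict, List, Tuple


def low_confidence_words(
    data: Dict[str, list], threshold: float = 60.0
) -> List[Tuple[int, str, float]]:
    """Return (index, text, confidence) for words recognized below `threshold`.

    `data` is expected to have the shape of pytesseract's DICT output:
    parallel lists under keys such as "text" and "conf".
    """
    flagged = []
    for i, (text, conf) in enumerate(zip(data["text"], data["conf"])):
        conf = float(conf)  # may arrive as int, float, or str
        if conf < 0 or not str(text).strip():
            continue  # conf of -1 marks non-word rows (page, block, line)
        if conf < threshold:
            flagged.append((i, text, conf))
    return flagged


def ocr_data(image_path: str, lang: str = "eng") -> Dict[str, list]:
    """Run Tesseract on an image and return the word-level data as a dict."""
    # Imported lazily so the helper above works without Tesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path), lang=lang, output_type=pytesseract.Output.DICT
    )


# Hand-written sample in the shape of the DICT output (illustrative only).
sample = {
    "text": ["", "Iconostasul", "b1serica", "este"],
    "conf": [-1, 96.1, 41.7, "88"],
}
flagged = low_confidence_words(sample)
print(flagged)
```

The flagged words (here, only the entry at index 2) are the ones you could hand to a spell checker, a dictionary lookup, or an LLM of your choice, while leaving the high-confidence words untouched.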

me-suzy commented 11 months ago

Also, Tesseract cannot OCR this kind of document very well, especially if the writing is slanted.

https://archive.org/details/florenski-pavel-iconostasul-scan_202311/page/n33/mode/2up

stefan6419846 commented 11 months ago

Sorry, but this is still out of scope for pytesseract. Please discuss such issues on the Tesseract mailing list.