Status: Closed (me-suzy closed this issue 11 months ago)
pytesseract is a simple wrapper around Tesseract, which itself already uses a trainable neural net (and thus ML, as you propose), so you should probably redirect your enhancement proposal there. Personally, I do not think that LLMs will be incorporated directly as you propose; they should instead be part of your own implementation that relies on Tesseract.
With some custom training, you can already improve the overall detection quality for specific use cases, including extended handling of ligatures etc. Even without training, Tesseract and pytesseract already give you "low level" access to the OCR results, for example TSV data including confidences, or hOCR output, which you can feed into any other post-processing step you like.
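As a rough illustration of that "low level" access, here is a minimal sketch that picks out low-confidence words from the word-level data returned by `pytesseract.image_to_data` with `output_type=Output.DICT`. The helper names `low_confidence_words` and `ocr_low_confidence`, the threshold of 60, and the `eng` language default are assumptions made for this example, not part of pytesseract.

```python
from typing import Dict, List, Tuple


def low_confidence_words(
    data: Dict[str, list], threshold: float = 60.0
) -> List[Tuple[int, str, float]]:
    """Return (row index, word, confidence) for words below `threshold`.

    `data` is expected to have the shape of the dict produced by
    pytesseract.image_to_data(..., output_type=Output.DICT), i.e. parallel
    lists under keys such as "text" and "conf".
    """
    flagged = []
    for i, (text, conf) in enumerate(zip(data["text"], data["conf"])):
        conf = float(conf)  # may arrive as int, float, or numeric string
        # Rows with a confidence of -1 describe pages, blocks, or lines.
        if conf < 0 or not str(text).strip():
            continue
        if conf < threshold:
            flagged.append((i, text, conf))
    return flagged


def ocr_low_confidence(image_path: str, lang: str = "eng", threshold: float = 60.0):
    """Run Tesseract on an image and return its low-confidence words."""
    # Imported lazily so low_confidence_words() works without Tesseract.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), lang=lang, output_type=pytesseract.Output.DICT
    )
    return low_confidence_words(data, threshold)
```

The flagged words are the natural candidates to hand to whatever post-processing step you choose, while high-confidence words can be left untouched.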
Also, Tesseract cannot read (OCR) this kind of document very well, especially if the writing is slanted.
https://archive.org/details/florenski-pavel-iconostasul-scan_202311/page/n33/mode/2up
Sorry, but this is still out of scope for pytesseract. Please discuss such issues on the Tesseract mailing list.
This is a screenshot of a page from the Internet Archive. I tested Mathpix, and it correctly recognizes only about 55% of the characters. That's OK, but I think you should incorporate AI, such as ChatGPT, into the character recognition process to correct words whose characters haven't been recognized correctly. ChatGPT, Bard, and other AIs know how to recognize language and words, and they are capable of recognizing and correcting misspelled words. In essence, if a word is missing a few letters or if certain letters are not easily distinguishable, ChatGPT can reconstruct the word and, at the same time, correctly add diacritics.
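To make the idea of correcting misrecognized words concrete, here is a minimal sketch of a post-correction step that runs on the OCR output. It does not call ChatGPT or any other LLM: as a plainly simpler stand-in, it uses fuzzy matching from Python's standard `difflib` module against a word list. The function name `correct_words`, the tiny Romanian vocabulary, and the cutoff of 0.8 are all assumptions made for illustration.

```python
import difflib
from typing import Iterable, List


def correct_words(
    words: Iterable[str], vocabulary: List[str], cutoff: float = 0.8
) -> List[str]:
    """Replace each word with its closest vocabulary entry, if one is close enough.

    Words already in the vocabulary, and words with no sufficiently similar
    entry, are returned unchanged.
    """
    corrected = []
    for word in words:
        if word in vocabulary:
            corrected.append(word)
            continue
        # get_close_matches returns the best candidates first.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else word)
    return corrected


# Hypothetical vocabulary; a real one would come from a dictionary file.
vocabulary = ["biserică", "icoană", "lumină"]
print(correct_words(["biserica", "icoana", "xyz"], vocabulary))
```

A word-list matcher like this can restore diacritics and fix small character errors, but unlike a language model it cannot use sentence context, so it will pick the wrong word when several vocabulary entries are equally close.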