hc-sc-ocdo-bdpd / file-processing

A metadata extraction tool for various file types
https://hc-sc-ocdo-bdpd.github.io/file-processing-tools/
MIT License

Alternative OCR Libraries #111

Open pgaviganHC opened 1 year ago

pgaviganHC commented 1 year ago

It may be worth trying some alternative OCR libraries, as discussed in this article: https://www.statcan.gc.ca/en/data-science/network/character-recognition

Might be a good idea to have these alternatives be available options in this library.

benjaminLuoHC commented 10 months ago

Based on my understanding, Tesseract (the current OCR library) excels at documents, while CRAFT (the model mentioned in the link) is for text detection in more complicated images. They work well in conjunction.

Another option is EasyOCR, which does not require any external dependencies (Issue #4). It uses CRAFT for text detection and then its own recognition engine, so it is essentially Tesseract+CRAFT in a single library, though potentially less powerful because it is more lightweight.
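
To make the "available options" idea concrete, here is a rough sketch of how the two engines could sit behind one switch. The names `OCR_BACKENDS`, `extract_text`, `ocr_with_tesseract`, and `ocr_with_easyocr` are hypothetical and not part of this library's current API. It assumes the `pytesseract`, `Pillow`, and `easyocr` packages; imports are deferred so that only the chosen engine needs to be installed.

```python
from typing import Callable, Dict, Sequence


def ocr_with_tesseract(image_path: str) -> str:
    """Run Tesseract via pytesseract (needs the external Tesseract binary)."""
    # Imported lazily so environments without Tesseract can still use EasyOCR.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def ocr_with_easyocr(image_path: str, languages: Sequence[str] = ("en",)) -> str:
    """Run EasyOCR (pure pip install; model weights download on first use)."""
    import easyocr

    # gpu=False keeps the sketch runnable on CPU-only machines.
    reader = easyocr.Reader(list(languages), gpu=False)
    # detail=0 returns only the recognized strings, without boxes or scores.
    return "\n".join(reader.readtext(image_path, detail=0))


# Hypothetical registry mapping an option name to an OCR callable.
OCR_BACKENDS: Dict[str, Callable[[str], str]] = {
    "tesseract": ocr_with_tesseract,
    "easyocr": ocr_with_easyocr,
}


def extract_text(image_path: str, backend: str = "tesseract") -> str:
    """Dispatch to the requested OCR backend, defaulting to the current one."""
    try:
        engine = OCR_BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"Unknown OCR backend {backend!r}; choose from {sorted(OCR_BACKENDS)}"
        ) from None
    return engine(image_path)
```

With something like this, `extract_text("scan.png", backend="easyocr")` would switch engines without touching the calling code, and Tesseract stays the default.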

pgaviganHC commented 10 months ago

Good insight here, thanks. I wonder if we could quantify the performance difference between these options with a simple test of some sort?
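
One possible shape for such a test, as a minimal sketch only: score each engine by character error rate (edit distance divided by the length of the ground-truth text) over a small set of labelled images. The function names and the `(image_path, expected_text)` sample format are assumptions for illustration; each engine is passed in as a plain callable, so nothing here depends on a specific OCR library.

```python
from typing import Callable, Dict, Iterable, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length (lower is better)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def compare_engines(
    engines: Dict[str, Callable[[str], str]],
    samples: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the mean character error rate per engine over all samples."""
    sample_list = list(samples)
    if not sample_list:
        raise ValueError("At least one (image_path, expected_text) sample is required")
    results = {}
    for name, engine in engines.items():
        rates = [
            character_error_rate(expected, engine(path))
            for path, expected in sample_list
        ]
        results[name] = sum(rates) / len(rates)
    return results
```

The `engines` dictionary could hold thin wrappers around Tesseract and EasyOCR, with a handful of document scans and more complicated images as samples. Timing each call would be a natural extension if speed matters as much as accuracy.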

BrennanHCSC commented 9 months ago
benjaminLuoHC commented 7 months ago

EasyOCR has been recommended to us by Microsoft for use in MS Fabric (the current Tesseract implementation cannot be installed in the MS Fabric environment).

We need to do a security check on EasyOCR.