hc-sc-ocdo-bdpd / file-processing-tools

A metadata extraction tool for various file types
https://hc-sc-ocdo-bdpd.github.io/file-processing-tools/
MIT License

Alternative OCR Libraries #111

Open pgaviganHC opened 8 months ago

pgaviganHC commented 8 months ago

It may be worth trying some alternative OCR libraries, as discussed in this article: https://www.statcan.gc.ca/en/data-science/network/character-recognition

It might be a good idea to make these alternatives available as options in this library.

benjaminLuoHC commented 6 months ago

Based on my understanding, Tesseract (the current OCR library) excels at documents, while CRAFT (the model mentioned in the link) is for text detection in more complicated images. They work well in conjunction.

Another option is EasyOCR, which does not require any external dependencies (Issue #4). It uses CRAFT for text detection and then its own OCR engine, so it is essentially the Tesseract+CRAFT combination in a single library, though potentially less powerful because it is more lightweight.
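
To make the "available options" idea concrete, here is a minimal sketch of a selectable OCR backend. The names `extract_text`, `OCR_BACKENDS`, `ocr_with_tesseract` and `ocr_with_easyocr` are hypothetical and are not part of this library's current API. The sketch assumes `pytesseract` (plus Pillow and the Tesseract binary) and `easyocr` are installed only when the matching backend is actually used.

```python
from typing import Callable, Dict


def ocr_with_tesseract(image_path: str) -> str:
    """Run Tesseract on an image file via pytesseract."""
    # Imported lazily so this module still loads where Tesseract is unavailable.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def ocr_with_easyocr(image_path: str, languages=("en",)) -> str:
    """Run EasyOCR (CRAFT detection followed by its own recognizer)."""
    import easyocr

    # gpu=False keeps the sketch runnable on CPU-only machines.
    reader = easyocr.Reader(list(languages), gpu=False)
    # detail=0 returns only the recognized strings, without boxes or scores.
    return "\n".join(reader.readtext(image_path, detail=0))


# Hypothetical registry mapping a backend name to its OCR function.
OCR_BACKENDS: Dict[str, Callable[[str], str]] = {
    "tesseract": ocr_with_tesseract,
    "easyocr": ocr_with_easyocr,
}


def extract_text(image_path: str, backend: str = "tesseract") -> str:
    """Dispatch to the requested OCR backend."""
    try:
        engine = OCR_BACKENDS[backend]
    except KeyError:
        known = ", ".join(sorted(OCR_BACKENDS))
        raise ValueError(f"Unknown OCR backend {backend!r}; expected one of: {known}") from None
    return engine(image_path)
```

Something like `extract_text("scan.png", backend="easyocr")` would then switch engines without touching the calling code. Note that building a new `easyocr.Reader` on every call reloads the models, so a real implementation would probably want to cache the reader.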

pgaviganHC commented 5 months ago

Good insight here, thanks. I wonder if we could quantify the performance difference between these options with a simple test of some sort?
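
One simple way to quantify the difference could be character error rate (CER) against hand-checked reference text. The sketch below is only an illustration: the metric choice and the names `levenshtein`, `character_error_rate` and `rank_engines` are assumptions, not something already agreed on in this thread, and the engine outputs would come from running each OCR library on the same test images.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by reference length; lower is better."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def rank_engines(reference: str, outputs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (engine name, CER) pairs sorted from best to worst."""
    scores = {name: character_error_rate(reference, text) for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Averaging the CER over a small set of representative files (clean scans, photos, screenshots) would give one number per engine. Timing each call would cover the speed side of the comparison.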

BrennanHCSC commented 5 months ago
benjaminLuoHC commented 2 months ago

Microsoft has recommended EasyOCR to us for use in MS Fabric, since the current Tesseract implementation cannot be installed in the MS Fabric environment.

We need to do a security check on EasyOCR.