Open viktor2097 opened 3 years ago
@viktor2097 try this one https://scandocflow.com/pdf-extraction/invoice
I'm not sure how this is relevant. EasyOCR works fine, and I can extract the data with Google Cloud Vision. pytesseract, however, struggles to extract proper data. Since this project uses pytesseract, maybe it's something the owner is interested in solving, as these are pretty standard invoice documents.
@viktor2097 That's a great idea. I'm also trying to replace the Pytesseract sections with EasyOCR.
@LordDraagonLive Let me know once you're done 🥇 :)
I have been working on integrating EasyOCR into InvoiceNet to provide an option to choose between Tesseract and EasyOCR. However, it looks like EasyOCR doesn't return bounding box coordinates at the word level and instead returns bounding boxes for phrases. The attend-copy-parse framework requires bounding box information at the word level for accurate n-gram generation, which Tesseract provides by default.
I'm not too familiar with the EasyOCR API so if there's a way to get bounding box information at the word-level, please let me know.
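One possible workaround, sketched here as an approximation rather than a feature of EasyOCR itself, is to split each phrase-level box into word boxes in proportion to character counts. The sketch assumes a detection is given as four `[x, y]` corner points plus the recognized text and that the text runs roughly horizontally; the function name `split_phrase_box` is hypothetical. Proportional fonts and skewed scans will make the resulting boxes inexact.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def split_phrase_box(
    points: Sequence[Sequence[float]], text: str
) -> List[Tuple[str, Box]]:
    """Approximate word-level boxes from one phrase-level detection.

    Assumes horizontal text and equal width per character, so the
    result is only an estimate of where each word sits.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    words = text.split()
    if not words:
        return []

    # Count every gap between words as exactly one character wide.
    total_chars = sum(len(w) for w in words) + (len(words) - 1)
    char_width = (x_max - x_min) / total_chars

    result = []
    cursor = 0  # character offset from the start of the phrase
    for word in words:
        left = x_min + cursor * char_width
        right = left + len(word) * char_width
        result.append((word, (left, y_min, right, y_max)))
        cursor += len(word) + 1  # the word plus the following space
    return result


if __name__ == "__main__":
    corners = [[0, 0], [120, 0], [120, 20], [0, 20]]
    for word, box in split_phrase_box(corners, "Invoice 1234"):
        print(word, box)
```

Whether boxes estimated this way are accurate enough for n-gram generation would need to be checked against Tesseract's output on the same invoices.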
@naiveHobo Yeah, I tried that and faced the same issue, and it was extremely slow with EasyOCR. I just edited the code to use Azure Computer Vision for now to improve the OCR part a bit, and it seems to work.
I came to the same conclusion: preparing training data takes way too long with EasyOCR (we're talking hours for 800 invoices) running on a 3090.
I managed to edit the code to use Azure instead too. It would be kinda cool to have the option out of the box to use a third-party OCR engine like Azure or Google Cloud Vision. It's a necessity right now if you want results reliable enough for production.
> I managed to edit the code to use Azure instead too.
I'd definitely merge it if you open a PR :eyes:
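For anyone picking this up, the pluggable-engine option discussed above could be sketched roughly as a small registry. This is not InvoiceNet's actual code: `Word`, `register_engine`, `run_ocr`, and the `dummy` engine are all hypothetical names, and a real backend (pytesseract, Azure, Google Cloud Vision) would replace the stand-in function while returning the same word-level boxes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Word:
    """One recognized word with its bounding box in pixels."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


# An engine takes an image path and returns word-level boxes.
OcrEngine = Callable[[str], List[Word]]

_ENGINES: Dict[str, OcrEngine] = {}


def register_engine(name: str) -> Callable[[OcrEngine], OcrEngine]:
    """Decorator that makes an engine selectable by name."""
    def decorator(fn: OcrEngine) -> OcrEngine:
        _ENGINES[name] = fn
        return fn
    return decorator


def run_ocr(image_path: str, engine: str = "tesseract") -> List[Word]:
    """Dispatch to the chosen engine, failing loudly on unknown names."""
    try:
        fn = _ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"unknown OCR engine {engine!r}; available: {sorted(_ENGINES)}"
        ) from None
    return fn(image_path)


@register_engine("dummy")
def _dummy_engine(image_path: str) -> List[Word]:
    # Illustration only: a real backend would call its OCR service here
    # and convert the response into Word objects.
    return [Word("TOTAL", 0, 0, 50, 10)]


if __name__ == "__main__":
    print(run_ocr("invoice.png", engine="dummy"))
```

Keeping every backend behind the same word-level return type means the rest of the pipeline would not need to know which engine produced the boxes.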
Pytesseract struggles with a lot of invoices; some very big, clear text can't be read.
This is somewhat addressable with preprocessing in OpenCV, such as blurring and thresholding, but it requires such an enormous amount of preprocessing that some other invoices might start failing if they go through the same steps.
EasyOCR picks it up fine, but is incredibly slow. I'd be happy to provide some working and non-working examples privately.
The text is super clear, but I suspect Tesseract struggles with the surrounding border and background. I'm able to remove most of it with preprocessing, but again, pytesseract then performs worse on some other invoices.
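One way to keep preprocessing from hurting invoices that were already fine is to run OCR on several variants of the same image and keep whichever result scores best. The sketch below assumes the OCR step returns `(word, confidence)` pairs (Tesseract does report per-word confidences) and uses mean confidence as the score, which is only a heuristic. `best_ocr`, `mean_confidence`, and the injected callables are hypothetical names, and the `__main__` section uses fake stand-ins instead of real images.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

OcrResult = List[Tuple[str, float]]  # (word, confidence on a 0..100 scale)


def mean_confidence(result: OcrResult) -> float:
    """Average word confidence; an empty result scores 0."""
    if not result:
        return 0.0
    return sum(conf for _, conf in result) / len(result)


def best_ocr(
    image: Any,
    ocr: Callable[[Any], OcrResult],
    variants: Sequence[Tuple[str, Callable[[Any], Any]]],
) -> Tuple[Optional[str], OcrResult]:
    """Run OCR on each preprocessing variant and keep the best-scoring one.

    `variants` pairs a label with a function that transforms the image;
    including an identity variant keeps the unprocessed image in the race.
    """
    best_name: Optional[str] = None
    best_result: OcrResult = []
    best_score = -1.0
    for name, preprocess in variants:
        result = ocr(preprocess(image))
        score = mean_confidence(result)
        if score > best_score:
            best_name, best_result, best_score = name, result, score
    return best_name, best_result


if __name__ == "__main__":
    # Fake OCR for illustration: pretends thresholding helps this image.
    def fake_ocr(img: str) -> OcrResult:
        if img.endswith("+threshold"):
            return [("TOTAL", 90.0)]
        return [("T0TAL", 40.0)]

    variants = [
        ("raw", lambda img: img),
        ("threshold", lambda img: img + "+threshold"),
    ]
    print(best_ocr("invoice", fake_ocr, variants))
```

The trade-off is one OCR pass per variant, so this multiplies processing time, and a high confidence score does not guarantee the text was read correctly.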