Pytesseract is not sufficient for reliable results

naiveHobo / InvoiceNet

Deep neural network to extract intelligent information from invoice documents.

MIT License

2.48k stars 393 forks

Pytesseract is not sufficient for reliable results #38

Open viktor2097 opened 3 years ago

viktor2097 commented 3 years ago

Pytesseract struggles with a lot of invoices; even some very large, clear text cannot be read.

This is somewhat addressable with preprocessing in OpenCV, such as blurring and thresholding, but it takes such an enormous amount of preprocessing that other invoices may start failing when they go through the same pipeline.

EasyOCR picks it up fine, but is incredibly slow. I'd be happy to provide some working and non working examples privately.

The text is super clear, but I suspect Tesseract struggles with the surrounding border and background. I'm able to remove most of it with preprocessing, but again, pytesseract then performs worse on some other invoices.
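One way to soften the "one pipeline breaks other invoices" problem described above is to run several preprocessing variants per image and keep whichever one Tesseract is most confident about. The following is only a minimal sketch under assumptions: the variant names, kernel size, and adaptive-threshold parameters are illustrative choices, not values taken from InvoiceNet, and mean word confidence is a rough proxy for OCR quality rather than a guarantee.

```python
from statistics import mean


def mean_confidence(ocr_data):
    """Average Tesseract confidence over recognised words.

    `ocr_data` is the dict returned by pytesseract.image_to_data(...,
    output_type=Output.DICT). Rows with conf == -1 are layout rows, not words.
    """
    confs = [
        float(conf)
        for conf, text in zip(ocr_data["conf"], ocr_data["text"])
        if float(conf) >= 0 and text.strip()
    ]
    return mean(confs) if confs else 0.0


def pick_best(results):
    """Return the name of the variant with the highest mean confidence."""
    return max(results, key=lambda name: mean_confidence(results[name]))


def ocr_with_variants(image_path):
    """Run OCR on a few preprocessing variants and keep the best one."""
    # Imported lazily so the helpers above work without OpenCV/Tesseract.
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)

    # Illustrative variants: untouched, blur + Otsu, and adaptive threshold.
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    _, otsu = cv2.threshold(blurred, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    adaptive = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 15
    )

    variants = {"raw": gray, "otsu": otsu, "adaptive": adaptive}
    results = {
        name: pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
        for name, img in variants.items()
    }
    best = pick_best(results)
    return best, results[best]
```

The trade-off is speed: each extra variant is another full Tesseract pass per invoice.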

ocr-avenger commented 3 years ago

@viktor2097 try this one https://scandocflow.com/pdf-extraction/invoice

viktor2097 commented 3 years ago

> @viktor2097 try this one https://scandocflow.com/pdf-extraction/invoice

I'm not sure how this is relevant. EasyOCR works fine, and I can extract the data fine with Google Cloud Vision. Pytesseract, however, struggles to extract proper data. Since this project uses pytesseract, maybe it's something the owner is interested in solving, because these are pretty standard invoice documents.

LordDraagonLive commented 3 years ago

@viktor2097 That's a great idea. I'm also trying to replace the Pytesseract sections with EasyOCR.

vibeeshan025 commented 3 years ago

> @viktor2097 That's a great idea. I'm also trying to replace the Pytesseract sections with EasyOCR.

@LordDraagonLive Let me know once you're done 🥇 :)

naiveHobo commented 3 years ago

I have been working on integrating EasyOCR into InvoiceNet, basically to provide an option to choose between Tesseract and EasyOCR. However, it looks like EasyOCR doesn't return bounding box coordinates at the word level and instead returns bounding boxes for phrases. The attend-copy-parse framework requires word-level bounding boxes for accurate n-gram generation, and Tesseract provides these by default.

I'm not too familiar with the EasyOCR API so if there's a way to get bounding box information at the word-level, please let me know.
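As a possible workaround for phrase-level boxes, a phrase box can be split into approximate word boxes in proportion to character counts. This is a rough sketch, not an EasyOCR feature: `split_phrase_box` is a hypothetical helper, and it assumes horizontal text with roughly uniform character width, so the resulting boxes are estimates only. The input shape follows what `easyocr.Reader.readtext` returns for each detection: four corner points, the text, and a confidence.

```python
def split_phrase_box(points, text):
    """Split one phrase-level box into approximate word-level boxes.

    points: four [x, y] corners of the phrase box.
    text:   the recognised phrase.
    Returns a list of (word, (left, top, right, bottom)) tuples.
    """
    if not text.strip():
        return []

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)

    # Assumption: every character (spaces included) has the same width.
    char_width = (right - left) / len(text)

    words = []
    cursor = 0  # character offset of the current word within the phrase
    for word in text.split(" "):
        if word:
            x0 = left + cursor * char_width
            x1 = left + (cursor + len(word)) * char_width
            words.append((word, (round(x0), top, round(x1), bottom)))
        cursor += len(word) + 1  # +1 for the space that followed the word
    return words


# Hypothetical usage with EasyOCR (requires `pip install easyocr`):
#   import easyocr
#   reader = easyocr.Reader(["en"])
#   for points, text, confidence in reader.readtext("invoice.png"):
#       print(split_phrase_box(points, text))
```

With proportional fonts or skewed scans the estimated boundaries will drift, so this is best treated as a stopgap for n-gram generation rather than true word-level detection.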

LordDraagonLive commented 3 years ago

@naiveHobo Yeah, I tried that and faced the same issue, and it was extremely slow with EasyOCR. For now I just edited the code to use Azure Computer Vision to improve the OCR part a bit, and it seems to work.

viktor2097 commented 3 years ago

> @naiveHobo yeah I tried that and faced the same issue and it was extremely slow with easyocr. I just edited the code to use Azure computer vision for now to improve the ocr part a bit and it seems to work.

I came to the same conclusion: preparing training data takes way too long with EasyOCR (we're talking hours for 800 invoices) running on a 3090.

I managed to edit the code to use Azure instead too. It would be kind of cool to have an out-of-the-box option to use a third-party OCR engine like Azure or Google Cloud Vision. Right now that's a necessity if you want results reliable enough for production.

naiveHobo commented 3 years ago

> I managed to edit the code to use Azure instead too.

I'd definitely merge it if you open a PR :eyes:

naiveHobo commented 3 years ago

@wbtan7 added support for AWS Textract as an alternate OCR option in #56. To use Textract, install boto3 and set up your AWS credentials according to the documentation!
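For readers curious what word-level output from Textract looks like, here is a minimal sketch. It is not the code from #56: `textract_words` and `detect_words` are hypothetical helpers. It relies on the boto3 `detect_document_text` call and on Textract reporting `WORD` blocks whose `BoundingBox` values are ratios of the page size, so they have to be scaled to pixels.

```python
def textract_words(response, image_width, image_height):
    """Convert a Textract response into (text, (left, top, right, bottom)) tuples.

    Textract bounding boxes are ratios of the page dimensions, so they are
    scaled here to pixel coordinates.
    """
    words = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue  # skip PAGE and LINE blocks
        box = block["Geometry"]["BoundingBox"]
        left = round(box["Left"] * image_width)
        top = round(box["Top"] * image_height)
        right = round((box["Left"] + box["Width"]) * image_width)
        bottom = round((box["Top"] + box["Height"]) * image_height)
        words.append((block["Text"], (left, top, right, bottom)))
    return words


def detect_words(image_path, image_width, image_height):
    """Send one image to Textract; needs boto3 and configured AWS credentials."""
    import boto3  # imported lazily so textract_words works without boto3

    with open(image_path, "rb") as f:
        image_bytes = f.read()

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return textract_words(response, image_width, image_height)
```

Because Textract reports words individually, this output already has the word-level granularity that n-gram generation needs.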