karndeepsingh / ApplicationsBuildWithLLMs

INVOICE PDF #1

Open Anbu8968 opened 6 months ago

Anbu8968 commented 6 months ago

I'm new to using AI, and I'm looking for guidance on how to extract invoice details from PDF files, similar to how it's done for images. Can you provide some suggestions or steps to achieve this? Thanks in advance.

whoatharva commented 1 month ago

PyPDF2 is one way to get text out of a PDF without OCR: it reads a text-based (non-scanned) PDF and extracts the text from each page. That does not work for image-based (scanned) PDFs, because they have no text to extract directly. For those, you can use pdf2image to convert each PDF page to an image and then run OCR (Optical Character Recognition) on the images with pytesseract. Using both approaches together covers scanned and text-based PDFs.
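
A minimal sketch of that approach, in case it helps. It assumes PyPDF2 2.x or later (for `PdfReader`), plus pdf2image and pytesseract, which in turn need the Poppler and Tesseract binaries installed on your system. The helper names (`needs_ocr`, `ocr_page`, `extract_pdf_text`), the 20-character threshold, and the 300 DPI setting are my own illustrative choices, not something from this repo.

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic (an assumption): treat a page as scanned if almost no text came out."""
    return len(page_text.strip()) < min_chars


def ocr_page(pdf_path, page_number, dpi=300):
    """Render one page (1-based page number) to an image and run Tesseract OCR on it."""
    # Imported here so the rest of the script works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return "\n".join(pytesseract.image_to_string(image) for image in images)


def extract_pdf_text(pdf_path, min_chars=20):
    """Try direct text extraction per page; fall back to OCR for pages with no text."""
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    page_texts = []
    for index, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if needs_ocr(text, min_chars):
            # pdf2image numbers pages from 1, PyPDF2 indexes them from 0.
            text = ocr_page(pdf_path, index + 1)
        page_texts.append(text)
    return "\n".join(page_texts)
```

You would call it as `extract_pdf_text("invoice.pdf")` (the file name is a placeholder) and then pass the resulting text to whatever step pulls out the invoice fields. Checking each page separately means a PDF that mixes typed and scanned pages is still handled.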