invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

Process OCR document with custom fields #154

Open ghost opened 6 years ago

m3nu commented 6 years ago

When using the library from Python, you need to load your template folder yourself. By default, only the built-in templates are loaded. The relevant function is read_templates; its result is passed to the extraction function later. See here for an example.

If the steps to set up the Python code aren't mentioned in the README or tutorial yet, I agree they should be added.

chriswakare commented 6 years ago

@Mad-u1 Let me know if the following code snippet helps. "my_template_folder_name" is the custom template folder.

from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

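# Load templates from the custom folder instead of only the built-in ones
read_template = read_templates(folder="my_template_folder_name")
# Pass the loaded templates to the extraction function
result = extract_data('inv001.pdf', read_template)
print(result)

chriswakare commented 6 years ago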

At a glance, the following fields are mandatory in the yml file to read details from your PDF invoice:

amount, date, invoice_number

You can see a list of templates at extract/templates/com/. Pick the one closest to your invoice and modify it.

PS: I noticed that your sample PDF is not an invoice and does not have the mandatory fields of an invoice.

chriswakare commented 6 years ago

As I recall, date has to be a valid date, since the base code checks for one. I never tried amount as a non-decimal datatype, but looking at the output of my sample code, the JSON output for amount is a number.
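To illustrate the point about datatypes, here is a minimal sketch of what "valid date" and "numeric amount" mean in practice. It uses only the standard library and is not the library's own parsing code; the function name coerce_fields, the date format and the sample values are assumptions made for the example.

```python
from datetime import datetime


def coerce_fields(raw):
    """Convert raw string matches into typed values.

    Hypothetical helper: raises ValueError if the date does not match
    the assumed YYYY-MM-DD format or the amount is not numeric.
    """
    return {
        "date": datetime.strptime(raw["date"], "%Y-%m-%d"),
        # Drop thousands separators before converting to a number
        "amount": float(raw["amount"].replace(",", "")),
    }


# Made-up sample values, as they might come out of a regex match
fields = coerce_fields({"date": "2018-07-01", "amount": "1,234.50"})
print(fields["date"].date(), fields["amount"])
```

A string such as "not a date" would raise ValueError here, which is the kind of failure a date check is meant to catch.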

chriswakare commented 6 years ago

Quick questions: Did you create a new template for the shared pdf? Can you share? Did the program give any other errors?

Looking at the shared PDF, it appears (I may be wrong though) that it was created by embedding a scanned image. I am not sure whether the program reads content from an embedded image. Let's see what @m3nu and others have to say.

m3nu commented 6 years ago

Yes, this PDF only has the PO number. If I enter dummy data from the PDF in amount, date and invoice number to satisfy the condition, can I then extract the PO number?

We recently added a setting to specify required fields. If you only want the PO number that should be possible. Maybe there is still a bug left when logging the found field. Will reconfirm that.
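The idea behind the required-fields setting can be sketched as follows. This is only an illustration of the check, not the library's actual code; the helper name missing_required and the sample dictionary are assumptions.

```python
def missing_required(extracted, required_fields):
    """Return the required field names that are absent or empty in `extracted`."""
    return [name for name in required_fields if not extracted.get(name)]


# Hypothetical extraction result for a document that only carries a PO number
extracted = {"issuer": "Walmart", "po_number": "4907455723"}

default_required = ["amount", "date", "invoice_number"]  # the usual mandatory fields
custom_required = ["po_number"]                          # what the template asks for

print(missing_required(extracted, default_required))
print(missing_required(extracted, custom_required))
```

With the default list, all three invoice fields are reported missing. With the custom list, nothing is missing, so a PO-only document can pass.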

Yes, it is not working for embedded images. I have tried with tesseract also, but it seems tesseract works with *.jpg files, not with PDF files.

For images you need to use one of the OCR modules. There is tesseract and also a Google Cloud OCR module (needs their API key). The default pdftotext input module will not work.

m3nu commented 6 years ago

I see. That might be a limitation of the current workflow. Tesseract needs an image as input and we do some enhancements to the image before passing it on. To make it work with an image embedded in a PDF, the image needs to be extracted or PDF converted first. The latter may be most useful.
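Converting the PDF first could look roughly like the sketch below: ImageMagick's convert renders the PDF to a TIFF stream, which is piped into tesseract. This assumes the convert and tesseract binaries are installed (and Ghostscript, which ImageMagick needs for PDFs). The function names and the density value are illustrative, not necessarily what tesseract.py does, and multi-page PDFs may need extra handling.

```python
import subprocess


def build_ocr_commands(pdf_path):
    """Build the two commands of the pipeline: PDF to TIFF, then TIFF to text."""
    # Render at a high density so small print stays legible for OCR
    convert_cmd = ["convert", "-density", "350", pdf_path, "-depth", "8", "tiff:-"]
    # Read the image from stdin and write the recognised text to stdout
    tesseract_cmd = ["tesseract", "stdin", "stdout"]
    return convert_cmd, tesseract_cmd


def pdf_to_text(pdf_path):
    """Run convert | tesseract and return the recognised text."""
    convert_cmd, tesseract_cmd = build_ocr_commands(pdf_path)
    p1 = subprocess.Popen(convert_cmd, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(tesseract_cmd, stdin=p1.stdout, stdout=subprocess.PIPE)
    p1.stdout.close()  # let convert receive SIGPIPE if tesseract exits early
    out, _ = p2.communicate()
    p1.wait()
    return out.decode("utf-8", errors="replace")
```

Calling pdf_to_text("test_1.pdf") would then return the OCR text, provided both tools are on the PATH.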

Leaves this todo list:

chriswakare commented 6 years ago

Found the following link on https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality (have not tried it though):

"Detecting rotation and line spacing of image of page of text using Radon transform" https://gist.github.com/endolith/334196bac1cac45a4893#

m3nu commented 6 years ago

The previous commit by @duskybomb was incomplete and didn't fully resolve required fields when loading templates. That was the first issue you were facing.

After fixing a few things in this commit, this command (run in your test folder)

invoice2data --template-folder . --debug --input-reader tesseract test_1.pdf

returns what you need:

DEBUG:root:{'issuer': 'Walmart', 'po_number': '4907455723', 'currency': 'USD', 'desc': 'Invoice from Walmart'}

The desc field is hardcoded. We may want a way to override it in the template.

The final template I used (your com.walmart.yml):

issuer: Walmart
fields:
  po_number: P0NumberDept.number\n(\d{10})
required_fields:
  - po_number
keywords:
  - Walmart
options:
  currency: USD
  remove_whitespace: true
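To see why the po_number pattern contains no spaces, here is a small standalone check with Python's re module. It assumes that remove_whitespace strips spaces but keeps newlines, which the \n in the pattern suggests. The OCR text below is a made-up sample, not the real output for test_1.pdf.

```python
import re

# The po_number pattern from the template above
pattern = r"P0NumberDept.number\n(\d{10})"

# Hypothetical OCR output; the real text from the scanned PDF will differ
ocr_text = "Walmart\nP0 Number Dept. number\n4907455723\n"

# Assumed effect of remove_whitespace: drop spaces, keep line breaks
stripped = re.sub(r" +", "", ocr_text)

match = re.search(pattern, stripped)
po_number = match.group(1) if match else None
print(po_number)  # prints 4907455723
```

Against the unstripped text the same pattern finds nothing, because of the spaces in "P0 Number Dept. number".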

Other notes:

Hope this helps. I pushed some fixes for this to the invoice2data package, to fix PR #113. Be sure to update before trying this. The changes will be in the repo in 30 minutes or so.

m3nu commented 6 years ago

Found the following link on https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

@chriswakare Improving OCR input is not fully solved yet. Feel free to experiment and make a PR to improve the command used in tesseract.py. It uses imagemagick for now. This would also be the place to convert PDFs to images.

m3nu commented 6 years ago

PS: Directly inputting a PDF works now. Small change to the convert-command. Updated my earlier comment.