Open ghost opened 6 years ago
@Mad-u1 Let me know if the following code snippet helps. "my_template_folder_name" is the custom template folder.
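```python
from invoice2data import extract_data
from invoice2data.extract.loader import read_templates

# Load the templates from the custom folder
read_template = read_templates(folder="my_template_folder_name")
# Run the extraction on the invoice with those templates
result = extract_data('inv001.pdf', read_template)
print(result)
```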
At a glance, the following fields are required in the yml file to read details from your PDF invoice: amount, date, invoice_number.
You can see a list of templates at extract/templates/com/. Pick the one closest to your invoice and modify it.
PS: I noticed that your sample PDF is not an invoice and does not have the mandatory fields of an invoice.
As I recall, date has to be a valid date, because the base code checks for one. I never tried amount as a non-decimal datatype, but in the output of my sample code the JSON value for amount is a number.
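To illustrate why dummy values are likely to fail for these two fields, here is a minimal sketch using only the standard library. The helper names and the date format are my own assumptions, and the library's actual parsing may well differ.

```python
from datetime import datetime
from typing import Optional


def parse_date(raw: str, fmt: str = "%Y-%m-%d") -> Optional[datetime]:
    """Return a datetime if raw matches fmt, otherwise None (hypothetical helper)."""
    try:
        return datetime.strptime(raw, fmt)
    except ValueError:
        return None


def parse_amount(raw: str) -> Optional[float]:
    """Return the amount as a float, ignoring thousands separators, otherwise None."""
    try:
        return float(raw.replace(",", ""))
    except ValueError:
        return None


print(parse_date("2018-05-12"))   # a valid date parses
print(parse_date("dummy"))        # a dummy string does not
print(parse_amount("1,234.50"))   # a decimal amount parses
print(parse_amount("n/a"))        # a non-numeric string does not
```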
Quick questions: Did you create a new template for the shared pdf? Can you share? Did the program give any other errors?
Looking at the shared PDF, it appears (I may be wrong) that it was created by embedding a scanned image. I'm not sure whether the program reads content from an embedded image. Let's see what @m3nu and others have to say.
Yes, this PDF only has the PO number. If I enter dummy data from the PDF for amount, date and invoice_number to satisfy the condition, can I then extract the PO number?
We recently added a setting to specify required fields. If you only want the PO number that should be possible. Maybe there is still a bug left when logging the found field. Will reconfirm that.
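A rough way to picture the required-fields setting: the three default fields named earlier in this thread apply unless the template lists its own. This is only a sketch with a hypothetical helper, not the library's code.

```python
# Default fields named earlier in this thread; treated here as an assumption.
DEFAULT_REQUIRED = ("amount", "date", "invoice_number")


def missing_fields(template: dict, extracted: dict) -> list:
    """Return the required fields that are absent from the extracted output."""
    required = template.get("required_fields") or DEFAULT_REQUIRED
    return [name for name in required if name not in extracted]


po_only_template = {"issuer": "Walmart", "required_fields": ["po_number"]}
extracted = {"issuer": "Walmart", "po_number": "4907455723"}

# With required_fields set, only the PO number is needed.
print(missing_fields(po_only_template, extracted))
# Without it, the defaults apply and all three are reported missing.
print(missing_fields({"issuer": "Walmart"}, extracted))
```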
Yes, it is not working for embedded images. I have also tried tesseract, but it seems tesseract works with *.jpg files, not with PDF files.
For images you need to use one of the OCR modules. There is tesseract and also a Google Cloud OCR module (needs their API key). The default pdftotext input module will not work.
I see. That might be a limitation of the current workflow. Tesseract needs an image as input and we do some enhancements to the image before passing it on. To make it work with an image embedded in a PDF, the image needs to be extracted or PDF converted first. The latter may be most useful.
Leaves this todo list:
Found the following link on https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality Have not tried it though.
"Detecting rotation and line spacing of image of page of text using Radon transform " https://gist.github.com/endolith/334196bac1cac45a4893#
The previous commit by @duskybomb was incomplete and didn't fully resolve required fields when loading templates. That was the first issue you were facing.
After fixing a few things in this commit, this command (run in your test folder)

```
invoice2data --template-folder . --debug --input-reader tesseract test_1.pdf
```

returns what you need:

```
DEBUG:root:{'issuer': 'Walmart', 'po_number': '4907455723', 'currency': 'USD', 'desc': 'Invoice from Walmart'}
```
The desc field is hardcoded. We may want a way to override it in the template.
The final template I used (your com.walmart.yml):
```yaml
issuer: Walmart
fields:
  po_number: P0NumberDept.number\n(\d{10})
required_fields:
  - po_number
keywords:
  - Walmart
options:
  currency: USD
  remove_whitespace: true
```
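To see how the po_number pattern behaves, here is a small sketch with Python's re module. The OCR text is made up for illustration, and I am assuming that remove_whitespace strips spaces but keeps line breaks; check the actual text with --debug.

```python
import re

# Hypothetical OCR output; the real text Tesseract returns will differ.
ocr_text = "P0 Number Dept. number\n4907455723\nWalmart"

# Assumption: remove_whitespace strips spaces but keeps line breaks.
compact = re.sub(r" +", "", ocr_text)

# Same pattern as in the template: label line, newline, then ten digits.
match = re.search(r"P0NumberDept.number\n(\d{10})", compact)
po_number = match.group(1) if match else None
print(po_number)  # 4907455723
```

Note that the pattern spells the label with a zero (P0), so it will only match if the OCR text contains it that way.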
Other notes:
- The keyword Walmart isn't specific enough. If you only use that one template it will be OK, though. Just be sure to exclude all built-in templates (--exclude-built-in-templates).
- Use the --debug param to debug the template. It will show you the actual text Tesseract returns.

Hope this helps. I pushed some fixes to the invoice2data package for this to fix PR #113. Be sure to update before trying this. Will be in the repo in 30 min or so.
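On the keyword point, here is a simplified model of why a single generic keyword can be ambiguous. It assumes a template counts as a match when every one of its keywords appears in the text; the second template and the sample text are hypothetical.

```python
def matching_templates(text: str, templates: list) -> list:
    """Return the issuers of all templates whose keywords all occur in text."""
    return [
        tpl["issuer"]
        for tpl in templates
        if all(keyword in text for keyword in tpl["keywords"])
    ]


templates = [
    {"issuer": "Walmart", "keywords": ["Walmart"]},
    # Hypothetical second template, for illustration only.
    {"issuer": "Hypothetical Reseller", "keywords": ["Reseller Inc", "Walmart"]},
]

text = "Reseller Inc invoice for goods shipped to Walmart"
print(matching_templates(text, templates))  # both templates match
```

With only one template loaded there is nothing to confuse it with, which is why excluding the built-in templates sidesteps the problem.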
> Found the following link on https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
@chriswakare improving OCR input is not fully solved. Feel free to experiment and make a PR to improve the command used in tesseract.py. This uses ImageMagick for now. This would also be the place to convert PDFs to images.
PS: Directly inputting a PDF works now. Small change to the convert-command. Updated my earlier comment.
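For anyone who wants to experiment, the convert-then-OCR idea can be sketched roughly as below. This is not the code in tesseract.py: the function names are hypothetical, the density and depth values are assumptions, and it expects the ImageMagick convert and tesseract binaries on the PATH and a single-page PDF.

```python
import subprocess


def build_commands(pdf_path: str):
    """Build the ImageMagick and Tesseract command lines (nothing is executed here)."""
    # Rasterize the PDF to a PNG on stdout; density and depth are assumed values.
    convert_cmd = ["convert", "-density", "350", pdf_path, "-depth", "8", "png:-"]
    # Read the image from stdin and write the recognized text to stdout.
    tesseract_cmd = ["tesseract", "stdin", "stdout"]
    return convert_cmd, tesseract_cmd


def ocr_pdf(pdf_path: str) -> str:
    """Convert the PDF to an image, pipe it into Tesseract and return the text."""
    convert_cmd, tesseract_cmd = build_commands(pdf_path)
    image = subprocess.run(convert_cmd, check=True, capture_output=True).stdout
    result = subprocess.run(tesseract_cmd, input=image, check=True, capture_output=True)
    return result.stdout.decode("utf-8")


# Usage (requires both tools to be installed):
# print(ocr_pdf("test_1.pdf"))
```

Keeping the command construction separate from the execution makes it easier to try other convert options without touching the pipeline.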
When using the library in Python, you need to load the plugin folder. By default it will load the built-in plugins only. The relevant function is read_templates, whose result is passed to the extraction function later. See here for an example.

If the steps to set up the Python code aren't mentioned in the README or tutorial yet, I agree they should be added.