invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License
1.79k stars 476 forks

Return extracted_str if no templates found with extract_data() (python) ? #392

Open Whaoo opened 2 years ago

Whaoo commented 2 years ago

Hi,

Is there a way to return extracted_str (the full PDF text as a str) when no template is found for the PDF?

I saw in main.py that, in debug mode, extracted_str is exactly what I want to collect. Getting it back would save me time compared with calling pdf2text again and storing the result myself.

Could extract_data() return it when no template is found for the .pdf?

Many thanks

bosd commented 1 year ago

Is this what you are looking for, or something you could take inspiration from? I didn't test this.

https://github.com/OCA/edi/pull/399/files#diff-652ac3ae132c668bf2ac61903174bbc0c254c98bf549aac7cad47a515259ed32R70-R128

rmilecki commented 1 year ago

Maybe we could make invoice2data more object-oriented?

# Use static method
templates = Invoice2Data.read_templates("templates/")

i2d = Invoice2Data()
try:
    i2d.extract_data("foo.pdf", templates=templates)
except Exception as e:
    print('Failed to extract data: ' + str(e))
    print('Extracted text: ' + i2d.get_extracted_text())

legalsylvain commented 1 year ago

Hi @rmilecki, I'm looking for a way to get the details of the parsing error (no template found, missing required field, ...). For the time being, that information is in the log, but it is not accessible when using invoice2data as a library.

What I don't understand in your code is that, AFAIK, extract_data doesn't raise an error. Or did I miss something?

try:
    i2d.extract_data("foo.pdf", templates=templates)
except Exception as e:
    print('Failed to extract data: ' + str(e))
    print('Extracted text: ' + i2d.get_extracted_text())
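
Until something like the object-oriented API above exists, one possible workaround is to wrap the input reader so that the text it produces is remembered and can be read back even when no template matches. The sketch below is only an illustration: RecordingInput and extract_with_text are hypothetical names, not part of invoice2data, and it assumes that the extraction function accepts an input_module keyword and calls input_module.to_text(path) to obtain the full text, which may not hold for every version.

```python
class RecordingInput:
    """Hypothetical wrapper: delegates to an object exposing to_text()
    and remembers the first full-document text it returned."""

    def __init__(self, inner):
        self.inner = inner
        self.full_text = None

    def to_text(self, path, *args, **kwargs):
        text = self.inner.to_text(path, *args, **kwargs)
        # Only a plain to_text(path) call is treated as the full document;
        # calls with extra arguments (e.g. a sub-area) are passed through.
        if self.full_text is None and not args and not kwargs:
            self.full_text = text
        return text


def extract_with_text(path, extract, reader, **kwargs):
    """Run `extract` with a recording reader; return (result, full_text).

    Assumption: `extract` takes the file path plus an `input_module`
    keyword and reads the text through input_module.to_text(path).
    """
    recorder = RecordingInput(reader)
    result = extract(path, input_module=recorder, **kwargs)
    return result, recorder.full_text
```

With invoice2data this might be called as extract_with_text("foo.pdf", extract_data, pdftotext, templates=templates), assuming extract_data and the pdftotext input module behave as described above. If the result is falsy because no template matched, the second element would still hold the extracted text, so the PDF does not have to be converted a second time.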

Whaoo commented 1 year ago

Hi guys, nice to see my question interests other people. I will try using what you pushed, @rmilecki :)