invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

Date Format not being applied #521

Closed vapinv closed 1 year ago

vapinv commented 1 year ago

I'm using a custom template to extract data from an invoice. The template finds the correct date, but then the local timezone is detected, and the parser appears to replace the regex match and fails to format it.

I'm new to coding, so I could very well have set date_formats incorrectly, but everything I find online and in the template folders seems to indicate it is correct. I tried adding the parser and type to the date field, but it still didn't work.

I haven't set up a Python script yet; for now I'm just trying to make sure everything works by running it from a bash terminal first. I've checked a ton of documentation on dateutils, utils, pyty, tz, and dateparser, and I'm no closer to solving this on my own. Any assistance with fixing this would be greatly appreciated.

My template:

issuer: Rents
keywords:
- Rents Stuff LLC
exclude_keywords:
- STATEMENT
- RENTAL\s+INVOICE
fields:
  amount: Total:\s+(\d+.\d+\.\d+)
  invoice_number: Invoice\s+Num\s+(SI\-\d+)
  date:
    parser: regex
    regex: Invoice\s+Date:\s+(\d{2}\/\d{2}\/\d{2})
    type: date
options:
  currency: USD
  date_formats:
    - '%m-%d-%y'

Results of debugging:

DEBUG:invoice2data.extract.parsers.regex: field=date | regex=Invoice\s+Date:\s+(\d{2}\/\d{2}\/\d{2}) | matches=['08/07/23']
DEBUG:tzlocal: /etc/timezone found, contents:
    America/Los_Angeles

DEBUG:tzlocal: /etc/localtime found
DEBUG:tzlocal: 2 found:
    {'/etc/timezone': 'America/Los_Angeles', '/etc/localtime is a symlink to': 'America/Los_Angeles'}
DEBUG:invoice2data.extract.invoice_template: result of date parsing=2023-08-07 00:00:00
DEBUG:invoice2data.extract.invoice_template: 
    { 'amount': 1978.23,
       'currency': 'USD',
       'date': datetime.datetime(2023, 8, 7, 0, 0),
       'desc': 'Invoice from Rents',
       'invoice_number': 'SI-68749',
       'issuer': 'Rents'}
INFO:root: {'issuer': 'Rents', 'amount': 1978.23, 'invoice_number': 'SI-68749', 'date': datetime.datetime(2023, 8, 7, 0, 0), 'currency': 'USD', 'desc': 'Invoice from Rents'}

rmilecki commented 1 year ago

Your invoice contains the date 08/07/23: three values separated by /.
You are telling the parser to try the format %m-%d-%y, which consists of three values separated by -.

That hint can't be used, because the suggested format doesn't match the format of the parsed value. Fix the separator in the suggested format so that it matches the separator used in the actual invoice.
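
The separator mismatch described above can be reproduced outside invoice2data. The sketch below uses Python's standard `datetime.strptime` as a stand-in for the date parser; that substitution is an assumption made for illustration, and `try_parse` is a hypothetical helper, not part of invoice2data or dateparser.

```python
from datetime import datetime

# Value captured by the date regex in the debug log above.
raw = "08/07/23"


def try_parse(value, fmt):
    """Hypothetical helper: return the parsed datetime, or None if fmt does not match."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


# Format with dashes: the separators differ from the invoice text, so no match.
dash_result = try_parse(raw, "%m-%d-%y")

# Format with slashes: the separators match the invoice text.
slash_result = try_parse(raw, "%m/%d/%y")

print(dash_result)   # None
print(slash_result)  # 2023-08-07 00:00:00
```

With strict matching, only the format string whose separators agree with the extracted text yields a date.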

vapinv commented 1 year ago

Thank you for your assistance. Changing the hint does not fix the issue. I tried both '%m/%d/%y' and '%D' as seen below:

DEBUG:invoice2data.extract.invoice_template: END optimized_str ==========================
DEBUG:invoice2data.extract.invoice_template: Date parsing: languages=[] date_formats=['%m/%d/%y']
DEBUG:invoice2data.extract.invoice_template: Float parsing: decimal separator=[.]
DEBUG:invoice2data.extract.invoice_template: keywords=['Rents Stuff LLC']
DEBUG:invoice2data.extract.invoice_template: {'remove_whitespace': False, 'remove_accents': False, 'lowercase': False, 'currency': 'USD', 'date_formats': ['%m/%d/%y'], 'languages': [], 'decimal_separator': '.', 'replace': []}
DEBUG:invoice2data.extract.parsers.regex: field=amount | regex=Total:\s+(\d+.\d+\.\d+) | matches=['1,978.23']
DEBUG:invoice2data.extract.parsers.regex: field=invoice_number | regex=Invoice\s+Num\s+(SI\-\d+) | matches=['SI-68749']
DEBUG:invoice2data.extract.parsers.regex: field=date | regex=Invoice\s+Date:\s+(\d{2}\/\d{2}\/\d{2}) | matches=['08/07/23']
DEBUG:tzlocal: /etc/timezone found, contents:
 America/Los_Angeles

DEBUG:tzlocal: /etc/localtime found
DEBUG:tzlocal: 2 found:
 {'/etc/timezone': 'America/Los_Angeles', '/etc/localtime is a symlink to': 'America/Los_Angeles'}
DEBUG:invoice2data.extract.invoice_template: result of date parsing=2023-08-07 00:00:00
DEBUG:invoice2data.extract.invoice_template: 
 { 'amount': 1978.23,
  'currency': 'USD',
  'date': datetime.datetime(2023, 8, 7, 0, 0),
  'desc': 'Invoice from Rents',
  'invoice_number': 'SI-68749',
  'issuer': 'Rents'}
INFO:root: {'issuer': 'Rents', 'amount': 1978.23, 'invoice_number': 'SI-68749', 'date': datetime.datetime(2023, 8, 7, 0, 0), 'currency': 'USD', 'desc': 'Invoice from Rents'}

and


DEBUG:invoice2data.extract.invoice_template: Date parsing: languages=[] date_formats=['%D']
DEBUG:invoice2data.extract.invoice_template: Float parsing: decimal separator=[.]
DEBUG:invoice2data.extract.invoice_template: keywords=['Rents Stuff LLC']
DEBUG:invoice2data.extract.invoice_template: {'remove_whitespace': False, 'remove_accents': False, 'lowercase': False, 'currency': 'USD', 'date_formats': ['%D'], 'languages': [], 'decimal_separator': '.', 'replace': []}
DEBUG:invoice2data.extract.parsers.regex: field=amount | regex=Total:\s+(\d+.\d+\.\d+) | matches=['1,978.23']
DEBUG:invoice2data.extract.parsers.regex: field=invoice_number | regex=Invoice\s+Num\s+(SI\-\d+) | matches=['SI-68749']
DEBUG:invoice2data.extract.parsers.regex: field=date | regex=Invoice\s+Date:\s+(\d{2}\/\d{2}\/\d{2}) | matches=['08/07/23']
DEBUG:tzlocal: /etc/timezone found, contents:
 America/Los_Angeles

DEBUG:tzlocal: /etc/localtime found
DEBUG:tzlocal: 2 found:
 {'/etc/timezone': 'America/Los_Angeles', '/etc/localtime is a symlink to': 'America/Los_Angeles'}
DEBUG:invoice2data.extract.invoice_template: result of date parsing=2023-08-07 00:00:00
DEBUG:invoice2data.extract.invoice_template: 
 { 'amount': 1978.23,
  'currency': 'USD',
  'date': datetime.datetime(2023, 8, 7, 0, 0),
  'desc': 'Invoice from Rents',
  'invoice_number': 'SI-68749',
  'issuer': 'Rents'}
INFO:root: {'issuer': 'Rents', 'amount': 1978.23, 'invoice_number': 'SI-68749', 'date': datetime.datetime(2023, 8, 7, 0, 0), 'currency': 'USD', 'desc': 'Invoice from Rents'}
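
For reference, every run logged in this thread ends with the same parsed value, `datetime.datetime(2023, 8, 7, 0, 0)`. A minimal sketch of rendering such a value as a string with the standard library's `strftime`, assuming the aim is a specific output string rather than a datetime object (the variable names are illustrative and not part of invoice2data):

```python
from datetime import datetime

# The parsed value reported in the debug output.
parsed = datetime(2023, 8, 7, 0, 0)

# Render the same date in two different string layouts.
as_us_short = parsed.strftime("%m/%d/%y")
as_iso = parsed.strftime("%Y-%m-%d")

print(as_us_short)  # 08/07/23
print(as_iso)       # 2023-08-07
```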