invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License
1.84k stars 482 forks

Data Sanitization after match #497

Open bosd opened 1 year ago

bosd commented 1 year ago

I would like to propose a data cleansing / sanitization step after matching, as suggested in: https://github.com/invoice-x/invoice2data/issues/106#issuecomment-435098612

Use Case:

I would like to match a Netherlands VAT number. Format: 'NL' + 9 digits + 'B' + 2-digit company index, e.g. NL999999999B01. This translates to:

  vat:
    parser: regex
    regex: (NL\d{9}B\d{2})\s

Input string from the OCR'd PDF: VAT NUMBER NL.999,999.999,B01

We get the data, but it includes . and , characters, so the regex above won't match :disappointed:

Capturing something like that would need:

  vat:
    parser: regex
    regex: (NL.\d{3}.\d{3}.\d{3}.B\d{2})\s

or maybe use multiple capturing groups, without the . and , and use group: join
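To make the difference concrete, here is a minimal sketch that runs the two patterns above against the OCR output using Python's `re` module directly (not invoice2data itself). The trailing newline in `text` is an assumption, added so the `\s` at the end of each pattern has something to match.

```python
import re

# OCR output from the example; the trailing newline is an assumption.
text = "VAT NUMBER NL.999,999.999,B01\n"

# Strict pattern: expects nine consecutive digits after "NL".
strict = re.compile(r"(NL\d{9}B\d{2})\s")
# Tolerant pattern: allows one arbitrary character between the groups.
tolerant = re.compile(r"(NL.\d{3}.\d{3}.\d{3}.B\d{2})\s")

strict_match = strict.search(text)
tolerant_match = tolerant.search(text)

print(strict_match)             # None: the separators break \d{9}
print(tolerant_match.group(1))  # NL.999,999.999,B01
```

The tolerant pattern matches, but the captured value still carries the separators, which is why a cleanup step is needed either way.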

As writing templates is very hard, I would prefer to make it as easy as possible. The ideal regex template for the input string is: regex: VAT NUMBER\s+(\S+)

results in vat: ['NL.999,999.999,B01']

and then have a sanitization function to strip out the unwanted characters. As we know the VAT number should only contain letters and digits, we can replace everything else: re.sub(r'\W+', '', vat) results in vat: ['NL999999999B01']
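The two steps (loose capture, then cleanup) can be sketched in plain Python as follows. This only illustrates the idea with the `re` module; it is not how invoice2data currently processes a field, and the trailing newline in `text` is an assumption.

```python
import re

# OCR output from the example; the trailing newline is an assumption.
text = "VAT NUMBER NL.999,999.999,B01\n"

# Step 1: capture whatever non-whitespace token follows the label.
captured = re.findall(r"VAT NUMBER\s+(\S+)", text)

# Step 2: strip every run of non-word characters from each captured value.
vat = [re.sub(r"\W+", "", value) for value in captured]

print(captured)  # ['NL.999,999.999,B01']
print(vat)       # ['NL999999999B01']
```

Note that `\W` treats the underscore as a word character, so an underscore in the OCR output would survive; a stricter pattern such as `[^A-Za-z0-9]+` would remove it too.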

What would be the best way to implement this in code?

fields:
  vat:
    parser: regex
    regex: (NL\d{9}B\d{2})\s
    type: str
    # 1. Make replace function available on field level
    replace: ['\W+', '']
    # 2. Make a new sanitize option
    sanitize: any_word_character

Option 1 is still not easy to include in a template, but it is very powerful and flexible. Option 2 is easier to include in the template.
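One possible shape for the code side, as a rough sketch only: the `replace` and `sanitize` keys are the ones proposed above and do not exist in invoice2data today, and `SANITIZERS` and `clean_value` are hypothetical names invented for this illustration.

```python
import re

# Hypothetical named presets for option 2 (not an existing invoice2data feature).
SANITIZERS = {
    "any_word_character": (r"\W+", ""),
}


def clean_value(value, settings):
    """Apply the proposed `replace` / `sanitize` options to one matched string."""
    # Option 1: an explicit [pattern, replacement] pair from the template.
    if "replace" in settings:
        pattern, repl = settings["replace"]
        value = re.sub(pattern, repl, value)
    # Option 2: a named preset that expands to a pattern/replacement pair.
    if "sanitize" in settings:
        pattern, repl = SANITIZERS[settings["sanitize"]]
        value = re.sub(pattern, repl, value)
    return value


option_1 = clean_value("NL.999,999.999,B01", {"replace": [r"\W+", ""]})
option_2 = clean_value("NL.999,999.999,B01", {"sanitize": "any_word_character"})

print(option_1)  # NL999999999B01
print(option_2)  # NL999999999B01
```

In this sketch option 2 is just a lookup on top of option 1, so both could share one code path and templates could mix them.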

Jopie01 commented 9 months ago

This would be a very cool feature! Please also add it to the different plugins like 'tables' and 'lines'. Because suppliers use different names for units, I want to be able to replace their units with the names I use in my system. This also means that you need a list of possible replacements. Below is part of the Farnell template with the replace option added:


lines:
    start: 'Lijn Nr'
    end: BELANGRIJK
    first_line: '\d+\s+(?P<code>\d{7})\s+(?P<uom>\w+)\s+(?P<qty>\d+)\s+(?P<price_unit>\d+[.]\d{2,4})\s+(?P<netto_price>\d+[.]\d{2,4})\s+(?P<btw_percent>\d+[.]\d{2})\s+(?P<price_subtotal>\d+[.]\d{2})'
    line: '^\s{9,11}(?P<name>(\S+(?:\s\S+)*))\s+'
    last_line: '\s+(?P<name>(Tariff Code[:]\s+\d+))'
    replace:
      - uom:
          - ['PS', 'unit']  # should be regex
          - ['M', 'meter']
          -  .....
    types:
      qty: float
      price_unit: float
      price_subtotal: float
      netto_price: float
      btw_percent: float
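A per-field replacement list like the one above could be applied to each parsed line roughly as follows. This is a sketch under assumptions: `replace_rules` and `apply_replacements` are hypothetical names, the parsed line is assumed to be a plain dict of strings, and only the two unit rules shown in the template are included.

```python
import re

# Mirrors the proposed `replace` section; the structure is hypothetical.
# Patterns are anchored so that 'M' does not rewrite part of a longer unit name.
replace_rules = {
    "uom": [
        (r"^PS$", "unit"),
        (r"^M$", "meter"),
    ],
}


def apply_replacements(line, rules):
    """Return a copy of one parsed line with the per-field rules applied."""
    cleaned = dict(line)
    for field, pairs in rules.items():
        if field not in cleaned:
            continue
        for pattern, repl in pairs:
            new_value, count = re.subn(pattern, repl, cleaned[field])
            if count:
                cleaned[field] = new_value
                break  # the first matching rule wins
    return cleaned


line = {"code": "1234567", "uom": "M", "qty": "5"}
result = apply_replacements(line, replace_rules)
print(result["uom"])  # meter
```

Stopping at the first matching rule avoids one replacement feeding into the next, and values without a matching rule pass through unchanged.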