Open bosd opened 1 year ago
This would be a very cool feature! Please also add it to the different plugins like 'tables' and 'lines'. Because suppliers use different names for units, I want to be able to replace their units with the names I use in my system. This also means that you need a list of possible replacements. Below is part of the Farnell template with the replace section added:
```yaml
lines:
  start: 'Lijn Nr'
  end: BELANGRIJK
  first_line: '\d+\s+(?P<code>\d{7})\s+(?P<uom>\w+)\s+(?P<qty>\d+)\s+(?P<price_unit>\d+[.]\d{2,4})\s+(?P<netto_price>\d+[.]\d{2,4})\s+(?P<btw_percent>\d+[.]\d{2})\s+(?P<price_subtotal>\d+[.]\d{2})'
  line: '^\s{9,11}(?P<name>(\S+(?:\s\S+)*))\s+'
  last_line: '\s+(?P<name>(Tariff Code[:]\s+\d+))'
  replace:
    - uom:
      - ['PS', 'unit'] # should be regex
      - ['M', 'meter']
      - .....
  types:
    qty: float
    price_unit: float
    price_subtotal: float
    netto_price: float
    btw_percent: float
```
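To make the idea concrete, here is a minimal Python sketch of how such per-field replacements could be applied to one parsed line. The helper name `apply_replacements` and the rule layout are hypothetical and not part of invoice2data; treating each pattern as a full-match regex is also an assumption.

```python
import re

# Hypothetical helper, not an existing invoice2data function.
# rules maps a field name to a list of (regex pattern, replacement) pairs.
def apply_replacements(line, rules):
    cleaned = dict(line)  # leave the caller's dict untouched
    for field, pairs in rules.items():
        if field not in cleaned:
            continue
        value = str(cleaned[field])
        for pattern, replacement in pairs:
            # Assumption: a rule must match the whole value, so that 'M'
            # does not rewrite part of a longer unit name.
            if re.fullmatch(pattern, value):
                value = replacement
                break
        cleaned[field] = value
    return cleaned

rules = {"uom": [("PS", "unit"), ("M", "meter")]}
line = {"code": "1234567", "uom": "PS", "qty": "10"}
print(apply_replacements(line, rules))
```

Fields without a matching rule are passed through unchanged, so a supplier-specific unit that is missing from the list stays visible instead of being silently dropped.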
I would like to propose a data cleansing / sanitization step after matching, as commented in: https://github.com/invoice-x/invoice2data/issues/106#issuecomment-435098612
Use Case:
I would like to match a Dutch VAT number. Format: 'NL' + 9 digits + 'B' + 2-digit company index, e.g. NL999999999B01. Which translates to:
Input string from OCR'd pdf:
VAT NUMBER NL.999,999.999,B01
We get the data, but it includes `.` and `,`.
So the previously mentioned regex won't match :disappointed:
Capturing something like that would need:
or maybe use multiple capturing groups, without the `.` and `,`, and use `group: join`.
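The problem and the multiple-group workaround can be sketched in plain Python. Both patterns below are assumptions written for illustration (the exact regexes from the original post are not reproduced here), and `"".join` is only a stand-in for whatever joining the template option performs.

```python
import re

ocr_text = "VAT NUMBER NL.999,999.999,B01"

# Assumed strict pattern for the clean format: NL + 9 digits + B + 2 digits.
strict = re.compile(r"NL\d{9}B\d{2}")
print(strict.search(ocr_text))  # no match on the OCR'd string

# Assumed tolerant pattern: capture the meaningful parts as separate groups
# and allow an optional stray '.' or ',' between them.
tolerant = re.compile(r"(NL)[.,]?(\d{3})[.,]?(\d{3})[.,]?(\d{3})[.,]?(B\d{2})")
match = tolerant.search(ocr_text)
vat = "".join(match.groups()) if match else None
print(vat)
```

This works, but the tolerant pattern is much harder to write and read than the strict one, which is the motivation for a separate clean-up step.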
As writing templates is very hard, I would prefer to make it as easy as possible. The ideal regex template for the input string is:
regex: VAT NUMBER\s+(\S+)
results in vat:
['NL.999,999.999,B01']
and then have a sanitization function to strip out the unwanted characters. As we know the VAT number should only contain letters and digits, we can replace everything else.
re.sub(r'\W+', '', vat)
results in vat: ['NL999999999B01']
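Put together, the simple regex plus a clean-up step could look like the sketch below. The function name `sanitize` and its parameters are hypothetical, not an existing invoice2data API. Since the match result is a list, the substitution is applied to each element; calling `re.sub` directly on the list would raise a `TypeError`.

```python
import re

# Hypothetical post-processing step; not an existing invoice2data function.
def sanitize(values, pattern=r"\W+", replacement=""):
    # Accept a single string or a list of matched strings.
    if isinstance(values, str):
        values = [values]
    return [re.sub(pattern, replacement, value) for value in values]

matches = re.findall(r"VAT NUMBER\s+(\S+)", "VAT NUMBER NL.999,999.999,B01")
print(matches)            # ['NL.999,999.999,B01']
print(sanitize(matches))  # ['NL999999999B01']
```

Note that `\W` leaves underscores in place, so a stricter pattern such as `[^A-Za-z0-9]+` may be preferable if OCR output can contain them.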
What would be the best way to implement this in code?
Option 1: is still not easy to include in a template, but it is very powerful and flexible. Option 2: is easier to include in the template.