invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

Additional option "Line break" #91

Open AvatarSenju opened 6 years ago

AvatarSenju commented 6 years ago

After using invoice2data and creating templates, I ran into complications with invoices that have a lot of whitespace and important data on the same line. This can make writing regexes after remove_whitespace: even more difficult. So I am suggesting another option, to be applied before remove_whitespace, that inserts a line break after 2-3 consecutive whitespace characters. This could make it easier for other contributors to create new templates.

PS: I am opening this issue because I could not find anything related to line breaks in the previous issues or in tutorial.md.
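To make the proposal concrete, here is a minimal sketch of what such a preprocessing step might look like. It is not part of invoice2data: the function name break_on_wide_gaps and the min_spaces parameter are hypothetical, and the sample text is made up for illustration.

```python
import re

def break_on_wide_gaps(text: str, min_spaces: int = 3) -> str:
    """Hypothetical helper: turn each run of at least `min_spaces`
    spaces or tabs into a single line break."""
    gap = re.compile(r"[ \t]{%d,}" % min_spaces)
    # Strip each line first so leading/trailing padding does not
    # produce empty lines, then split the line at every wide gap.
    return "\n".join(gap.sub("\n", line.strip()) for line in text.splitlines())

# Made-up sample: three fields on one line, separated by wide gaps.
sample = "Invoice No: 12345      Date: 2019-01-31     Total: 99.50"
result = break_on_wide_gaps(sample)
print(result)
```

With this sample, each field ends up on its own line, while the single spaces inside a field (such as "Invoice No: 12345") are left untouched.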

m3nu commented 6 years ago

So you want to insert line breaks when there are too many spaces? Not sure how this helps. Any examples?

rseabrook commented 5 years ago

remove_whitespace is most helpful when your invoice data has a very specific format such as quantity or price or strings of defined length such as order numbers. It becomes difficult when you have free text fields (addresses, names, descriptions) that do not have a defined length or format. It makes more sense to not use remove_whitespace in those circumstances.

If you want to match multiple consecutive white space characters, you can use \s+ or \s{2,}.
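As a rough illustration of the \s{2,} approach, the sketch below uses Python's re module directly on a made-up line, without removing whitespace first. The sample text and field names are assumptions for the example and do not come from any real template.

```python
import re

# Made-up line: a free-text name followed by a wide gap and an order number.
line = "Customer: Jane A. Doe      Order No: AB-1234"

# (.+?) captures the free-text field lazily; \s{2,} requires a gap of at
# least two whitespace characters, so the single spaces inside the name
# do not end the capture early.
pattern = re.compile(r"Customer:\s+(.+?)\s{2,}Order No:\s+(\S+)")

match = pattern.search(line)
customer, order_no = match.groups()
print(customer)
print(order_no)
```

Here the wide gap itself acts as the field delimiter, so the free-text value keeps its internal spaces.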