invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License
1.8k stars 476 forks

The need for templates #44

Closed courtenayparserr closed 6 years ago

courtenayparserr commented 7 years ago

I haven't played with invoice2data yet, but I wondered whether you have to have templates?

Is there any way one could maintain a DB of possible terms for each distinguishable entity, e.g. invoice number = "invoice no"?

That way it could test all the terms until it found a match.
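For readers who want to see the idea concretely, here is a minimal sketch of such a term lookup. It is not part of invoice2data; `FIELD_ALIASES` and `find_by_alias` are hypothetical names, and the sketch assumes the value is always the single token that directly follows the label.

```python
import re

# Hypothetical alias table: each field maps to labels that may precede its value.
FIELD_ALIASES = {
    "invoice_number": ["invoice number", "invoice no", "inv no"],
    "date": ["invoice date", "date"],
}

def find_by_alias(text, aliases=FIELD_ALIASES):
    """Return, per field, the first token that follows any known label."""
    found = {}
    for field, labels in aliases.items():
        for label in labels:
            # Assumption: the value is the one token right after the label,
            # optionally separated by ".", ":" or "#".
            pattern = re.escape(label) + r"\.?\s*[:#]?\s*(\S+)"
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                found[field] = match.group(1)
                break  # stop at the first label that matches
    return found

print(find_by_alias("Invoice No: 12345\nInvoice date: 2017-03-01"))
print(find_by_alias("Invoice number and date are listed below"))
```

The first call picks up the number and the date. The second shows the weak spot: the labels match, but the tokens after them ("and", "are") are not values, so finding a keyword is not the same as locating the value that belongs to it.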

m3nu commented 7 years ago

I don't think this will work, because you can't map the keywords to values without properly parsing the text.

m3nu commented 6 years ago

There are some ways to use machine learning for matching. They reach around 70 to 80% accuracy right now, which is probably not enough for a business setting because everything would need to be rechecked. So investing in templates and getting 100% accuracy is probably better for now.
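To illustrate what a template buys you, here is a rough sketch of per-issuer matching. It is only an illustration, not invoice2data's template format or API; `TEMPLATE`, `extract`, the issuer name, and the regexes are all made up for the example.

```python
import re

# Hypothetical template for one issuer: keywords decide whether it applies,
# and one regex per field says exactly where the value sits.
TEMPLATE = {
    "issuer": "Example Corp",
    "keywords": ["Example Corp"],
    "fields": {
        "invoice_number": r"Invoice No:\s*(\d+)",
        "amount": r"Total:\s*(\d+\.\d{2})",
    },
}

def extract(text, template=TEMPLATE):
    """Return the extracted fields, or None if the template does not fit."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {"issuer": template["issuer"]}
    for field, pattern in template["fields"].items():
        match = re.search(pattern, text)
        if match is None:
            return None  # refuse a partial result instead of guessing
        result[field] = match.group(1)
    return result

print(extract("Example Corp\nInvoice No: 42\nTotal: 19.99"))
print(extract("Other Ltd\nInvoice No: 42\nTotal: 19.99"))
```

Because the patterns are written for one known layout, a match can be trusted without rechecking, and an invoice that does not fit is rejected outright. The cost is one template per issuer.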

Here is some literature if you are interested, @courtenayparserr.