Closed bosd closed 1 year ago
Ping @m3nu
Should be OK to match an invoice before doing the optimizations for extractions. Maybe add a graph of the steps at some point to make it quicker to understand and start?
> Maybe add a graph of the steps at some point to make it quicker to understand and start?
I am thinking of something like this:
```mermaid
flowchart LR
InvoiceFile[fa:fa-file-invoice Invoicefile\n\npdf\nimage\ntext] --> Input-module(Input Module\n\npdftotext\ntext\npdfminer\npdfplumber\ntesseract\ngvision)
Input-module --> |Extracted Text| C{keyword\nmatching}
Invoice-Templates[fa:fa-file-lines Invoice Templates] --> C{keyword\nmatching}
C --> |Extracted Text + fa:fa-file-circle-check Template| E(Template Processing\n apply options from template\nremove accents, replaces etc...)
E --> |Optimized String|Plugins&Parsers(Call plugins + parsers)
subgraph Plugins&Parsers
direction BT
tables[fa:fa-table tables] ~~~ lines[fa:fa-grip-lines lines]
lines ~~~ regex[fa:fa-code regex]
regex ~~~ static[fa:fa-check static]
end
Plugins&Parsers --> |output| result[result\nfa:fa-file-csv,\njson,\nXML]
click Invoice-Templates https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md
click result https://github.com/invoice-x/invoice2data#usage
click Input-module https://github.com/invoice-x/invoice2data#installation-of-input-modules
click E https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#options
click tables https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#tables
click lines https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#lines
click regex https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#regex
click static https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#parser-static
```
Will make it in a separate PR so we can discuss it there.
Before this PR, an optimized string was generated for every individual template, whether or not that template matched the invoice. This hurts performance, especially when many templates are loaded.
With this PR, an optimized_str is generated only for the matched template instead of for all templates, which greatly improves performance.
On my local system, I measured roughly a 2x speedup.
:warning: Warning: This performance increase comes at a cost. Because keyword matching now runs on the extracted text before the template's options are applied, it might break some templates.
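
To illustrate the change and the caveat, here is a minimal sketch and not the actual invoice2data code. The names `optimize`, `matches`, `find_template_before`, `find_template_after`, and the `stats` counter are hypothetical stand-ins, and templates are simplified to plain dicts. The option name `remove_accents` is used only as an example of a template option.

```python
import unicodedata

# Hypothetical stand-ins for illustration; not invoice2data's real API.
stats = {"optimize_calls": 0}


def optimize(text, options):
    """Apply template options to the extracted text (here: only accent removal)."""
    stats["optimize_calls"] += 1
    if options.get("remove_accents"):
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return text


def matches(text, template):
    """A template matches when all of its keywords occur in the text."""
    return all(keyword in text for keyword in template["keywords"])


def find_template_before(text, templates):
    """Old flow: build an optimized string for each template, then match on it."""
    for template in templates:
        optimized = optimize(text, template["options"])
        if matches(optimized, template):
            return template, optimized
    return None, None


def find_template_after(text, templates):
    """New flow: match on the raw text, optimize only for the matched template."""
    for template in templates:
        if matches(text, template):
            return template, optimize(text, template["options"])
    return None, None


if __name__ == "__main__":
    templates = [
        {"name": "a", "keywords": ["Alpha"], "options": {}},
        {"name": "b", "keywords": ["Beta"], "options": {}},
        {"name": "c", "keywords": ["Gamma"], "options": {}},
    ]
    text = "Invoice from Gamma Ltd"

    find_template_before(text, templates)
    print("old flow optimize calls:", stats["optimize_calls"])

    stats["optimize_calls"] = 0
    find_template_after(text, templates)
    print("new flow optimize calls:", stats["optimize_calls"])
```

In this sketch the old flow calls `optimize` once per template tried, while the new flow calls it once in total. It also shows how a template can break: a template with keyword `Societe` and `remove_accents` enabled matches the text `Société Foo` in the old flow, but not in the new one, because the accents are still present when the keywords are checked.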