invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

Performance update: match keywords on extracted_str #470

Closed: bosd closed this 1 year ago

bosd commented 1 year ago

Before this PR, an optimized string was generated for every individual template. This hurts performance, especially when there are many templates.

This PR improves performance considerably: keywords are matched on the extracted_str, and an optimized_str is generated only for the matched template instead of for all templates.
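To illustrate the idea, here is a minimal sketch of the two flows as I read them from the description above. It is not the invoice2data code: `optimize`, `matches`, `find_before`, `find_after`, the `calls` counter and the template dict layout are all hypothetical names made up for this example.

```python
import re
import unicodedata

# Hypothetical stand-ins for illustration; none of these names are invoice2data APIs.
calls = {"optimize": 0}  # counts how often the expensive step runs


def optimize(extracted_str, options):
    """Apply a template's options to the raw text (the costly per-template step)."""
    calls["optimize"] += 1
    text = extracted_str
    if options.get("remove_accents"):
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if options.get("remove_whitespace"):
        text = re.sub(r"\s+", "", text)
    return text


def matches(template, text):
    """A template matches when every keyword pattern is found in the text."""
    return all(re.search(keyword, text) for keyword in template["keywords"])


def find_before(templates, extracted_str):
    """Assumed old flow: optimize for every template, match on the optimized text."""
    for template in templates:
        optimized_str = optimize(extracted_str, template["options"])
        if matches(template, optimized_str):
            return template, optimized_str
    return None, None


def find_after(templates, extracted_str):
    """Assumed new flow: match on the raw text, optimize only for the match."""
    for template in templates:
        if matches(template, extracted_str):
            return template, optimize(extracted_str, template["options"])
    return None, None


def count_optimize_calls(finder, templates, extracted_str):
    """Run one lookup and report how many times optimize() was called."""
    calls["optimize"] = 0
    finder(templates, extracted_str)
    return calls["optimize"]


# 50 templates that do not match, followed by one that does.
templates = [
    {"keywords": [f"Vendor{i}"], "options": {"remove_accents": True}}
    for i in range(50)
]
templates.append({"keywords": ["ACME"], "options": {"remove_accents": True}})
text = "ACME GmbH\nTotal dû: 42"

print(count_optimize_calls(find_before, templates, text))  # 51
print(count_optimize_calls(find_after, templates, text))   # 1

# A template whose keyword only exists after optimization stops matching.
fragile = [{"keywords": ["ACMEGmbH"], "options": {"remove_whitespace": True}}]
print(find_before(fragile, text)[0] is not None)  # True
print(find_after(fragile, text)[0] is not None)   # False
```

The last two lines show the kind of template that could break: its keyword relies on an option having been applied before matching. The call counts say nothing about real timings, which depend on the templates and input modules in use.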

On my local system, I measured a 2x performance increase.

:warning: Warning: Every performance increase comes at a cost. Because keywords are now matched on the extracted_str rather than on an optimized string, this might break some templates.

rmilecki commented 1 year ago

Ping @m3nu

m3nu commented 1 year ago

Should be OK to match an invoice before doing the optimizations for extractions. Maybe add a graph of the steps at some point to make it quicker to understand and start?

bosd commented 1 year ago

> Maybe add a graph of the steps at some point to make it quicker to understand and start?

I am thinking of something like this:

flowchart LR
    InvoiceFile[fa:fa-file-invoice Invoicefile\n\npdf\nimage\ntext] --> Input-module(Input Module\n\npdftotext\ntext\npdfminer\npdfplumber\ntesseract\ngvision)
    Input-module --> |Extracted Text| C{keyword\nmatching}
    Invoice-Templates[fa:fa-file-lines Invoice Templates] --> C{keyword\nmatching}
    C --> |Extracted Text + fa:fa-file-circle-check Template| E(Template Processing\n apply options from template\nremove accents, replaces etc...)
    E --> |Optimized String|Plugins&Parsers(Call plugins + parsers)
    subgraph Plugins&Parsers
      direction BT
        tables[fa:fa-table tables] ~~~ lines[fa:fa-grip-lines lines]
        lines ~~~ regex[fa:fa-code regex]
        regex ~~~ static[fa:fa-check static]

    end
    Plugins&Parsers --> |output| result[result\nfa:fa-file-csv,\njson,\nXML]

 click Invoice-Templates https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md
 click result https://github.com/invoice-x/invoice2data#usage
 click Input-module https://github.com/invoice-x/invoice2data#installation-of-input-modules
 click E https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#options
 click tables https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#tables
 click lines https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#lines
 click regex https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#regex
 click static https://github.com/invoice-x/invoice2data/blob/master/TUTORIAL.md#parser-static

I will add the diagram in a separate PR so we can discuss it there.
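For readers who prefer code to diagrams, the stages of the flowchart can be sketched roughly as below. This is a simplified illustration under my own assumptions, not the project's implementation: `prepare`, `run_parsers`, `extract` and the dict-based template layout are hypothetical, the input module stage is skipped (the text is assumed to be extracted already), and only a `regex` and a `static` parser are modeled.

```python
import json
import re
import unicodedata


def prepare(extracted_str, template):
    """Template processing: apply the template's options to get the optimized string."""
    options = template.get("options", {})
    text = extracted_str
    if options.get("remove_accents"):
        decomposed = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    for old, new in options.get("replace", []):
        text = text.replace(old, new)
    return text


def run_parsers(optimized_str, template):
    """Call the parsers: only 'static' and 'regex' are sketched here."""
    result = {}
    for name, spec in template["fields"].items():
        if spec["parser"] == "static":
            result[name] = spec["value"]
        elif spec["parser"] == "regex":
            found = re.search(spec["regex"], optimized_str)
            if found:
                result[name] = found.group(1)
    return result


def extract(extracted_str, templates):
    """Keyword matching on the raw text, then processing and parsing for the match."""
    for template in templates:
        if all(keyword in extracted_str for keyword in template["keywords"]):
            optimized_str = prepare(extracted_str, template)
            return run_parsers(optimized_str, template)
    return None  # no template matched


# Hypothetical template and input, for illustration only.
templates = [
    {
        "keywords": ["ACME"],
        "options": {"remove_accents": True, "replace": [(",", ".")]},
        "fields": {
            "issuer": {"parser": "static", "value": "ACME GmbH"},
            "invoice_number": {"parser": "regex", "regex": r"Invoice no:\s*(\d+)"},
            "amount": {"parser": "regex", "regex": r"Total:\s*([\d.]+)"},
        },
    }
]
text = "ACME GmbH\nInvoice no: 1234\nTotal: 99,50 EUR"

print(json.dumps(extract(text, templates)))
```

The tables and lines plugins from the diagram are left out to keep the sketch short; they would slot into `run_parsers` next to the two parsers shown.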