invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

Separate type conversion from field extraction #198

Open rseabrook opened 5 years ago

rseabrook commented 5 years ago

When creating the tables plugin, I found that type conversion is tightly coupled to field extraction in the core lib. There was no easy way to reuse the conversion logic, so I mostly duplicated it inside the plugin. The lines plugin does much the same thing, with yet another implementation of type conversion.

Would you be in favor of refactoring to separate type conversion from field extraction, @m3nu? That would allow the plugins to take advantage of type conversion in the core and simplify the plugin code. It would give sum field support to the plugins by default (#178) and have benefits for exception handling (#190). I wanted to get your feedback before sinking much time into the refactor.

m3nu commented 5 years ago

As we add more functionality, separating it into different modules and functions is surely a good idea. Currently we have these three stages (1 stage = 1 package here): input, extraction, output.

Your stage ("conversion"?) would go after "extraction"? Would it process one field at a time or everything at once after extraction? Like "post-extraction"?

rseabrook commented 5 years ago

I was thinking that the extraction stage would create a full result before passing it to conversion. There are two different type conversion methods/conventions right now.

  1. A naming convention: field names with a prefix such as date_ or amount_.
  2. Explicit type definition in the templates as introduced in the lines plugin.

The conversion stage would walk through the result and operate on each field as dictated by the template or field naming convention.
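
To make the proposal concrete, here is a minimal sketch of what a separate conversion stage might look like. It is not the existing invoice2data code: `convert`, `field_type`, `CONVERTERS`, and the two parsers are hypothetical names, and the parsing is deliberately simplistic (a fixed ISO date format, commas as thousands separators). It assumes extraction hands over a flat dict of raw strings and that an explicit type from the template takes precedence over the naming convention.

```python
from datetime import datetime

# Hypothetical converters; real date/amount parsing would need to honor
# template options such as date formats and decimal separators.
def to_amount(raw):
    return float(raw.replace(",", ""))

def to_date(raw):
    return datetime.strptime(raw, "%Y-%m-%d")

CONVERTERS = {"amount": to_amount, "date": to_date, "int": int}

def field_type(name, template_types):
    """Decide a field's type: explicit template type first, then naming convention."""
    if name in template_types:
        return template_types[name]
    for prefix in ("date", "amount"):
        if name == prefix or name.startswith(prefix + "_"):
            return prefix
    return None  # no conversion requested

def convert(result, template_types=None):
    """Walk a fully extracted result and convert each field independently."""
    template_types = template_types or {}
    converted = {}
    for name, raw in result.items():
        converter = CONVERTERS.get(field_type(name, template_types))
        converted[name] = converter(raw) if converter else raw
    return converted

# Example: raw strings as an extraction stage might return them.
raw_result = {
    "date_due": "2019-03-01",
    "amount_total": "1,234.50",
    "qty": "3",
    "issuer": "ACME",
}
converted = convert(raw_result, template_types={"qty": "int"})
```

Because `convert` only sees a dict of raw values and a type mapping, a plugin could call it on its own extracted rows instead of carrying a private copy of the conversion logic.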