invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

[IMP] return all the options #474

Closed legalsylvain closed 1 year ago

legalsylvain commented 1 year ago

Hi all,

First, thanks for this great library! I've just developed a small module on top of it, and it works like a charm.

I'm facing a minor issue. Some PDFs use the date format %d/%m/%y and others use %m/%d/%y, so I used the date_formats option in my templates to parse the dates correctly:

options:
  date_formats: '%d/%m/%y'

The problem is that the JSON result doesn't mention the format of the date. For example, here is a result:

{
    "issuer": "My Supplier",
    "amount": 671.37,
    "invoice_number": "792437",
    "date": "07/02/23",
    "currency": "EUR",
    "desc": "Invoice from My Supplier"
}
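To make the ambiguity concrete, here is a minimal sketch using only the Python standard library (it does not call invoice2data; the variable names are illustrative). The same string yields two different dates depending on which format is assumed:

```python
from datetime import datetime

raw = "07/02/23"  # the "date" value from the result above

# Same string, two plausible formats, two different dates.
day_first = datetime.strptime(raw, "%d/%m/%y").date()
month_first = datetime.strptime(raw, "%m/%d/%y").date()

print(day_first.isoformat())    # 2023-02-07
print(month_first.isoformat())  # 2023-07-02
```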

In my application, I cannot know whether the date is '2023-02-07' or '2023-07-02'. For the time being, as a workaround, I duplicated the date format and added it to the fields section:

fields:
  date_format:
    parser: static
    value: '%d/%m/%y'

However, this trivial PR would avoid that duplication.

New result:

{
    "issuer": "My Supplier",
    "amount": 671.37,
    "invoice_number": "792437",
    "date": "07/02/23",
    "currency": "EUR",
    "desc": "Invoice from My Supplier",
    "options": {
        "remove_whitespace": false,
        "remove_accents": false,
        "lowercase": false,
        "currency": "EUR",
        "date_formats": "%d/%m/%y",
        "languages": []
    }
}

Please, let me know if I missed something.

rmilecki commented 1 year ago

Hi @legalsylvain, I think you misunderstood how dates are handled.

Specifying date_formats in template options affects how strings with dates are parsed into datetime.date. After parsing they are internally stored as datetime.date.

If you use invoice2data as a library, that is exactly what you're going to get: an unformatted date object.

If you use invoice2data from the command line, internally stored dates may get formatted back into text. The details depend on the output format used (JSON, XML, CSV). To control the format used for translating dates into text, you can use the --output-date-format CLI option. That translation is not affected by the template or its date_formats at all.
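The two independent steps described above can be sketched with the standard library alone. This is only an illustration of the idea, not invoice2data's actual code; `parse_invoice_date` and `format_invoice_date` are hypothetical helper names:

```python
from datetime import date, datetime


def parse_invoice_date(text: str, input_format: str) -> date:
    """Hypothetical helper: the parsing step, driven by the template's format."""
    return datetime.strptime(text, input_format).date()


def format_invoice_date(value: date, output_format: str) -> str:
    """Hypothetical helper: the output step, driven only by the output format."""
    return value.strftime(output_format)


# Two invoices write the same day differently; each is parsed with its own format.
a = parse_invoice_date("07/02/23", "%d/%m/%y")
b = parse_invoice_date("02/07/23", "%m/%d/%y")

# Once parsed, both are the same date object and the input format no longer matters.
same_day = a == b
text = format_invoice_date(a, "%Y-%m-%d")
print(same_day, text)
```

Because the parsed value carries no trace of the input format, the consumer only needs to know the output format.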

legalsylvain commented 1 year ago

Hi @rmilecki,

First, thanks for your answer!

> Hi @legalsylvain, I think you misunderstood how dates are handled.

What I missed, in fact, is the "type: date" option. Without it, all my results were strings. With this option added, I get a datetime in the Python program that calls this library, so all is OK.
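For a calling script that forwards the result as JSON, one possible approach (a sketch under assumptions, not part of invoice2data) is to serialize date objects in ISO 8601 form so the consumer never has to guess the field order. The `result` dict below is hand-written to resemble the output shown earlier, and `encode_dates` is a hypothetical helper:

```python
import json
from datetime import date, datetime


def encode_dates(value):
    """Hypothetical fallback encoder: emit date/datetime values as ISO 8601 text."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    raise TypeError(f"Cannot serialize {type(value).__name__}")


# Hand-written stand-in for a parsed result whose date field is a datetime.
result = {
    "issuer": "My Supplier",
    "amount": 671.37,
    "invoice_number": "792437",
    "date": datetime(2023, 2, 7),
    "currency": "EUR",
}

payload = json.dumps(result, default=encode_dates)
print(payload)
```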

That said, perhaps this PR is still worth merging. Knowing the parsing options can be useful for the calling script.

I'll let you be the judge.

Have a nice day!

rmilecki commented 1 year ago

> Knowing the parsing options can be useful for the calling script.

I'm not sure about this. One may argue that it may be useful for the calling script to know all fields. Or maybe keywords. Or whatever.

Maybe we could just provide the full path to the template used? So whatever information is needed can just be read from it? Thinking out loud...

legalsylvain commented 1 year ago

> I'm not sure about this. One may argue that it may be useful for the calling script to know all fields. Or maybe keywords. Or whatever.

You're probably right. In any case, this PR is not complete, so I'm closing it.

> Maybe we could just provide the full path to the template used? So whatever information is needed can just be read from it? Thinking out loud...

I'm not sure it will work in all cases, especially if the application that finally consumes the JSON doesn't have access to the template file.

Anyway, thanks a lot for your time.