invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

Parsing Lines in Invoice --> Failed to find any lines for "lines" #548

Closed: stony007de closed this issue 9 months ago

stony007de commented 9 months ago

Hi there, I'm having trouble parsing the line items in an invoice. All my other fields work well, but the lookup for each item position in the document fails.

I have attached an Amazon invoice --> 2024-01-25T062309.pdf. In it I would like to find the positions "Beschreibung, Menge, Stückpreis, USt. %, Stückpreis and Zwischensumme".

My template for this looks as follows:

lines:
  start: Beschreibung\s+Menge\s+Stückpreis\s+USt. %\s+Stückpreis\s+Zwischensumme
  end: Versandkosten
  line: (?P<Beschreibung>.+)\s+(?P<Menge>\d+)\s+(?P<Stueckpreis>\d+\d+)s+(?P<Steuer>\d+.\d+)\s+(?P<Stueckpreis2>\d+\d+)\s+(?P<Zwischensumme>\d+\d+)
  skip_line: s*\(.*ohne USt.\)\s+\(.*inkl. USt.\)

And the output is:

DEBUG:invoice2data.extract.parsers.lines: Testing Rules set #0
DEBUG:invoice2data.extract.parsers.lines: START lines block content ========================

                                                                              (ohne USt.)                 (inkl. USt.)             (inkl. USt.)

Normfest TORX Schraubendreher | Torx-Schraubendreher | (TI 5)        1             8,39 €     19%             9,99 €                   9,99 €
ASIN: B0B8PDXBWB

DEBUG:invoice2data.extract.parsers.lines: END lines block content ==========================
DEBUG:invoice2data.extract.parsers.lines: The following line doesn't match anything:
*                                                                              (ohne USt.)                 (inkl. USt.)             (inkl. USt.)*
DEBUG:invoice2data.extract.parsers.lines: The following line doesn't match anything:
*Normfest TORX Schraubendreher | Torx-Schraubendreher | (TI 5)        1             8,39 €     19%             9,99 €                   9,99 €*
DEBUG:invoice2data.extract.parsers.lines: The following line doesn't match anything:
*ASIN: B0B8PDXBWB*
DEBUG:invoice2data.extract.parsers.lines: Failed to find lines block start
WARNING:invoice2data.extract.parsers.lines: Failed to find any lines for "lines"

Can anyone tell me what my problem is?

bosd commented 9 months ago

Hi, the problem seems to be in the regex. I was able to parse the line by adding the decimal separator and the currency sign to the regex:

(?P<Beschreibung>.+)\s+(?P<Menge>\d+)\s+(?P<Stueckpreis>\d+[,]\d+)\s+[€]\s*(?P<Steuer>\d+).\s+(?P<Stueckpreis2>\d+[,]\d+)\s.\s+(?P<Zwischensumme>\d+[,]\d+)

https://regex101.com/r/k1xaa6/1
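For anyone who wants to check this locally rather than on regex101, here is a minimal sketch that runs both patterns against the item line from the debug output. The helper `parse_line` is hypothetical and not part of invoice2data. It uses Python's `re.search`, which is an assumption about how the pattern is applied. The library's own matching may differ in detail.

```python
import re

# Item line copied from the debug output in this thread.
SAMPLE_LINE = (
    "Normfest TORX Schraubendreher | Torx-Schraubendreher | (TI 5)"
    "        1             8,39 €     19%             9,99 €"
    "                   9,99 €"
)

# Pattern from the original template: prices are written as \d+\d+ (no comma),
# and one "\s+" is missing its backslash, so it demands a literal "s".
ORIGINAL = (
    r"(?P<Beschreibung>.+)\s+(?P<Menge>\d+)\s+(?P<Stueckpreis>\d+\d+)s+"
    r"(?P<Steuer>\d+.\d+)\s+(?P<Stueckpreis2>\d+\d+)\s+(?P<Zwischensumme>\d+\d+)"
)

# Pattern from the reply: accepts the decimal comma and the currency sign.
FIXED = (
    r"(?P<Beschreibung>.+)\s+(?P<Menge>\d+)\s+(?P<Stueckpreis>\d+[,]\d+)\s+[€]\s*"
    r"(?P<Steuer>\d+).\s+(?P<Stueckpreis2>\d+[,]\d+)\s.\s+(?P<Zwischensumme>\d+[,]\d+)"
)


def parse_line(pattern, line):
    """Hypothetical helper: return the named groups as a dict, or None if no match."""
    match = re.search(pattern, line)
    if match is None:
        return None
    # The greedy description group keeps trailing padding, so strip each value.
    return {name: value.strip() for name, value in match.groupdict().items()}


if __name__ == "__main__":
    print("original:", parse_line(ORIGINAL, SAMPLE_LINE))
    print("fixed:   ", parse_line(FIXED, SAMPLE_LINE))
```

The original pattern cannot match for two reasons: `\d+\d+` has no place for the comma in `8,39`, and the `s+` after `Stueckpreis` (a `\s+` without its backslash) requires a literal letter "s". The same missing backslash appears at the start of `skip_line` in the template.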

stony007de commented 9 months ago

Perfect! Thanks!