invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License
1.79k stars 476 forks

Is there any sample for capturing the address field #295

Open rageshS opened 4 years ago

rageshS commented 4 years ago

Hi All,

I am new to the invoice2data Python library and really satisfied with it; it has been very useful for me. But I have found some limitations with multiline data in invoice PDFs, such as the address field. Here is one problem I faced while using invoice2data to extract address data: if the desired field is multiline and one or more other fields sit at the same horizontal position (I mean to its left or right, for example 'Invoice Address' and 'Trading Address'), their text may be concatenated together during extraction.

image

I think that if I write a regular expression to capture the 'Invoice Address' field, it will capture the 'Trading Address' text too. I have already checked the templates provided in this Git repository, but I can't find any example of capturing an address field from an invoice PDF.
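To illustrate the problem described above, here is a minimal sketch of one possible workaround, not something invoice2data provides. It assumes the text was extracted in a layout-preserving mode (for example `pdftotext -layout`), so the two address blocks stay aligned in columns. The function `split_columns` and the sample data are hypothetical and only show the idea of slicing each line at the column where the right-hand header starts.

```python
def split_columns(layout_text, left_header, right_header):
    """Split two side-by-side text blocks using the headers' column offsets."""
    lines = layout_text.splitlines()
    # Find the line holding both headers; its offsets define the column boundary.
    for i, line in enumerate(lines):
        if left_header in line and right_header in line:
            start = line.index(left_header)
            boundary = line.index(right_header)
            break
    else:
        raise ValueError("header line not found")

    left, right = [], []
    for line in lines[i + 1:]:
        if not line.strip():
            break  # a blank line is assumed to end the address block
        left.append(line[start:boundary].strip())
        right.append(line[boundary:].strip())
    # Drop empty cells so blocks of unequal height are handled.
    return [s for s in left if s], [s for s in right if s]


# Hypothetical sample mimicking layout-preserved output (columns padded to 25 chars).
rows = [
    ("Invoice Address", "Trading Address"),
    ("Acme Ltd", "Acme Warehouse"),
    ("1 High Street", "Unit 7, Dock Road"),
    ("London", "Liverpool"),
]
sample = "\n".join(l.ljust(25) + r for l, r in rows) + "\n\nInvoice No: 123\n"

invoice_address, trading_address = split_columns(
    sample, "Invoice Address", "Trading Address"
)
print(invoice_address)
print(trading_address)
```

This only works when the columns are cleanly aligned and the left block never runs past the right-hand header's column; a plain regex over the same lines would match text from both addresses.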

rageshS commented 4 years ago

Any update on this?

Jane-Ding commented 4 years ago

Hi rageshS, have you solved this problem? I also need to extract a multi-line address, but it seems this can only be done with custom items that use a fixed regex location to extract the address.

rageshS commented 4 years ago

@Jane-Ding I didn't get any helpful answer and am still facing the issue. I think this is not an actively maintained Git repo.

kavinsharma commented 3 years ago

Hi @rageshS @Jane-Ding, I have found one way to do it: capture the text from a specific area, since pdftotext supports x,y coordinates. YAML template for the area plugin: area:

You can have a look at this guide; it might help you out: https://www.youtube.com/watch?v=JOdLRe4MTmo&list=PLhPDb5zFmGR1CCSX_oxLGyPuHPbWEToSf
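As a rough sketch of the area-based approach mentioned above, the snippet below calls the poppler `pdftotext` command-line tool directly with its crop options (`-x`, `-y`, `-W`, `-H`, plus `-f`/`-l` for the page range and `-r` for the resolution). It does not reproduce the invoice2data template syntax. The helper names, the file name `invoice.pdf`, and the coordinates are all hypothetical; real values depend on the invoice layout.

```python
import subprocess


def pdftotext_area_cmd(pdf_path, x, y, width, height, page=1, resolution=72):
    """Build a pdftotext command that extracts text from a rectangular crop area."""
    return [
        "pdftotext",
        "-layout",
        "-f", str(page), "-l", str(page),   # first and last page to convert
        "-r", str(resolution),              # resolution in DPI; x/y/W/H are in pixels
        "-x", str(x), "-y", str(y),         # top-left corner of the crop area
        "-W", str(width), "-H", str(height),
        pdf_path,
        "-",                                # write the text to stdout
    ]


def extract_area(pdf_path, **area):
    """Run pdftotext on the given area; requires poppler-utils to be installed."""
    result = subprocess.run(
        pdftotext_area_cmd(pdf_path, **area), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8")


# Hypothetical coordinates for the box around the 'Invoice Address' block.
cmd = pdftotext_area_cmd("invoice.pdf", x=40, y=120, width=250, height=90)
print(" ".join(cmd))
```

Calling `extract_area("invoice.pdf", x=40, y=120, width=250, height=90)` would then return only the text inside that box, so a regex applied to it cannot pick up the neighbouring 'Trading Address' column.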