naiveHobo / InvoiceNet

Deep neural network to extract intelligent information from invoice documents.
MIT License
2.47k stars, 391 forks

info: table fields #20

Open IzzyHibbert opened 4 years ago

IzzyHibbert commented 4 years ago

Hi guys

I was wondering whether fields represented in a table (like the line item fields) are supported. If yes, how do I set them up? If not, that would really be nice to have.

naiveHobo commented 4 years ago

I'm not completely sure which fields you're referring to. Do you have a sample image you can display in this ticket?

In general, however, it should be possible to train a model for any field in a document as long as the field cannot have multiple occurrences. You can train a model even for fields that have multiple occurrences, but you will only be able to use one of the occurrences as the true label, and the resulting model will also only be able to extract a single occurrence of that field.

IzzyHibbert commented 4 years ago

Thank you. You have already answered my question. I meant multiple occurrences, such as the purchased articles described in the invoice, or "line items". You normally find more than one, so you have Item1, Item2, Item3, and so on.

They are typically laid out with similar vertical and horizontal alignment.

Is there any chance this will be included in the future, or do you have any idea how to start developing in this direction?

Thanks

ocr-avenger commented 4 years ago

Hi @IzzyHibbert, you can try this API dedicated to invoices: https://scandocflow.com

mirfan899 commented 3 years ago

It seems InvoiceNet can't handle tables, for example:

[image: XML_1609163070]

How can we extract the items from the table, given that a custom field only takes a single key-value pair?

seanbenhur commented 3 years ago

@mirfan899, @IzzyHibbert have you found a solution?

mirfan899 commented 3 years ago

Nope. Use something else, like YOLO. I solved the issue using YOLOv3.

yackinn commented 2 years ago

@mirfan899 That's great. Can you provide a link to the repository?

mirfan899 commented 2 years ago

https://github.com/ultralytics/yolov3

yackinn commented 2 years ago

@mirfan899 Thank you. I'm not sure how YOLO would extract invoice data, though. Did you write your own custom network?

mirfan899 commented 2 years ago

I labeled the dataset and then trained a YOLOv3 model. Here are the results:

[image: gas]

r-toroxel commented 2 years ago

Can't you at least use the OPTIONAL data type for small lists?

AhmedHathout commented 2 years ago

@mirfan899 Thanks for sharing. Could you use YOLO to extract the line item details from the table? For example, if you wanted to extract the payment history lines from your last photo, i.e. something like:

[{"Month": "Dec 2021", "HM3": 0.622, "Current Bill": 433.36, ...},
{"Month": "Nov 2021", "HM3": 0.387, ...}]

Is that possible? As far as I understand, neither InvoiceNet nor YOLO can do that.

mirfan899 commented 2 years ago

Why not? YOLO can solve the table issue. Just label the table and, after detection, use OCR to extract the text.
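The "detect the table, then OCR it" idea can be sketched roughly as below. This is an illustration only, not code posted in this thread: it assumes a detector has already returned one bounding box for the table and an OCR engine has already returned words with pixel boxes. The `Box` type, the `words_in_region` helper and all coordinates are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def words_in_region(words: List[Tuple[str, Box]], region: Box) -> List[str]:
    """Return the OCR words whose center falls inside the detected region."""
    kept = []
    for text, box in words:
        cx = (box.x1 + box.x2) / 2
        cy = (box.y1 + box.y2) / 2
        if region.x1 <= cx <= region.x2 and region.y1 <= cy <= region.y2:
            kept.append(text)
    return kept


# Made-up coordinates for illustration; a real detector and OCR engine
# would supply the table box and the word boxes.
table = Box(50, 300, 550, 400)
ocr_words = [
    ("Invoice", Box(60, 40, 140, 60)),     # above the table
    ("Dec", Box(60, 310, 90, 330)),
    ("2021", Box(95, 310, 135, 330)),
    ("433.36", Box(400, 310, 460, 330)),
    ("Total", Box(60, 500, 110, 520)),     # below the table
]
table_words = words_in_region(ocr_words, table)
print(table_words)
```

Note that this only yields a flat list of the words inside the table; it carries no information about which column each word belongs to.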

yackinn commented 2 years ago

I guess he's aiming to extract formatted line items with labels, not just text. Running OCR on the table will just give you some text.

AhmedHathout commented 2 years ago

Thank you both for your quick replies. Yes, I wanted them to be formatted so that I know which text corresponds to which column. I will need to store the extracted data and process it depending on the column.

mirfan899 commented 2 years ago

I have done something similar. You need to label the columns with YOLO, then detect and OCR. You need more data to get better accuracy: around 50 samples of a single template.
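One way the column-labelling approach could be turned into structured line items is sketched below. It is a minimal sketch under stated assumptions, not the code used by anyone in this thread: it assumes the detector returns one box per labelled column and the OCR engine returns words with pixel boxes. `Box`, `rows_from_columns`, the `row_tolerance` parameter and all coordinates are hypothetical; only the sample values ("Dec 2021", 0.622, and so on) come from the example earlier in the discussion.

```python
from typing import Dict, List, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def rows_from_columns(
    words: List[Tuple[str, Box]],
    columns: Dict[str, Box],
    row_tolerance: float = 10.0,
) -> List[Dict[str, str]]:
    """Group OCR words into rows and file each word under its column."""
    # Sort by vertical center so words on the same text line are adjacent.
    ordered = sorted(words, key=lambda w: ((w[1].y1 + w[1].y2) / 2, w[1].x1))
    rows: List[Dict[str, List[Tuple[float, str]]]] = []
    last_y = None
    for text, box in ordered:
        cx = (box.x1 + box.x2) / 2
        cy = (box.y1 + box.y2) / 2
        name = next(
            (n for n, c in columns.items()
             if c.x1 <= cx <= c.x2 and c.y1 <= cy <= c.y2),
            None,
        )
        if name is None:
            continue  # word lies outside every detected column
        # A vertical jump larger than the tolerance starts a new row.
        if last_y is None or cy - last_y > row_tolerance:
            rows.append({})
        last_y = cy
        rows[-1].setdefault(name, []).append((box.x1, text))
    # Within each cell, join the words left to right.
    return [
        {name: " ".join(t for _, t in sorted(cell)) for name, cell in row.items()}
        for row in rows
    ]


# Made-up column and word boxes for illustration only.
columns = {"Month": Box(50, 300, 150, 400), "HM3": Box(200, 300, 300, 400)}
ocr_words = [
    ("Dec", Box(60, 310, 90, 330)),
    ("2021", Box(95, 312, 135, 332)),
    ("0.622", Box(220, 310, 270, 330)),
    ("Nov", Box(60, 340, 90, 360)),
    ("2021", Box(95, 340, 135, 360)),
    ("0.387", Box(220, 341, 270, 361)),
]
line_items = rows_from_columns(ocr_words, columns)
print(line_items)
```

The cell values come back as strings, so numeric columns would still need parsing, and `row_tolerance` would have to be tuned to the line height of the template.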

yackinn commented 2 years ago

Can you show a sample of how you labeled the columns with YOLO to detect single line items? I'm also interested in this.

mirfan899 commented 2 years ago

[image: annotation_table]

Like this.

yackinn commented 2 years ago

Can you provide a repository with sample code?