invoice-x / invoice2data

Extract structured data from PDF invoices
MIT License

Brainstorming new features #75

Closed m3nu closed 6 years ago

m3nu commented 6 years ago

After looking at the available literature, here are some ideas for new features:

1: https://docparser.com/
2: https://en.wikipedia.org/wiki/Universal_Business_Language
3: https://medium.com/tradeshift-engineering/scaling-up-machine-learning-algorithm-for-form-recognition-bd09b319e14a
4: https://arxiv.org/pdf/1708.07403.pdf
5: http://cs229.stanford.edu/proj2016/report/LiuWanZhang-UnstructuredDocumentRecognitionOnBusinessInvoice-report.pdf
6: http://python.apichecklist.com/

m3nu commented 6 years ago

@utsav666 commented an hour ago:

For a first pass, this repo does a great job of fetching the entities. But how could we support this use case with an ML approach (TensorFlow, Keras, etc.)? To get the discussion started, here is a repo that extracts tables from PDFs using ML: https://github.com/HazyResearch/TreeStructure. Have a look and let's brainstorm together.

m3nu commented 6 years ago

There should be an easier way to create and manage templates. Maybe as a web service? Many people don't know regex or can't make pull requests, but they could probably choose from a list of possible keywords and still help with template creation.
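As a rough illustration of the keyword-picking idea, the sketch below builds a template from choices a user could make in a form, so they never write a regex themselves. `VALUE_PATTERNS`, `build_template`, and the sample text are hypothetical and not part of invoice2data; the result only mirrors the general shape of a template (issuer, keywords, fields).

```python
import json
import re

# Hypothetical library of ready-made value patterns a user could pick from.
VALUE_PATTERNS = {
    "money": r"(\d+[.,]\d{2})",
    "iso_date": r"(\d{4}-\d{2}-\d{2})",
    "number": r"(\d+)",
}


def build_template(issuer, keywords, fields):
    """Build a template dict from form-style choices.

    `fields` maps a field name to (label as printed on the invoice, value kind).
    """
    template = {"issuer": issuer, "keywords": list(keywords), "fields": {}}
    for name, (label, kind) in fields.items():
        # Escape the label so users can type it literally.
        pattern = re.escape(label) + r"\s*:?\s*" + VALUE_PATTERNS[kind]
        template["fields"][name] = pattern
    return template


template = build_template(
    issuer="Example Corp",
    keywords=["Example Corp"],
    fields={
        "amount": ("Total", "money"),
        "date": ("Invoice date", "iso_date"),
        "invoice_number": ("Invoice no", "number"),
    },
)

# Try the generated patterns on made-up invoice text.
sample_text = "Example Corp\nInvoice no: 1234\nInvoice date: 2018-03-01\nTotal: 99.50"
extracted = {
    name: re.search(pattern, sample_text).group(1)
    for name, pattern in template["fields"].items()
}

print(json.dumps(template, indent=2))
print(extracted)
```

A real service would need more value kinds and a way to preview matches on an uploaded invoice, but the user-facing input stays a label plus a choice from a list.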

sanjayio commented 6 years ago

If we build a Chrome plugin or some other UI for this, we can pull in the fields from the initial text processing and show them to the user. The user can then select which fields to extract for the template, key in some keywords, and save the result as a custom template.

sanjayio commented 6 years ago

Yes, a web service looks like a good way to manage templates instead of opening a pull request each time a template is created. I was thinking of saving these templates in a database rather than as files, for easier management. Any suggestions?
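To make the database idea concrete, here is a minimal sketch that keeps each template as a JSON document in SQLite, keyed by issuer. The table layout and the helper names (`open_store`, `save_template`, `load_templates`) are assumptions for illustration only; nothing like this exists in invoice2data.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a template store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS templates ("
        " issuer TEXT PRIMARY KEY,"
        " body TEXT NOT NULL)"
    )
    return conn


def save_template(conn, template):
    """Insert the template, or overwrite the existing one for the same issuer."""
    conn.execute(
        "INSERT OR REPLACE INTO templates (issuer, body) VALUES (?, ?)",
        (template["issuer"], json.dumps(template)),
    )
    conn.commit()


def load_templates(conn):
    """Return all stored templates as dicts, ordered by issuer."""
    rows = conn.execute("SELECT body FROM templates ORDER BY issuer")
    return [json.loads(body) for (body,) in rows]


conn = open_store()
save_template(conn, {
    "issuer": "Example Corp",
    "keywords": ["Example Corp"],
    "fields": {"amount": r"Total:\s*(\d+\.\d{2})"},
})
# Saving again under the same issuer replaces the earlier version.
save_template(conn, {
    "issuer": "Example Corp",
    "keywords": ["Example Corp", "EXAMPLE"],
    "fields": {"amount": r"Total:\s*(\d+\.\d{2})"},
})
templates = load_templates(conn)
print(templates)
```

Storing the whole template as one JSON value keeps the schema independent of which fields a template defines, at the cost of not being able to query individual fields in SQL.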

AvatarSenju commented 6 years ago

For GUI features we could use Tkinter, which might also help improve stability on Windows. For web-based services, I suggest we look into using Django; it eases working with a database and with separate app models for the project.

sanjayio commented 6 years ago

Hmm, yes. Django seems like a good option for implementing the web services. It can work with NoSQL databases like MongoDB as well (through third-party packages), which I feel is a good option for storing our templates. MongoDB can also export a collection as JSON with its own tooling, which would be handy for building the template services.

m3nu commented 6 years ago

We'll surely not add Django to the current project. It would be more like an add-on. There could still be a GUI to edit the templates. Testing should be covered as well, for example by saving expected results for some test PDFs.
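One way the "saved expected results" idea could look is sketched below: each sample file gets a JSON file next to it holding the expected output, and a check reports the samples whose extraction no longer matches. `check_expected` and `fake_extract` are hypothetical names, and the demo uses text files with a stand-in extractor so it runs without any PDFs; a real setup would pass the project's extraction function and keep the default `*.pdf` pattern.

```python
import json
import tempfile
from pathlib import Path


def check_expected(extract, sample_dir, pattern="*.pdf", update=False):
    """Compare extract(path) with the stored <name>.json next to each sample.

    Missing expected files are recorded on first run. Returns the names of
    samples whose result differs from the stored one. Results must be
    JSON-serializable.
    """
    failures = []
    for sample in sorted(Path(sample_dir).glob(pattern)):
        expected_file = sample.with_suffix(".json")
        result = extract(sample)
        if update or not expected_file.exists():
            expected_file.write_text(json.dumps(result, indent=2, sort_keys=True))
            continue
        if json.loads(expected_file.read_text()) != result:
            failures.append(sample.name)
    return failures


def fake_extract(path):
    # Stand-in extractor: reports the text length, not real invoice data.
    return {"chars": len(path.read_text())}


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "invoice1.txt"
    sample.write_text("Total: 99.50")
    first_run = check_expected(fake_extract, tmp, pattern="*.txt")   # records the result
    second_run = check_expected(fake_extract, tmp, pattern="*.txt")  # matches the record
    sample.write_text("Total: 100.00")
    third_run = check_expected(fake_extract, tmp, pattern="*.txt")   # output changed

print(first_run, second_run, third_run)
```

Passing `update=True` rewrites the stored results, which is useful after an intentional change to a template.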

duskybomb commented 6 years ago

@m3nu I have made a prototype GUI using PyQt5. See guiInvoice2data.

If you get time, have a look at it. Also, if you could rerun the tests on the test-patch branch of the forked repo, it would be highly appreciated.

duskybomb commented 6 years ago

@m3nu I updated guiInvoice2data.

sheikmohdimran commented 5 years ago

The arXiv paper (link 4) is implemented here: https://github.com/naiveHobo/InvoiceNet