Closed by m3nu 6 years ago
@utsav666 commented an hour ago:
For a first pass, this repo does a great job of fetching the entities. But how could we support this use case with an ML approach (TensorFlow, Keras, etc.)? To get the discussion started, here is a repo that extracts tables from PDFs using ML: https://github.com/HazyResearch/TreeStructure. Have a look and let's brainstorm together.
There should be an easier way to create and manage templates, maybe as a web service. Many people don't know regex or how to open a pull request, but they could probably choose from a list of possible keywords and still help with template creation.
If we build a Chrome plugin or some other form of UI for this, we can pull in the fields from the initial text processing and show them to the user. The user can then select which fields to extract, key in some keywords for the template, and save the result as a custom template.
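A rough sketch of what the UI could do behind the scenes once the user has picked fields and keywords. Everything here is an assumption for illustration: `build_template` and `apply_template` are hypothetical helpers, the key names (`issuer`, `keywords`, `fields`) are only loosely modeled on the existing template files, and the generated regex is deliberately naive (label, optional colon, then one non-space token).

```python
import json
import re


def build_template(issuer, keywords, field_labels):
    """Assemble a template dict from the user's UI choices (hypothetical helper).

    field_labels maps a field name to the label the user picked from a list,
    e.g. {"amount": "Total"}. No regex knowledge is needed from the user.
    """
    fields = {
        # Naive pattern: the escaped label, an optional colon, one token.
        name: re.escape(label) + r"\s*:?\s*(\S+)"
        for name, label in field_labels.items()
    }
    return {"issuer": issuer, "keywords": list(keywords), "fields": fields}


def apply_template(template, text):
    """Return extracted fields, or None if any keyword is missing from text."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        result[name] = match.group(1) if match else None
    return result


sample_text = "ACME Corp\nInvoice Number: INV-1001\nTotal: 250.00\n"
template = build_template(
    "ACME Corp",
    ["ACME Corp"],
    {"invoice_number": "Invoice Number", "amount": "Total"},
)
print(json.dumps(template, indent=2))  # what "save as custom template" could store
print(apply_template(template, sample_text))
```

A real UI would likely need smarter patterns for dates and amounts, but the point is that the user only chooses labels and keywords.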
Yes, a web service looks like a good way to manage templates instead of opening a pull request each time a template is created. I was thinking of saving these templates in a database rather than as files, for easier management. Any suggestions?
For GUI features we could use Tkinter, which might also help improve stability on Windows. For web-based services, I suggest we look into Django; it makes it easier to work with a database and to split the project into separate app models.
Hmm, yes. Django seems like a good option for implementing the web services. It can also work with NoSQL databases like MongoDB through third-party packages, which I think is a good option for storing our templates. MongoDB can also dump a collection as JSON with its own tooling, which should be useful for building the template services.
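To make the "templates in a database, exportable as JSON" idea concrete, here is a minimal sketch. It swaps in Python's built-in `sqlite3` for MongoDB purely so the example runs without extra services; the table layout and the helper names (`open_store`, `save_template`, `load_templates`, `export_json`) are assumptions, not anything that exists in the project.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a template store; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS templates ("
        "issuer TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_template(conn, template):
    """Insert a template, replacing any existing one for the same issuer."""
    conn.execute(
        "INSERT OR REPLACE INTO templates (issuer, body) VALUES (?, ?)",
        (template["issuer"], json.dumps(template)),
    )
    conn.commit()


def load_templates(conn):
    """Return all stored templates as dicts, ordered by issuer."""
    rows = conn.execute("SELECT body FROM templates ORDER BY issuer")
    return [json.loads(body) for (body,) in rows]


def export_json(conn):
    """Dump the whole collection as one JSON document."""
    return json.dumps(load_templates(conn), indent=2)


conn = open_store()
save_template(conn, {"issuer": "ACME Corp", "keywords": ["ACME Corp"], "fields": {}})
save_template(conn, {"issuer": "Beta GmbH", "keywords": ["Beta"], "fields": {}})
print(export_json(conn))
```

Storing each template as a JSON document keyed by issuer keeps the shape close to a document store, so moving to MongoDB later would mostly mean replacing these four helpers.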
We'll surely not add Django to the current project; it would be more like an add-on. There could still be a GUI to edit the templates. Testing should be covered as well, for example by saving expected results for some test PDFs.
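One way the "expected results for test PDFs" idea could look, as a sketch only: each sample PDF gets a sibling `.json` file holding the expected fields, and a small runner reports any differences. The `extract` argument is a placeholder for whatever function turns a PDF path into a dict of fields; the file layout and helper names are assumptions.

```python
import json
from pathlib import Path


def compare_result(actual, expected):
    """Return {field: (expected, actual)} for every field that differs."""
    keys = set(actual) | set(expected)
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if actual.get(key) != expected.get(key)
    }


def run_regression(extract, sample_dir):
    """Run extract() on every PDF in sample_dir that has a sibling .json file.

    Returns {pdf_name: diff} for the PDFs whose output no longer matches.
    """
    failures = {}
    for pdf_path in sorted(Path(sample_dir).glob("*.pdf")):
        expected_path = pdf_path.with_suffix(".json")
        if not expected_path.exists():
            continue  # no saved expectation for this sample yet
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        diff = compare_result(extract(pdf_path), expected)
        if diff:
            failures[pdf_path.name] = diff
    return failures
```

A GUI that edits templates could call something like `run_regression` after each change, so a template tweak that breaks a previously working sample is caught right away.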
@m3nu I have made a prototype GUI using PyQt5; see guiInvoice2data.
If you get time, have a look at it. It would also be highly appreciated if you could rerun the tests on the test-patch branch of the forked repo.
@m3nu I updated guiInvoice2data.
The arXiv paper (link 4) is implemented here: https://github.com/naiveHobo/InvoiceNet
After looking at the available literature, here are some ideas on new features:
1: https://docparser.com/
2: https://en.wikipedia.org/wiki/Universal_Business_Language
3: https://medium.com/tradeshift-engineering/scaling-up-machine-learning-algorithm-for-form-recognition-bd09b319e14a
4: https://arxiv.org/pdf/1708.07403.pdf
5: http://cs229.stanford.edu/proj2016/report/LiuWanZhang-UnstructuredDocumentRecognitionOnBusinessInvoice-report.pdf
6: http://python.apichecklist.com/