open-mmlab / mmocr

OpenMMLab Text Detection, Recognition and Understanding Toolbox
https://mmocr.readthedocs.io/en/dev-1.x/
Apache License 2.0
4.29k stars, 744 forks

Training data and code for key-information-extraction #478

Closed INF800 closed 2 years ago

INF800 commented 3 years ago

Hi. Amazing work at https://mmocr.readthedocs.io/en/latest/demo.html#example-4-text-detection-recognition-key-information-extraction. Where can I find the code for training the key-information-extraction model?

P.S. If it is not available in the docs, I can send a PR.

gaotongxiao commented 3 years ago

Thanks! Your question is unclear to me: do you want to train the model from scratch? You can find the config of the SDMGR model at configs/kie/sdmgr/sdmgr_unet16_60e_wildreceipt.py. First familiarize yourself with the config and prepare the dataset by referring to our docs (check out Getting Started and Training). Then you can start training by running python tools/train.py configs/kie/sdmgr/sdmgr_unet16_60e_wildreceipt.py.

INF800 commented 3 years ago

Yes, I was looking to train the model from scratch using my own annotated dataset. I will try to reproduce the training in Colab and get back here if I run into any issues.

INF800 commented 3 years ago

Hi, everything is working great.

But I have one doubt. Some annotations in the dataset are wrong. For example, here you can see that prod_price is wrongly annotated as others. [image]

Is the best model trained on the same data?

Notebook for training and visualizing the dataset: https://colab.research.google.com/drive/1Q80UoNZjunHnKrgP1GJFtnouuIjNReEP?usp=sharing

INF800 commented 3 years ago

One more thing - here we are detecting keys and values independently of each other. For example, we may detect multiple product keys and product values, but we will not be able to map each key to its exact value.

Is there any script or method to map keys to their values?

INF800 commented 3 years ago

Another doubt is that the text annotations are stored without spaces. For example, "THANKYOU FOR SHOPPING WITH US" is annotated as THANKYOUFORSHOPPINGWITHUS.

karndeepsingh commented 3 years ago

@INF800 Hi Brother, I am also working on extracting key information from documents. Can you suggest some references on how to annotate and prepare a dataset for this task? It would be a great help.

It would be good if we could connect on LinkedIn and work together. Here is my ID: https://www.linkedin.com/in/karndeepsingh Thanks.

karndeepsingh commented 3 years ago

@gaotongxiao Hey, how can I annotate my dataset for the KIE task? If you could suggest any tool or approach for annotation, that would surely help me. Thanks

INF800 commented 3 years ago

Hi @karndeepsingh, you can use Label Studio. You will need to write a post-processing script.

karndeepsingh commented 3 years ago

Hi @INF800, can you share that post-processing script if you have prepared it? I would like to see how it is done so that I can adapt it to my use case.

Thanks

INF800 commented 3 years ago

I haven't written it yet, but it should be very straightforward. You will have to convert the output of Label Studio into the WildReceipt data format.

By the way, the LinkedIn link is not working; the URL is wrong.
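
The conversion described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a script from MMOCR or Label Studio: it assumes each annotated region has already been flattened into a dict with percentage-based `x`, `y`, `width`, `height` plus `text` and `label`, and that the target is one JSON object per line with `file_name`, `height`, `width` and a list of `box`/`text`/`label` annotations. The function names and the class-id mapping are hypothetical; check them against the WildReceipt files you actually train on.

```python
import json


def region_to_annotation(region, img_w, img_h, class_ids):
    """Convert one percentage-based rectangle into a box/text/label dict."""
    # Assumed input: coordinates given as percentages of the image size.
    x1 = region["x"] * img_w / 100.0
    y1 = region["y"] * img_h / 100.0
    x2 = x1 + region["width"] * img_w / 100.0
    y2 = y1 + region["height"] * img_h / 100.0
    return {
        # Four corner points, clockwise from the top-left corner.
        "box": [x1, y1, x2, y1, x2, y2, x1, y2],
        "text": region["text"],
        # class_ids is a hypothetical name-to-integer mapping.
        "label": class_ids[region["label"]],
    }


def task_to_line(task, class_ids):
    """Serialize one annotated image as a single JSON line."""
    annotations = [
        region_to_annotation(r, task["width"], task["height"], class_ids)
        for r in task["regions"]
    ]
    record = {
        "file_name": task["file_name"],
        "height": task["height"],
        "width": task["width"],
        "annotations": annotations,
    }
    return json.dumps(record, ensure_ascii=False)


def write_annotation_file(tasks, class_ids, out_path):
    """Write one JSON object per line, e.g. to a train.txt file."""
    with open(out_path, "w", encoding="utf-8") as f:
        for task in tasks:
            f.write(task_to_line(task, class_ids) + "\n")
```

Rotated rectangles and polygon regions are not handled here; they would need their own corner computation before being written out.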

karndeepsingh commented 3 years ago

Okay, you mean the JSON format of the train.txt file. It can be done easily. Try this, @INF800: https://www.linkedin.com/in/karndeepsingh/

karndeepsingh commented 3 years ago

@INF800 While using Label Studio, we need to use the OCR annotation format, right?

gaotongxiao commented 3 years ago

Hi everything is working great.

But I have one doubt. Some annotations in the dataset are wrong. For example, here you can see that prod_price is wrongly annotated as others. [image]

Is the best model trained on the same data?

Yes, the best model was trained on WildReceipt. It seems some annotations are actually wrong.

One more thing - here we are detecting keys and values independently of each other. For example, we may detect multiple product keys and product values, but we will not be able to map each key to its exact value.

Is there any script or method to map keys to their values?

Currently not. It's technically possible to extract the weights from the node graph (where each node maps to one text box) and determine the best-matched pairs, though.
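
The best-matched-pairs idea could look something like the sketch below. It is only an illustration under assumptions: it takes a per-node list of predicted class names and a square matrix of pairwise edge scores as plain Python lists, and it pairs keys with values greedily by descending score. How to obtain such scores from SDMGR is not shown in this thread, and `pair_keys_to_values` and its arguments are hypothetical names, not MMOCR functions.

```python
def pair_keys_to_values(labels, edge_scores, key_to_value):
    """Greedily pair key nodes with value nodes by descending edge score.

    labels: predicted class name for each text box (node).
    edge_scores: edge_scores[i][j] is an assumed affinity from node i to j.
    key_to_value: maps a key class name to its matching value class name.
    """
    candidates = []
    for i, key_label in enumerate(labels):
        value_label = key_to_value.get(key_label)
        if value_label is None:
            continue  # node i is not a key
        for j, label in enumerate(labels):
            if label == value_label:
                candidates.append((edge_scores[i][j], i, j))

    # Strongest candidate edges first.
    candidates.sort(reverse=True)

    pairs = []
    used_keys, used_values = set(), set()
    for _score, i, j in candidates:
        if i in used_keys or j in used_values:
            continue  # each key and each value is used at most once
        pairs.append((i, j))
        used_keys.add(i)
        used_values.add(j)
    return pairs
```

Greedy matching is the simplest choice; if keys compete for the same value, an optimal assignment method could be swapped in instead.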

Another doubt is that the text annotations are stored without spaces. For example, "THANKYOU FOR SHOPPING WITH US" is annotated as THANKYOUFORSHOPPINGWITHUS.

Is there any reason why we are omitting spaces? What will happen if we keep spaces, i.e., use "text": "THANKYOU FOR SHOPPING WITH US"?

SDMGR uses char-level embedding to deal with unseen character combinations (especially for numeric values). As far as I understand, neither the dictionary provided in WildReceipt nor the implementation of KIE's text parser is friendly to spaces. Even if you keep spaces in the texts, they will still be skipped. However, I can give you some hints if you want to test the importance of spaces.

I can't explain the intuition behind omitting spaces since I'm not the author of SDMGR, but I can help you ping @cuhk-hbsun.
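
To illustrate why spaces end up being skipped, here is a minimal sketch of char-level encoding against a dictionary that has no entry for the space character. It is not MMOCR's actual text parser; `build_dictionary` and `encode_text` are hypothetical helpers and the index scheme is an assumption.

```python
def build_dictionary(chars):
    """Map each dictionary character to an integer index."""
    return {ch: i for i, ch in enumerate(chars)}


def encode_text(text, char_to_index):
    """Encode text character by character, dropping unknown characters."""
    # Characters missing from the dictionary (here: the space) are skipped.
    return [char_to_index[ch] for ch in text if ch in char_to_index]


# A toy dictionary of uppercase letters only, with no space entry.
char_to_index = build_dictionary("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

with_spaces = encode_text("THANKYOU FOR SHOPPING WITH US", char_to_index)
without_spaces = encode_text("THANKYOUFORSHOPPINGWITHUS", char_to_index)
print(with_spaces == without_spaces)
```

Under these assumptions both strings produce the same index sequence, so testing the effect of spaces would require adding a space entry to the dictionary and making sure the parser does not drop it.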

gaotongxiao commented 3 years ago

@karndeepsingh I am not one of the annotators, so I'm not sure about the details. @INF800's suggestion makes sense to me and you can give it a try. You're also welcome to share your experience and/or code with our community if you make any progress :)

karndeepsingh commented 3 years ago

@gaotongxiao Thanks for your reply. The KIE training examples use receipt data for entity extraction. If I want to extract information from a scanned document, will it work? I just wanted your suggestion before I start annotating for training. Below is an example image; the red marks are the entities I want to extract. [image: InkedATG-5747901-04_LI]

gaotongxiao commented 3 years ago

@karndeepsingh I don't think it's a good idea as of now. You need to annotate all the text in the document for SDMGR to learn, which can be a lot of work. SDMGR is also not well optimized for long documents, so it can be extremely slow when there are too many text boxes in an image, which is unfortunately the case with your example data.

I'm curious, though: are you going to use KIE for practical purposes or academic purposes? We've been actively collecting feedback and deciding our next move accordingly.

karndeepsingh commented 3 years ago

Thanks for your reply. I was in a dilemma, and you cleared it up. One more thing: what approach would you recommend for extracting the words I highlighted?

And yes, I am trying to build it for practical purposes. @gaotongxiao

gaotongxiao commented 3 years ago

@karndeepsingh Thanks. Extracting words from images is exactly what MMOCR can do with ocr.py (see the docs).

INF800 commented 3 years ago

In @karndeepsingh's image above, there are two key-value pairs: (i) the key is the loan grantor's name field "Grantor(s)" and the value is "Bryan D Lindwig"; (ii) the address key is not available, but the address value is (it happens to be the entire paragraph).

So @gaotongxiao, can we follow an annotation criterion in which we ONLY annotate "Grantor(s)" as grantor_key, "Bryan D Lindwig" as grantor_value, and the whole address paragraph as address_value? In this image we will not annotate address_key as it is not available.

Please note that we won't be annotating any other text.

gaotongxiao commented 3 years ago

@INF800 My concern is that SDMGR relies on the upstream OCR engine that feeds it with annotated text boxes. So even if you annotate only part of the text boxes in the training set, SDMGR will still get confused by noise when the full set of text boxes is given at inference time, unless you have some heuristic to eliminate the noisy texts.

amitbcp commented 3 years ago

@INF800 @karndeepsingh you can refer to this issue for annotation using Label Studio and the post-processing scripts: https://github.com/open-mmlab/mmocr/issues/434.

Also, @gaotongxiao, @innerlee previously mentioned that the framework supports edge-based training and loss, and that it has to be documented so that we can use it. The issue is https://github.com/open-mmlab/mmocr/issues/248. Once we know how to use this, I believe it will encourage research on, or the creation of, datasets for learning the edges that map keys to values.

gaotongxiao commented 3 years ago

@INF800 I just talked to my colleagues and confirmed that they have the code ready. But they still need some time to clean it up. We will let you guys know when it is released.

INF800 commented 3 years ago

Hi, can I know the PR number so that I can track it?

pushpalatha1405 commented 2 years ago

@karndeepsingh Hi, if you are still looking to train and test MMOCR on your custom dataset, I can help, as I have used MMOCR on a custom dataset for training, testing and inference in our product. If you still need help, you can contact me at pushpa.abhinav@gmail.com

regards, pushpa