shabie / docformer

Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)
MIT License

How to replicate FUNSD dataset for question answering #16

Closed: mayankpathaklumiq closed this issue 2 years ago

mayankpathaklumiq commented 2 years ago

I have tried to implement FUNSD dataset question answering, but I am confused about how to use DocFormer's multi-modal feature output.

uakarsh commented 2 years ago

One possible solution could be:

  1. The output of the DocFormer encoder has shape (batch_size, 512, 768); extract it following the steps described in the readme.
  2. Extract features from the question (Visual Question Answering models are a good reference for how language features are extracted), combine them with the DocFormer encoder output (by concatenation or some other fusion method), and then apply linear layers to get outputs of the desired sequence length; see the sketch below this list.

This is just a high-level overview, as far as I have tried to implement it.
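To make step 2 concrete, here is a minimal PyTorch sketch of one way to fuse the two modalities. It assumes the DocFormer encoder output has shape (batch_size, 512, 768) as in the readme; the question encoder (a BERT model loaded via Hugging Face `transformers`) and the concatenation-based fusion head are illustrative choices for this sketch, not part of this repo.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class DocFormerQAHead(nn.Module):
    """Illustrative fusion head: concatenates DocFormer encoder features
    with a pooled question vector and predicts extractive answer-span logits."""

    def __init__(self, hidden_size=768):
        super().__init__()
        # Question encoder: any language model works; BERT is an assumption here.
        self.question_encoder = AutoModel.from_pretrained("bert-base-uncased")
        # After concatenating per-token doc features (768) with the broadcast
        # question vector (768), each position is represented by 1536 dims.
        self.fusion = nn.Linear(hidden_size * 2, hidden_size)
        # Two logits per token: answer-span start and end.
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, doc_features, question_ids, question_mask):
        # doc_features: (batch_size, 512, 768) from the DocFormer encoder
        q_out = self.question_encoder(input_ids=question_ids,
                                      attention_mask=question_mask)
        # Use the [CLS] vector as a single question summary: (batch_size, 768)
        q_vec = q_out.last_hidden_state[:, 0]
        # Broadcast the question vector over the 512 document positions
        q_vec = q_vec.unsqueeze(1).expand(-1, doc_features.size(1), -1)
        fused = torch.relu(self.fusion(torch.cat([doc_features, q_vec], dim=-1)))
        logits = self.qa_outputs(fused)               # (batch_size, 512, 2)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
q = tokenizer(["What is the total amount?"], return_tensors="pt",
              padding="max_length", max_length=32, truncation=True)
doc_features = torch.randn(1, 512, 768)  # stand-in for real DocFormer output
model = DocFormerQAHead()
start_logits, end_logits = model(doc_features, q["input_ids"], q["attention_mask"])
```

Concatenation plus a linear layer is the simplest fusion; cross-attention between the question tokens and the document features is a common alternative if this baseline is not strong enough.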

For more pointers, you can search for medical Visual Question Answering with Transformers, and replace the standard transformer encoder in those pipelines with DocFormer.

And once you are done, please let us know the results: it would benefit the community, and we would know that DocFormer works for QA as well. If you have any more questions, let me know. :)