shabie / docformer

Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)
MIT License
253 stars · 40 forks

NER task #35

Closed BakingBrains closed 2 years ago

BakingBrains commented 2 years ago

@uakarsh @shabie Hello

Thank you for the great work. Can you give some more insights on NER task?

Thanks and Regards.

uakarsh commented 2 years ago

@BakingBrains Hi,

Actually, the way I trained DocFormer on RVL-CDIP relies on some assumptions (you can check out the Kaggle notebook, where I list them; they were driven by limited GPU memory and time constraints).

You can check out this line of code, where I load the weights (ckpt). If that doesn't work, the checkpoints are saved in version 5 of the Kaggle notebook (shared in examples/docformer_pl/ of the cloned docformer repo).
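The weight-loading step can be sketched roughly like this. This is a minimal, hypothetical illustration, assuming the checkpoint is a dict with a "state_dict" key (as PyTorch Lightning saves by default); `TinyModel` is a stand-in for the actual DocFormer Lightning module in examples/docformer_pl/, not the repo's real class:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the DocFormer Lightning module.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 2)

model = TinyModel()
# Save a checkpoint in the {"state_dict": ...} layout Lightning uses.
torch.save({"state_dict": model.state_dict()}, "ckpt.pth")

# Later (or in another process): rebuild the model and restore the weights.
restored = TinyModel()
ckpt = torch.load("ckpt.pth", map_location="cpu")
restored.load_state_dict(ckpt["state_dict"])
```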

As for the NER task, I will be uploading the script shortly, so stay tuned. I will let you know as soon as it is up.

Hope this helps

Regards, Akarsh

BakingBrains commented 2 years ago

@uakarsh Thank you. How many epochs do you think are required to get good output for document classification?

Also, in the paper the authors mention:

"DocFormer is pre-trained for 5 epochs, then we remove all three task heads. We add one linear projection head and fine-tune all components of the model for all downstream tasks."

I see a linear layer being added here:

self.resnet = ResNetFeatureExtractor(hidden_dim = config['max_position_embeddings'])
self.embeddings = DocFormerEmbeddings(config)
self.lang_emb = LanguageFeatureExtractor()
self.config = config
self.dropout = nn.Dropout(config['hidden_dropout_prob'])
self.linear_layer = nn.Linear(in_features = config['hidden_size'], out_features = len(id2label)) 

That is in Section 3.2. Are we doing the same thing for document classification as well? I had this doubt.
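To make my question concrete, the distinction I mean could be sketched like this (hypothetical shapes and names, not the repo's actual code): both tasks add a single linear projection over the encoder output, but classification pools the sequence into one prediction while NER predicts per token:

```python
import torch
import torch.nn as nn

hidden_size, seq_len, num_doc_classes, num_ner_labels = 768, 512, 16, 9
# Stand-in for the multi-modal encoder output: [batch, seq, hidden].
encoder_output = torch.randn(1, seq_len, hidden_size)

# Document classification: pool the sequence (here, the first token),
# producing one logit per document class.
cls_head = nn.Linear(hidden_size, num_doc_classes)
doc_logits = cls_head(encoder_output[:, 0, :])   # [1, num_doc_classes]

# NER: project every token position, producing one label per token.
ner_head = nn.Linear(hidden_size, num_ner_labels)
token_logits = ner_head(encoder_output)          # [1, seq_len, num_ner_labels]
```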

Thanks and Regards

uakarsh commented 2 years ago

Hi,

I think "good results" is a subjective matter (it depends on your application, i.e. how you wish to use this architecture). From my perspective, you can just tune the hyperparameters, observe the results (I have integrated W&B for visualizing progress), and come to a conclusion.

Currently, if you are talking about the 4th notebook (i.e. the Kaggle notebook), no, we are not using the pre-trained weights (I have separately described the pre-training on MLM); rather, we train directly from scratch.

If you want to go with pre-training, here is a simple approach: pre-train DocFormer with MLM (from the 3rd notebook in the examples), and then use the 4th notebook (loading the specific weights from the checkpoint you saved).
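The two-stage flow above can be sketched as follows. This is a toy illustration under stated assumptions: the `nn.Linear` modules are stand-ins for the real DocFormer encoder and heads, and the file name is made up; the point is only that the MLM head is discarded while the encoder weights carry over:

```python
import torch
import torch.nn as nn

# Stage 1 (pre-training, e.g. notebook 3): encoder + MLM head are trained
# together; afterwards only the encoder's weights are saved.
encoder = nn.Linear(8, 8)        # stand-in for the DocFormer encoder
mlm_head = nn.Linear(8, 30522)   # stand-in for the MLM head (vocab-sized logits)
torch.save(encoder.state_dict(), "mlm_encoder.pth")

# Stage 2 (fine-tuning, e.g. notebook 4): reload only the encoder weights;
# the MLM head is dropped and a fresh task head is trained downstream.
ft_encoder = nn.Linear(8, 8)
ft_encoder.load_state_dict(torch.load("mlm_encoder.pth"))
cls_head = nn.Linear(8, 16)      # e.g. 16 classes for RVL-CDIP
```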

Hope this helps.

Regards, Akarsh

BakingBrains commented 2 years ago

@uakarsh Thank you😄