microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Demo: adding visual embeddings to LayoutLM #314

Open NielsRogge opened 3 years ago

NielsRogge commented 3 years ago

A much-requested feature/question in this repo has been "how do you add visual embeddings to LayoutLM?". I wondered how this worked myself, so (just in time for the release of LayoutLM 2.0) here's a notebook that fine-tunes LayoutLM on the FUNSD dataset while adding visual embeddings from a pre-trained ResNet-101 backbone (as was done in the paper):

https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLM/Add_image_embeddings_to_LayoutLM.ipynb

First, a document image is resized to 3x224x224 and passed through a pre-trained ResNet-101 to obtain a feature map of shape (1024x14x14). Next, I use ROI-align to turn each bounding box of the original document image into a feature map of shape (1024x3x3), which is then flattened and linearly projected to match the hidden_size of LayoutLM (768 for the base model). I assume the authors did something similar (either ROI-pooling as in Faster R-CNN, or ROI-align, which was introduced later and improves performance compared to ROI-pooling). The parameters of the ResNet model are updated during training, so we're effectively fine-tuning it together with LayoutLM.
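For anyone who wants the gist without opening the notebook, here's a minimal sketch of that pipeline in PyTorch/torchvision. The module and argument names (e.g. `VisualEmbedder`) are just illustrative, not the notebook's exact code:

```python
# Minimal sketch (assumed names, not the notebook's exact code):
# ResNet-101 backbone + ROI-align + linear projection to the LayoutLM hidden size.
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import roi_align


class VisualEmbedder(nn.Module):
    def __init__(self, hidden_size=768):
        super().__init__()
        # Keep ResNet-101 up to layer3: a 3x224x224 image gives a 1024x14x14 feature map.
        resnet = torchvision.models.resnet101(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-3])
        # Flattened 1024x3x3 ROI feature -> LayoutLM hidden size (768 for the base model).
        self.projection = nn.Linear(1024 * 3 * 3, hidden_size)

    def forward(self, images, boxes):
        # images: (batch, 3, 224, 224); boxes: list of (num_boxes, 4) tensors
        # with (x1, y1, x2, y2) coordinates in 224x224 image space.
        feature_maps = self.backbone(images)  # (batch, 1024, 14, 14)
        # spatial_scale maps 224-pixel coordinates onto the 14x14 feature map.
        roi_feats = roi_align(feature_maps, boxes, output_size=(3, 3),
                              spatial_scale=14 / 224)  # (total_boxes, 1024, 3, 3)
        return self.projection(roi_feats.flatten(start_dim=1))  # (total_boxes, hidden_size)
```

The resulting per-box embeddings can then be combined with the corresponding token representations, for example by summing them with the text + layout embeddings before they go through the encoder; the notebook shows one way of wiring this up.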

By adding these visual features, I was able to improve performance on the test set, compared to using only text + layout (bounding box) information, to around the following:

'precision': 0.8053668087066682, 'recall': 0.8163670324538874, 'f1': 0.8108296133109165

Related issues:

#201, #95, #265, #80, #243, #286, #285, #249, #165, #97, #163

victor-ab commented 3 years ago

@NielsRogge Nice work, thanks for sharing! What was the F1 score without the image features?

NielsRogge commented 3 years ago

Hi, thank you!

As can be seen in my earlier notebook, the performance was:

'precision': 0.7239292364990689, 'recall': 0.7778889444722361, 'f1': 0.7499397154569568

Note that my earlier notebook used BIOES tagging, whereas I'm now using the simpler BIO tagging scheme.
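For anyone unfamiliar with the difference between the two schemes, here's a tiny illustrative example (hypothetical tokens with FUNSD-style labels, not taken from the dataset):

```python
# Same hypothetical tokens labelled under both schemes (illustrative only).
tokens     = ["Invoice", "Number", ":", "12345"]
bio_tags   = ["B-QUESTION", "I-QUESTION", "I-QUESTION", "B-ANSWER"]   # BIO: begin / inside / outside
bioes_tags = ["B-QUESTION", "I-QUESTION", "E-QUESTION", "S-ANSWER"]   # BIOES: adds explicit end / single tags
```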

ruifcruz commented 3 years ago

You rock!! 🚀

fredo838 commented 3 years ago

Respect!

NormXU commented 3 years ago

Fancy Work!

brunnurs commented 3 years ago

Your work for the NLP/NLU community, especially for those of us trying to apply these papers to real use cases, is extremely helpful! Many thanks and keep up the good work.

asynxc commented 3 years ago

Is this the exact algorithm the team used to produce the SOTA results?

Nice work by the way, thanks!!