microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Reproducing Performance on DocVQA using LayoutLMv3/LayoutLMv2 #841

Open allanj opened 1 year ago

allanj commented 1 year ago

I tried my best to reproduce the results reported in the paper, which is about 78% ANLS on the test set. But all I get is 74% on the test set (73% on the validation set), which is still well below the reported number.

Could you share more details about how to obtain the reported number?

My repo: https://github.com/allanj/LayoutLMv3-DocVQA
Model I'm using: LayoutLMv3-base
OCR I use: Microsoft Read API, with the latest model version

  1. I tried different matching mechanisms to find the answer spans.
  2. I tried a sliding-window approach, which does not really help (see the sketch below).
  3. I even followed the paper and used a batch size of 128 with 100k optimization steps (equivalent to roughly 300 epochs, which I don't think is necessary).
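For reference, a minimal sketch of the sliding-window encoding in item 2, assuming HuggingFace's LayoutLMv3Processor; the file path, word/box values, and the stride/max_length choices are illustrative, not from my repo:

```python
# Sliding-window encoding sketch for long DocVQA pages, assuming the
# HuggingFace LayoutLMv3Processor; stride/max_length values are illustrative.
from PIL import Image
from transformers import LayoutLMv3Processor

# apply_ocr=False because we supply our own OCR words and boxes
processor = LayoutLMv3Processor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=False
)

image = Image.open("page.png").convert("RGB")  # placeholder path
words = ["Invoice", "Total:", "$1,234.56"]     # placeholder OCR tokens
boxes = [[70, 50, 160, 70], [170, 50, 230, 70], [240, 50, 330, 70]]  # 0-1000 scale

encoding = processor(
    image,
    words,
    boxes=boxes,
    truncation=True,
    max_length=512,
    stride=128,                      # overlap between consecutive windows
    return_overflowing_tokens=True,  # one encoded row per window
    padding="max_length",
    return_tensors="pt",
)
# encoding["overflow_to_sample_mapping"] maps each window back to its page;
# at inference time, the window-level start/end predictions are merged,
# e.g. by taking the span with the highest start+end logit across windows.
```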

The best performance I can get with LayoutLMv3-base is about 73.3% on the validation set.

I also referred to the following issues, as I can't find a public codebase that reproduces the DocVQA results:

  1. https://github.com/NielsRogge/Transformers-Tutorials/issues/49
  2. https://github.com/microsoft/unilm/issues/616
  3. https://github.com/microsoft/unilm/issues/501
  4. https://github.com/microsoft/unilm/issues/282

I would appreciate it if the authors could give more suggestions/details about the experiments.

HYPJUDY commented 1 year ago

There are several steps to experiment on DocVQA with the extractive method:

  1. Pre-processing.
     1.1 Get the text information of the documents using an OCR engine (e.g., Microsoft Read API).
     1.2 Find the start and end token-level positions of each answer in the text (e.g., with edit distance matching; see the sketch after this list).
  2. Model prediction.
     2.1 Train (with parameter tuning) and predict the start and end positions with models (e.g., LayoutLMv2/3).
  3. Post-processing.
     3.1 Reconstruct the answers from the text based on the predicted positions.
     3.2 Fix apparent errors (e.g., remove some whitespaces and punctuation marks).
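A minimal sketch of steps 1.2 and 3.2, assuming OCR words are available in reading order; `find_answer_span` and `normalize` are illustrative helper names (not from any repo), and SequenceMatcher stands in for a proper edit-distance matcher:

```python
# Answer span alignment sketch: find the (start, end) word indices whose
# concatenation best matches the annotated answer string.
from difflib import SequenceMatcher

def normalize(s: str) -> str:
    """Lowercase and strip surrounding punctuation/whitespace (step 3.2)."""
    return s.lower().strip(" \t\n.,;:!?\"'()")

def find_answer_span(words: list[str], answer: str, min_ratio: float = 0.8):
    """Return (start, end) word indices of the best match, or None if no
    candidate window is similar enough to the answer."""
    answer_norm = normalize(answer)
    n_answer_words = max(1, len(answer_norm.split()))
    best_span, best_ratio = None, 0.0
    # Try window widths around the answer's word count to absorb OCR
    # splits and merges.
    for width in range(max(1, n_answer_words - 1), n_answer_words + 2):
        for start in range(0, len(words) - width + 1):
            candidate = normalize(" ".join(words[start:start + width]))
            ratio = SequenceMatcher(None, candidate, answer_norm).ratio()
            if ratio > best_ratio:
                best_span, best_ratio = (start, start + width - 1), ratio
    return best_span if best_ratio >= min_ratio else None

# Example: OCR words from one document
words = ["Invoice", "Total:", "$1,234.56", "Due", "Date:", "2020-01-31"]
print(find_answer_span(words, "$1,234.56"))  # -> (2, 2)
```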

Each step could have room for improvement. It can be helpful to analyze and improve the upper bound step by step. For example, what is the ANLS score calculated using the answers found by your start and end positions? If we can get a perfect text from human annotations, the score should be close to 100. With good OCR results, the score could be greater than 95.
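To make that upper-bound check concrete, here is a minimal ANLS sketch using the standard definition (normalized Levenshtein similarity with a 0.5 threshold); the `python-Levenshtein` dependency and the example data are assumptions:

```python
# ANLS sketch: score the answers reconstructed from the matched start/end
# positions against the ground truth to measure the matching upper bound.
import Levenshtein  # pip install python-Levenshtein

def anls_score(prediction: str, gold_answers: list[str], tau: float = 0.5) -> float:
    """ANLS for one question: best similarity over gold answers, zeroed
    when the normalized Levenshtein distance reaches the threshold tau."""
    best = 0.0
    for gold in gold_answers:
        pred, gold_norm = prediction.strip().lower(), gold.strip().lower()
        nl = Levenshtein.distance(pred, gold_norm) / max(len(pred), len(gold_norm), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best

# Dataset-level ANLS is the mean over questions; if the "predictions" here
# are the spans recovered in step 1.2, this measures the matching upper bound.
preds = ["$1,234.56", "2020-01-31"]
golds = [["$1,234.56"], ["31 Jan 2020", "2020-01-31"]]
print(sum(anls_score(p, g) for p, g in zip(preds, golds)) / len(preds))  # 1.0
```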

allanj commented 1 year ago

Thanks. Is it possible to provide details about how you did this for the dataset? I think this would be important for reproducing the performance and would better help the open-source community.

roburst2 commented 1 year ago

@allanj I am trying to reproduce the result with the LayoutLMv2 model using your code, but I am getting the following error: RuntimeError: CUDA error: device-side assert triggered

The error occurs in the train_dataloader loop.
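A common cause of this error class with LayoutLMv2/v3 is an out-of-range index, e.g. bounding boxes outside the 0-1000 range expected by the 2D position embeddings, or answer positions beyond the sequence length. A minimal diagnostic sketch, assuming the usual HuggingFace extractive-QA batch keys (`bbox`, `input_ids`, `start_positions`, `end_positions`):

```python
# Sanity-check batches before they hit the GPU; the assert messages point
# at the out-of-range value that would otherwise surface as an opaque
# device-side assert.
import os

# Synchronous kernel launches turn the assert into a readable stack trace
# (must be set before CUDA is initialized).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

def check_batch(batch) -> None:
    bbox = batch["bbox"]
    assert bbox.min() >= 0 and bbox.max() <= 1000, (
        f"bbox outside [0, 1000]: min={bbox.min()}, max={bbox.max()}"
    )
    seq_len = batch["input_ids"].shape[1]
    for key in ("start_positions", "end_positions"):
        if key in batch:
            assert batch[key].max() < seq_len, (
                f"{key} {batch[key].max()} >= sequence length {seq_len}"
            )

# Usage (hypothetical): validate each batch in the training loop.
# for batch in train_dataloader:
#     check_batch(batch)
```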