microsoft / TAP

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption, CVPR 2021 (Oral)

Different --model for using `textvqa_tap_ocrcc_best.ckpt`? #14

Open priyamtejaswin opened 2 years ago

priyamtejaswin commented 2 years ago

Hello devs,

Thank you for publishing this work, and for sharing these resources!

I was trying to run the TextVQA evaluation code mentioned in the README. I can successfully run the following command using textvqa_tap_base_best.ckpt:

python tools/run.py --tasks vqa --datasets m4c_textvqa --model m4c_split --config configs/vqa/m4c_textvqa/tap_refine.yml --save_dir save/m4c_split_refine_test --run_type val --resume_file save/finetuned/textvqa_tap_base_best.ckpt

I believe this returns the results without the additional (OCR-CC) data. I think the checkpoint trained with OCR-CC is saved under save/finetuned/textvqa_tap_ocrcc_best.ckpt.

However, when I use the OCR-CC checkpoint, it fails while loading the checkpoint:

2022-03-10T21:56:35 INFO: Loading datasets
2022-03-10T21:56:37 INFO: Fetching fastText model for OCR processing
2022-03-10T21:56:37 INFO: Loading fasttext model now from /usr1/home/ptejaswi/TAP/pythia/.vector_cache/wiki.en.bin
2022-03-10T21:56:47 INFO: Finished loading fasttext model
2022-03-10T21:56:50 INFO: CUDA Device 0 is: GeForce GTX TITAN X
2022-03-10T21:56:54 INFO: Torch version is: 1.8.1+cu101
2022-03-10T21:56:54 INFO: Loading checkpoint
2022-03-10T21:56:55 ERROR: Error(s) in loading state_dict for M4C:
    Missing key(s) in state_dict: "text_bert.encoder.layer.0.attention.self.query.weight", "text_bert.encoder.layer.0.attention.self.query.bias", "text_bert.encoder.layer.0.attention.self.key.weight", "text_bert.encoder.layer.0.attention.self.key.bias", "text_bert.encoder.layer.0.attention.self.value.weight", "text_bert.encoder.layer.0.attention.self.value.bias", "text_bert.encoder.layer.0.attention.output.dense.weight", "text_bert.encoder.layer.0.attention.output.dense.bias", "text_bert.encoder.layer.0.attention.output.LayerNorm.weight", "text_bert.encoder.layer.0.attention.output.LayerNorm.bias", "text_bert.encoder.layer.0.intermediate.dense.weight", "text_bert.encoder.layer.0.intermediate.dense.bias", "text_bert.encoder.layer.0.output.dense.weight", "text_bert.encoder.layer.0.output.dense.bias", "text_bert.encoder.layer.0.output.LayerNorm.weight", "text_bert.encoder.layer.0.output.LayerNorm.bias", "text_bert.encoder.layer.1.attention.self.query.weight", "text_bert.encoder.layer.1.attention.self.query.bias", "text_bert.encoder.layer.1.attention.self.key.weight", "text_bert.encoder.layer.1.attention.self.key.bias", "text_bert.encoder.layer.1.attention.self.value.weight", "text_bert.encoder.layer.1.attention.self.value.bias", "text_bert.encoder.layer.1.attention.output.dense.weight", "text_bert.encoder.layer.1.attention.output.dense.bias", "text_bert.encoder.layer.1.attention.output.LayerNorm.weight", "text_bert.encoder.layer.1.attention.output.LayerNorm.bias", "text_bert.encoder.layer.1.intermediate.dense.weight", "text_bert.encoder.layer.1.intermediate.dense.bias", "text_bert.encoder.layer.1.output.dense.weight", "text_bert.encoder.layer.1.output.dense.bias", "text_bert.encoder.layer.1.output.LayerNorm.weight", "text_bert.encoder.layer.1.output.LayerNorm.bias", "text_bert.encoder.layer.2.attention.self.query.weight", "text_bert.encoder.layer.2.attention.self.query.bias", "text_bert.encoder.layer.2.attention.self.key.weight", "text_bert.encoder.layer.2.attention.self.key.bias", "text_bert.encoder.layer.2.attention.self.value.weight", "text_bert.encoder.layer.2.attention.self.value.bias", "text_bert.encoder.layer.2.attention.output.dense.weight", "text_bert.encoder.layer.2.attention.output.dense.bias", "text_bert.encoder.layer.2.attention.output.LayerNorm.weight", "text_bert.encoder.layer.2.attention.output.LayerNorm.bias", "text_bert.encoder.layer.2.intermediate.dense.weight", "text_bert.encoder.layer.2.intermediate.dense.bias", "text_bert.encoder.layer.2.output.dense.weight", "text_bert.encoder.layer.2.output.dense.bias", "text_bert.encoder.layer.2.output.LayerNorm.weight", "text_bert.encoder.layer.2.output.LayerNorm.bias".

Do I need to change the --model argument passed to run.py? At the moment it is --model m4c_split. This is the command to reproduce the above error:

python tools/run.py --tasks vqa --datasets m4c_textvqa --model m4c_split --config configs/vqa/m4c_textvqa/tap_refine.yml --save_dir save/m4c_orcc_refine_test --run_type val --resume_file save/finetuned/textvqa_tap_ocrcc_best.ckpt
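
For context on the traceback above: this is the standard strict load_state_dict mismatch, where the model built from the config declares TextBERT encoder layers that the checkpoint does not contain. A minimal, self-contained sketch of the same failure mode (not TAP code, just an illustration with a stand-in module name):

# Minimal sketch (not TAP code): a checkpoint saved from a model with 0 encoder
# layers cannot be strictly loaded into a model that the config builds with 3
# layers -- PyTorch reports the extra layers as missing keys.
import torch.nn as nn

def build_text_bert(num_layers):
    # Stand-in for TextBERT: a named stack of identical layers.
    return nn.ModuleDict(
        {"text_bert": nn.Sequential(*[nn.Linear(8, 8) for _ in range(num_layers)])}
    )

ckpt = build_text_bert(0).state_dict()   # checkpoint trained with 0 layers (TAP-OCRCC style)
model = build_text_bert(3)               # config still asks for 3 layers

try:
    model.load_state_dict(ckpt)          # strict=True by default
except RuntimeError as err:
    print(err)                           # Missing key(s) in state_dict: "text_bert.0.weight", ...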
zyang-ur commented 2 years ago

Hi @priyamtejaswin ,

Sorry for the confusion. TAP-OCRCC uses 0/12 layers instead of 3/4 layers (detailed in Table 5 of the paper), so the layer numbers in the config file need to be updated. I'll try to update the config file; in the meantime, you could change the layer numbers to 0/12 (3->0, 4->12) at the lines below and see if that solves the problem.

https://github.com/microsoft/TAP/blob/352891f93c75ac5d6b9ba141bbe831477dcdd807/configs/vqa/m4c_textvqa/tap_refine.yml#L57
https://github.com/microsoft/TAP/blob/352891f93c75ac5d6b9ba141bbe831477dcdd807/configs/vqa/m4c_textvqa/tap_refine.yml#L66
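
As a quick sanity check before editing the config, one can inspect the checkpoint itself to confirm which layers it was trained with. A minimal sketch, assuming a pythia-style checkpoint that stores the weights under a "model" key and module names text_bert / mmt as in M4C; adjust if the structure differs:

# Hypothetical diagnostic, not part of the TAP repo: list which TextBERT and
# MMT encoder layers are present in the OCR-CC checkpoint. For TAP-OCRCC we
# would expect no text_bert.encoder layers (0) and twelve mmt.encoder layers.
import torch

ckpt = torch.load("save/finetuned/textvqa_tap_ocrcc_best.ckpt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # assumed pythia checkpoint layout

def layer_ids(prefix):
    # Collect the layer indices appearing under e.g. "text_bert.encoder.layer.<i>."
    return sorted({int(k[len(prefix):].split(".")[0])
                   for k in state_dict if k.startswith(prefix)})

print("text_bert layers:", layer_ids("text_bert.encoder.layer."))
print("mmt layers:      ", layer_ids("mmt.encoder.layer."))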