jayleicn / singularity

[ACL 2023] Official PyTorch code for Singularity model in "Revealing Single Frame Bias for Video-and-Language Learning"
https://arxiv.org/abs/2206.03428
MIT License

Different accuracy after evaluating on test set of MSRVTT-QA #17

Closed soumyasj closed 2 years ago

soumyasj commented 2 years ago

Dear authors,

I am trying to reproduce the MSRVTT-QA results that use the multimodal encoder as a decoder. After running scripts/eval_vqa.sh on the MSRVTT-QA test set with the "ft_msrvtt_qa_singularity_temporal_17m.pth" checkpoint, I am getting very low accuracy.

The following is the command used to run the script:

bash scripts/eval_vqa.sh msrvtt "ft_msrvtt_qa_singularity_temporal_17m.pth" reproduce_original_result_on_test_msrvtt_qa local 1

Below is the accuracy obtained on the test split of MSRVTT-QA:

[screenshot: op_msrvtt_eval results]

Can you please let me know whether there is an issue with the evaluation code, and if so, how to reproduce the correct results?

ShuangLI59 commented 2 years ago

I have the same question. My results on ActivityNet are low as well.

jayleicn commented 2 years ago

The checkpoint you are evaluating is a Singularity-temporal model (the model name string ft_msrvtt_qa_singularity_temporal_17m.pth contains singularity_temporal) that processes 4 frames with a temporal layer, so additional flags are required to construct this model.

Note that when evaluating Singularity-temporal models, the additional flags that construct the temporal model must be appended. For example, when evaluating a 2-layer temporal model:

bash scripts/eval_ret.sh didemo /path/to/pt_ckpt.pth eval_12frm local 1 \
 test_types=[val,test] video_input.num_frames_test=12 \
 add_temporal_embed=True \
 temporal_vision_encoder.enable=True \
 temporal_vision_encoder.num_layers=2

These are detailed in https://github.com/jayleicn/singularity#evaluation.
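
For the MSRVTT-QA checkpoint in question, the analogous eval_vqa.sh call would presumably look like the sketch below; the flag names are taken from the retrieval example above, while the num_frames_test and num_layers values are assumptions, so check the README for this checkpoint's exact settings:

# sketch only: num_frames_test and num_layers below are assumed values, not confirmed settings
bash scripts/eval_vqa.sh msrvtt "ft_msrvtt_qa_singularity_temporal_17m.pth" reproduce_original_result_on_test_msrvtt_qa local 1 \
 video_input.num_frames_test=4 \
 add_temporal_embed=True \
 temporal_vision_encoder.enable=True \
 temporal_vision_encoder.num_layers=2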

soumyasj commented 2 years ago

Thanks, I missed that earlier and have added it now. One more thing I had to change to run the eval script is in shared_utils.py, at line 85 (https://github.com/jayleicn/singularity/blob/main/tasks/shared_utils.py#L85) and line 91 (https://github.com/jayleicn/singularity/blob/main/tasks/shared_utils.py#L91).

I changed these lines to:

layer_num = int(encoder_keys[3])

encoder_keys[3] = str(decoder_layer_num)

respectively. Is this right?

This is the error I get when not changing anything in the code:

Traceback (most recent call last):
  File "tasks/vqa.py", line 399, in <module>
    main(cfg)
  File "tasks/vqa.py", line 256, in main
    find_unused_parameters=True
  File "/ssd_scratch/cvit/soumyajahagirdar/data/sid_final_trial_12frm/code/singularity/tasks/shared_utils.py", line 94, in setup_model
    layer_num = int(encoder_keys[4])
ValueError: invalid literal for int() with base 10: 'attention'
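
For context, the failing line parses a transformer layer number out of a dotted state-dict key at a hardcoded position. A minimal standalone Python sketch of why that breaks (the key layouts below are assumptions inferred from the traceback, not copied from the repo):

# Checkpoint keys may or may not contain an extra "bert." segment, so the
# position of the layer number in the dotted key differs between checkpoints
# (assumed layouts for illustration).
key_a = "text_encoder.bert.encoder.layer.8.attention.self.query.weight"
key_b = "text_encoder.encoder.layer.8.attention.self.query.weight"

for key in (key_a, key_b):
    parts = key.split(".")
    # int(parts[4]) works for key_a (parts[4] == "8") but fails for key_b,
    # where parts[4] == "attention", producing the ValueError shown above.
    print(key, "->", parts[4])
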
soumyasj commented 2 years ago

I had another question. In qa_msrvtt.yaml, if I change test_types: [val,] to test_types: [val, test], it will evaluate on both the val and test sets after every epoch. I wanted to know whether the test accuracy computed here is valid. (I am asking because, after all epochs are completed, the post-training evaluation gives very low accuracy on both the val and test sets compared to the best accuracy recorded during training.) Please help.

ShuangLI59 commented 2 years ago

My results on ActivityNet QA are still low. My inference command:

output_dir='pretrained_model/anet_qa/ft_anet_qa_singularity_temporal_17m'
pretrained_path='pretrained_model/anet_qa/ft_anet_qa_singularity_temporal_17m.pth'

python tasks/vqa.py \
    ${config_path} \
    output_dir=${output_dir} \
    pretrained_path=${pretrained_path} \
    evaluate=True \
    video_input.num_frames_test=12 \
    add_temporal_embed=True \
    temporal_vision_encoder.enable=True \
    temporal_vision_encoder.num_layers=2

In the config file:
test_types: [test, ]

The generated JSON file eval_res_best.json: { "test": { "overall": 2.75 } }

Is there anything wrong with the commands or the config file?

RealAntonVoronov commented 2 years ago

Upvote for this issue. I'm running the single-frame 17M model on the ActivityNet-QA test set with:

bash scripts/eval_vqa.sh anet anet_qa/ft_anet_qa_singularity_17m.pth single_frame_17m/ local 1

after changing the num_layer index from 4 to 3 on lines 85 and 91 of shared_utils.py (same as @soumyasj), and I get the result JSON: { "test": { "overall": 11.16 } }

jayleicn commented 2 years ago

Hi everyone, thanks for your interest in our work and sorry for the late response. This error is fixed in the latest commit ee77824. I tested it against the ActivityNet-QA val split, and it got 49.6 val accuracy.
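
For reference, a robust version of this parsing would locate the layer number dynamically rather than hardcoding its index; the snippet below is an illustration of that idea, not the actual code in commit ee77824:

def get_layer_num(key: str) -> int:
    """Return the first all-digit segment of a dotted state-dict key."""
    # Scanning for the digit segment avoids assuming a fixed key layout
    # (illustrative helper, not copied from the repo).
    for part in key.split("."):
        if part.isdigit():
            return int(part)
    raise ValueError(f"no layer number found in key: {key}")

# Both key layouts now resolve to layer 8:
assert get_layer_num("text_encoder.bert.encoder.layer.8.attention.self.query.weight") == 8
assert get_layer_num("text_encoder.encoder.layer.8.attention.self.query.weight") == 8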