[Closed] soumyasj closed this issue 2 years ago
I have the same question. My results on ActivityNet are low as well.
The checkpoint you are evaluating is a Singularity-temporal model (the model name `ft_msrvtt_qa_singularity_temporal_17m.pth` contains `singularity_temporal`) that processes 4 frames with a temporal layer, so additional flags are required to construct this model.
Note that if you are evaluating Singularity-temporal models, the additional flags that construct the temporal model should be appended. For example, when evaluating a 2-layer temporal model:
```bash
bash scripts/eval_ret.sh didemo /path/to/pt_ckpt.pth eval_12frm local 1 \
    test_types=[val,test] video_input.num_frames_test=12 \
    add_temporal_embed=True \
    temporal_vision_encoder.enable=True \
    temporal_vision_encoder.num_layers=2
```
These are detailed in https://github.com/jayleicn/singularity#evaluation.
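For intuition, here is a minimal sketch of how dotlist-style overrides like the ones above become a nested config. This is a hand-rolled stand-in, not the repo's actual config code, and the base values below are illustrative, not the real defaults:

```python
import ast

def apply_dotlist(cfg: dict, overrides: list) -> dict:
    """Apply 'a.b.c=value' style overrides onto a nested dict."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        try:
            value = ast.literal_eval(raw_value)  # "2" -> 2, "True" -> True
        except (ValueError, SyntaxError):
            value = raw_value                    # keep plain strings as-is
        *parents, leaf = dotted_key.split(".")
        node = cfg
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return cfg

cfg = apply_dotlist(
    {"temporal_vision_encoder": {"enable": False, "num_layers": 0}},
    [
        "add_temporal_embed=True",
        "temporal_vision_encoder.enable=True",
        "temporal_vision_encoder.num_layers=2",
        "video_input.num_frames_test=12",
    ],
)
print(cfg["temporal_vision_encoder"]["num_layers"])  # 2
```

The point is that omitting `temporal_vision_encoder.enable=True` and friends leaves the defaults in place, so the constructed model has no temporal layer and cannot load the temporal checkpoint's weights correctly.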
Thanks, I missed it earlier and have added it now. One more thing I had to change to run the eval script is in shared_utils.py, at line 85 (https://github.com/jayleicn/singularity/blob/main/tasks/shared_utils.py#L85) and line 91 (https://github.com/jayleicn/singularity/blob/main/tasks/shared_utils.py#L91). I changed these lines to:

`layer_num = int(encoder_keys[3])`

and

`encoder_keys[3] = str(decoder_layer_num)`

respectively. Is this right?
This is the error I get when not changing anything in the code:

```
Traceback (most recent call last):
  File "tasks/vqa.py", line 399, in <module>
    main(cfg)
  File "tasks/vqa.py", line 256, in main
    find_unused_parameters=True
  File "/ssd_scratch/cvit/soumyajahagirdar/data/sid_final_trial_12frm/code/singularity/tasks/shared_utils.py", line 94, in setup_model
    layer_num = int(encoder_keys[4])
ValueError: invalid literal for int() with base 10: 'attention'
```
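A likely reading of that error (sketched here with hypothetical checkpoint key names, since the real keys depend on the model): splitting a BERT-style parameter key on `.` puts the layer number at index 3 when the key has no extra wrapper prefix, so `int(encoder_keys[4])` lands on the string `attention` instead of a number.

```python
# Hypothetical checkpoint key with no wrapper prefix:
key = "bert.encoder.layer.8.attention.self.query.weight"
parts = key.split(".")
print(parts[3])  # "8"         -> int(parts[3]) works
print(parts[4])  # "attention" -> int(parts[4]) raises ValueError

# With one extra prefix segment (the shape the original code expects),
# the layer number shifts to index 4:
prefixed = "text_encoder." + key
print(prefixed.split(".")[4])  # "8"
```

A prefix-agnostic alternative would be `parts[parts.index("layer") + 1]`, which locates the layer number correctly in both shapes instead of hard-coding the index.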
I had another doubt. In qa_msrvtt.yaml, if I change `test_types: [val,]` to `test_types: [val, test]`, it will evaluate on both the val and test sets after every epoch.
I wanted to know whether the test accuracy computed this way is valid. (I am asking because after all epochs are completed, the eval-after-training phase gives very low accuracy on both the val and test sets compared to the best accuracy recorded during training.) Please help.
My results on ActivityNet QA are still low. My inference command:
```bash
output_dir='pretrained_model/anet_qa/ft_anet_qa_singularity_temporal_17m'
pretrained_path='pretrained_model/anet_qa/ft_anet_qa_singularity_temporal_17m.pth'

python tasks/vqa.py \
    ${config_path} \
    output_dir=${output_dir} \
    pretrained_path=${pretrained_path} \
    evaluate=True \
    video_input.num_frames_test=12 \
    add_temporal_embed=True \
    temporal_vision_encoder.enable=True \
    temporal_vision_encoder.num_layers=2
```
In the config file:
test_types: [test, ]
The generated JSON file eval_res_best.json:

```json
{"test": {"overall": 2.75}}
```
Is there anything wrong with the commands or the config file?
Upvote for this issue. I'm running the single-frame 17M model on the ActivityNet-QA test set with:

```bash
bash scripts/eval_vqa.sh anet anet_qa/ft_anet_qa_singularity_17m.pth single_frame_17m/ local 1
```

after changing the num_layer index from 4 to 3 in lines 85 and 91 of shared_utils.py (same as @soumyasj), and I get the result JSON:

```json
{"test": {"overall": 11.16}}
```
Dear authors,
I am trying to reproduce the MSRVTT-QA results that use the multimodal encoder as a decoder. After running scripts/eval_vqa.sh on the MSRVTT-QA test set with the "ft_msrvtt_qa_singularity_temporal_17m.pth" checkpoint, I am getting very low accuracy.
The following is the command used to run the script:
```bash
bash scripts/eval_vqa.sh msrvtt "ft_msrvtt_qa_singularity_temporal_17m.pth" reproduce_original_result_on_test_msrvtt_qa local 1
```
Below is the accuracy obtained on the MSRVTT-QA test split:

[screenshot of test-split accuracy; image link broken]
Can you please let me know if there is some issue with the evaluation code, and if so, how to reproduce the correct results?