microsoft / SwinBERT

Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
https://arxiv.org/abs/2111.13196
MIT License

Questions About Frames extracting #35

Open fringe-k opened 1 year ago

fringe-k commented 1 year ago

At line 85 of SwinBERT/create_image_frame_tsv.py there is `current_image_path = previous_image_path`.

Does this mean that when the number of extracted images is less than num_frames, you pad them up to num_frames by repeating the last image? This step is a little confusing to me. Is the result any different from not copying the last image?
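For context, the padding behavior the line seems to implement can be sketched like this (a minimal reconstruction of the intent of line 85, not the repo's exact code; `pad_frames` is a hypothetical name):

```python
# Sketch: pad a short frame list up to num_frames by repeating the last
# extracted frame, which appears to be what
# `current_image_path = previous_image_path` does in create_image_frame_tsv.py.
def pad_frames(frame_paths, num_frames):
    """Return exactly num_frames paths, repeating the last one if short."""
    if not frame_paths:
        raise ValueError("no frames were extracted for this video")
    padded = list(frame_paths)
    while len(padded) < num_frames:
        padded.append(padded[-1])  # reuse the previous image path
    return padded[:num_frames]

print(pad_frames(["f1.jpg", "f2.jpg"], 4))
# → ['f1.jpg', 'f2.jpg', 'f2.jpg', 'f2.jpg']
```

The model expects a fixed number of frames per clip, so short videos have to be padded somehow; repeating the last frame keeps the tensor shape valid without introducing new content.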

liyaowei-stu commented 1 year ago

I followed `./prepro/extract_youcook2_frms.sh` to run `./prepro/extract_frames.py`, but it doesn't seem to work; I get the following output:

```shell
python ./prepro/extract_frames.py \
    --video_root_dir ./datasets/MSRVTT-v2/videos \
    --save_dir ./datasets/MSRVTT-v2/ \
    --video_info_tsv ./datasets/MSRVTT-v2/val.img.tsv \
    --num_frames 32
0it [00:00, ?it/s]
```

Am I doing something wrong? Thank you very much!
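The `0it [00:00, ?it/s]` line is tqdm reporting that the loop over video records found zero entries, which usually means the `--video_info_tsv` path is wrong or the TSV is empty rather than the script crashing. A quick way to check is to count the records yourself (a hedged sketch; `count_tsv_rows` is a hypothetical helper, and the exact TSV layout may differ from this two-column example):

```python
import csv
import io

def count_tsv_rows(fobj):
    """Count tab-separated records; 0 rows would explain tqdm's `0it`."""
    return sum(1 for _ in csv.reader(fobj, delimiter="\t"))

# Example with an in-memory TSV standing in for val.img.tsv:
sample = io.StringIO("video0\tpath/to/video0.mp4\nvideo1\tpath/to/video1.mp4\n")
print(count_tsv_rows(sample))  # → 2
```

If the count on your real file is nonzero, the next thing to verify is that `--video_root_dir` actually contains the files the TSV refers to.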

tiesanguaixia commented 1 year ago

> At line 85 of SwinBERT/create_image_frame_tsv.py there is `current_image_path = previous_image_path`.
>
> Does this mean that when the number of extracted images is less than num_frames, you pad them up to num_frames by repeating the last image? This step is a little confusing to me. Is the result any different from not copying the last image?

Hi! Have you reproduced the results in the paper? May I ask whether you adjusted the values of 'loss_sparse_w' and 'learning_rate' in the command? For 'loss_sparse_w', I guess it is the regularization weight on $Loss_{SPARSE}$, i.e. the $\lambda$ in the paper. In the appendix, it seems that for MSR-VTT the model performs best when $\lambda = 5$. But why is the default value of 'loss_sparse_w' in the command 0.5? Do I need to adjust it to 5? Thank you a lot!
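For reference, my reading of the paper is that the training objective combines the captioning loss with the sparsity regularizer on the learnable attention mask, with $\lambda$ (exposed as 'loss_sparse_w' in the training command) controlling the trade-off; a sketch of that objective, under my assumptions about the notation:

```latex
% Total objective as I understand it: caption loss plus the sparsity
% regularizer on the learnable attention mask, weighted by \lambda
% (the value passed as loss_sparse_w).
\[
  \mathcal{L} = \mathcal{L}_{\mathrm{CAPTION}} + \lambda \, \mathcal{L}_{\mathrm{SPARSE}}
\]
```

So a larger 'loss_sparse_w' pushes the attention mask toward sparser patterns at the possible expense of caption quality, which is presumably why the appendix sweeps over $\lambda$ per dataset.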