microsoft / SwinBERT

Research code for CVPR 2022 paper "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning"
https://arxiv.org/abs/2111.13196
MIT License

TVC Dataset #18

Open engindeniz opened 2 years ago

engindeniz commented 2 years ago

Hi,

Thanks for the great work and publicly available code.

For the TVC dataset, only video frames sampled at 3 FPS are officially provided, due to copyright issues. According to your code, it seems that you use videos from the TVC dataset. How did you obtain the videos?

Thanks in advance.

linjieli222 commented 2 years ago

Hi there,

Thanks for your interest in this project. For TVC, we simply concatenated the released 3 FPS frames into a video with ffmpeg. From that video, we extracted 32/48/64 frames to construct the frame TSVs for training and inference.

Our end-to-end pipeline is general-purpose and supports on-the-fly video decoding for captioning tasks similar to TVC. When evaluating on TVC itself, our implementation takes the frame TSVs as input for both training and testing.
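
For readers who want to reproduce the two steps described above, here is a minimal sketch in Python. It is not the authors' actual script: the thread does not give the exact ffmpeg command or the sampling strategy, so the frame filename pattern (`%05d.jpg`), the encoder settings, the helper names `build_ffmpeg_cmd` and `sample_frame_indices`, and the uniform midpoint sampling are all assumptions made for illustration.

```python
from typing import List


def build_ffmpeg_cmd(frame_pattern: str, out_path: str, fps: int = 3) -> List[str]:
    """Build an ffmpeg command that joins numbered frames into one video.

    Assumption: frames are named sequentially (e.g. 00001.jpg, 00002.jpg),
    so a printf-style pattern such as 'clip_dir/%05d.jpg' matches them.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-framerate", str(fps),    # input rate: the released frames are 3 FPS
        "-i", frame_pattern,       # image-sequence input pattern
        "-c:v", "libx264",         # assumed encoder, not stated in the thread
        "-pix_fmt", "yuv420p",     # widely compatible pixel format
        out_path,
    ]


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick num_samples indices spread uniformly over total_frames.

    Assumption: one frame is taken from the middle of each of num_samples
    equal segments. Indices repeat when total_frames < num_samples.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    return [
        ((2 * i + 1) * total_frames) // (2 * num_samples)
        for i in range(num_samples)
    ]


# Hypothetical paths, for illustration only.
cmd = build_ffmpeg_cmd("frames/clip_0001/%05d.jpg", "videos/clip_0001.mp4")
indices = sample_frame_indices(total_frames=96, num_samples=32)
```

To actually run the command, pass `cmd` to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. Calling `sample_frame_indices` with 32, 48, or 64 samples gives the frame positions that could then be written into the frame TSVs.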