microsoft / VideoX

VideoX: a collection of video cross-modal models

About fps and time_unit of videos #55

Closed zpyi closed 2 years ago

zpyi commented 2 years ago

As described in the paper: "Specifically, videos are decoded at 25 fps and the output of the last average pooling layer are extracted for every 16 consecutive frames. Therefore, each video clip corresponds to 0.64 second". Take TACoS as an example: the fps in train.json is 29.4, so I am confused about how the videos are decoded at 25 fps. Did you discard some frames? If we decode a video at its original fps, each clip would span 16/29.4 ≈ 0.54 seconds instead. Looking forward to your reply, thanks!

Sy-Zhang commented 2 years ago


We use ffmpeg to convert every video to 25 fps, so 16 consecutive frames always span 16/25 = 0.64 seconds regardless of the original frame rate. As mentioned in https://trac.ffmpeg.org/wiki/ChangingFrameRate, "When the frame rate is changed, ffmpeg will drop or duplicate frames as necessary to achieve the targeted output frame rate".
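
For anyone reproducing this, here is a minimal sketch of the conversion and the resulting time unit. The exact ffmpeg options the authors used are not stated in this thread; the `fps` video filter is one of the methods described on the linked wiki page and is an assumption here, as are the file names and the helper names `build_ffmpeg_command` and `clip_duration`.

```python
import shlex

FRAMES_PER_CLIP = 16  # frames per feature clip, from the paper quote
TARGET_FPS = 25       # target frame rate, from the paper quote


def build_ffmpeg_command(src, dst, fps=TARGET_FPS):
    """Build an ffmpeg argument list that re-encodes src at a constant frame rate.

    Assumption: the fps video filter is used; it drops or duplicates frames
    as needed to reach the requested rate.
    """
    # -y overwrites dst if it already exists.
    return ["ffmpeg", "-y", "-i", src, "-filter:v", f"fps={fps}", dst]


def clip_duration(fps, frames_per_clip=FRAMES_PER_CLIP):
    """Seconds of video covered by one clip of frames_per_clip frames."""
    return frames_per_clip / fps


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_ffmpeg_command("input.avi", "output_25fps.mp4")
    print(shlex.join(cmd))
    # To actually run the conversion (requires ffmpeg on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)

    print("time unit at 25 fps:", clip_duration(25))
    print("time unit at 29.4 fps:", clip_duration(29.4))
```

After conversion, timestamps in seconds map to clip indices by dividing by the 25 fps time unit, so annotations given in seconds need no change; only frame-index annotations tied to the original frame rate would have to be rescaled.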