Closed JW-xiilab closed 1 year ago
We extract video features every 2 seconds, so 60 features cover a full-length video of max 120 seconds. This is mentioned in Section 4.1 of our paper:
For video, we use SlowFast [5] and the video encoder (ViT-B/32) of CLIP [31] to extract features every 2 seconds.
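As a rough illustration (not the authors' extraction code), the 2-second stride quoted above implies a simple mapping from video length to feature count:

```python
import math

STRIDE_SEC = 2.0  # one feature every 2 seconds (Sec. 4.1 of the paper)

def num_features(duration_sec: float) -> int:
    """How many clip-level features a video of this length produces."""
    return math.ceil(duration_sec / STRIDE_SEC)

print(num_features(120))  # → 60
```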
Thanks for the quick reply.
With all due respect, I just want to clarify a couple of things.
Q1. Isn't it 75 features (clips) for a max of 150 seconds per segment?
These raw videos are then segmented into 150-second short videos for annotation
Q2. Regardless of the max full length of the video, all videos in the raw dataset seem to start with "start_time: 60" for every video ID ([Data_Readme](https://github.com/jayleicn/moment_detr/tree/main/data#qvhighlights-dataset)). Does this mean the original videos have been cropped starting at 60 seconds? The feature dataset also seems to follow the same format, e.g., -5vXZfppKE_60.0_210.0.mp4 in the video dataset and -5vXZfppKE_60.0_210.0.npz in the features dataset.
Oh sorry, I misunderstood your question. Q1: yes, the videos have a max length of 150 seconds. Q2: yes, we cropped the original videos starting at 60 seconds into 150-second clips for annotation, since the start of a video usually contains less relevant content, e.g., "please subscribe" requests, promotions, or other high-level summaries.
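A minimal sketch of the cropping/naming convention described above (the helper names are hypothetical; the 60-second offset, 150-second window, and filename pattern come from this thread):

```python
def crop_window(offset_sec: float = 60.0, max_len_sec: float = 150.0):
    """(start, end) window in seconds used when segmenting a raw video."""
    return offset_sec, offset_sec + max_len_sec

def feature_filename(video_id: str) -> str:
    """Feature-file name pattern seen in the dataset releases."""
    start, end = crop_window()
    return f"{video_id}_{start}_{end}.npz"

print(feature_filename("-5vXZfppKE"))  # → -5vXZfppKE_60.0_210.0.npz
```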
Thank you once again.
Great, question solved!!
Hi, thanks a lot for the decent work.
As I am working through your work, I just realised that all videos are cropped starting at 60 seconds (i.e., all first segments start at 60 seconds). Is there a reason the videos are preprocessed this way? I couldn't find any mention of it in the paper.
Does this mean the model is not trained with the first 60 seconds of each video?
Thanks in advance. And sorry if I have just missed this point in the paper/repository.