Closed JW-xiilab closed 1 year ago
We extract video features every 2 seconds, so 60 features cover a full-length video of max 120 seconds. This is mentioned in Section 4.1 of our paper:
For video, we use SlowFast [5] and the video encoder (ViT-B/32) of CLIP [31] to extract features every 2 seconds.
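As a rough illustration (not the authors' extraction code), the 2-second stride quoted above implies a simple mapping from video length to feature count:

```python
import math

STRIDE_SEC = 2.0  # one feature every 2 seconds (Sec. 4.1 of the paper)

def num_features(duration_sec: float) -> int:
    """How many clip-level features a video of this length produces."""
    return math.ceil(duration_sec / STRIDE_SEC)

print(num_features(120))  # → 60
```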
Thanks for the quick reply.
With all due respect, I just want to clarify a couple of things.
Q1. Isn't it 75 features (clips) for a max of 150 seconds per segment?
These raw videos are then segmented into 150-second short videos for annotation
Q2. Regardless of the max full length of the video, all videos in the raw dataset seem to start with "start_time: 60" for every video ID ([Data_Readme](https://github.com/jayleicn/moment_detr/tree/main/data#qvhighlights-dataset)). Does this mean the original videos have been cropped starting at 60 seconds? The feature dataset also seems to follow the same format, e.g., -5vXZfppKE_60.0_210.0.mp4 in the video dataset and -5vXZfppKE_60.0_210.0.npz in the features dataset.
Oh sorry, I misunderstood your question. Q1: yes, the videos have a max length of 150 seconds. Q2: yes, we cropped the original videos starting at 60 seconds into 150-second clips for annotation, since the start of a video usually contains less relevant content, e.g., "please subscribe" requests, promotions, or other high-level summaries.
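A minimal sketch of the cropping/naming convention described above (the helper names are hypothetical; the 60-second offset, 150-second window, and filename pattern come from this thread):

```python
def crop_window(offset_sec: float = 60.0, max_len_sec: float = 150.0):
    """(start, end) window in seconds used when segmenting a raw video."""
    return offset_sec, offset_sec + max_len_sec

def feature_filename(video_id: str) -> str:
    """Feature-file name pattern seen in the dataset releases."""
    start, end = crop_window()
    return f"{video_id}_{start}_{end}.npz"

print(feature_filename("-5vXZfppKE"))  # → -5vXZfppKE_60.0_210.0.npz
```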
Thank you once again.
Great, question solved!!
Hi, thanks a lot for the decent work.
As I am working through your work, I just realised that all videos are cropped starting at 60 seconds (i.e., all first segments start at 60 seconds). Is there a reason the videos are preprocessed this way? I couldn't find any mention of it in the paper.
Does this mean the model is not trained with the first 60 seconds of each video?
Thanks in advance. And sorry if I have just missed this point in the paper/repository.