Question about the video encoder ViT

jayleicn / moment_detr

[NeurIPS 2021] Moment-DETR code and QVHighlights dataset

https://arxiv.org/abs/2107.09609

MIT License


Question about the video encoder ViT #42

Open Summer-seu opened 11 months ago

Summer-seu commented 11 months ago

Hi, thanks for your great work! I have a question: how do you fuse the image features from a 2-second clip into a single clip-level video feature, given that ViT is a feature extraction model for images, not videos?

jayleicn commented 10 months ago

We sample one video frame (an image) every 2 seconds and extract an embedding for it, so each 2-second clip is represented by the embedding of a single frame.
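For readers who want to see the sampling step concretely, here is a minimal sketch of "one frame every 2 seconds, one embedding per frame". It is not the repository's actual extraction code. The names `sample_frame_indices`, `extract_clip_features`, and `encode_frame` are hypothetical, `encode_frame` stands in for whatever ViT image encoder is used, and taking the first frame of each 2-second window is an assumption (the reply does not say which frame within the window is picked).

```python
from typing import Any, Callable, List, Sequence


def sample_frame_indices(num_frames: int, fps: float, clip_len_s: float = 2.0) -> List[int]:
    """Return one frame index per `clip_len_s`-second window.

    Assumption: the first frame of each window is used.
    """
    if num_frames <= 0 or fps <= 0 or clip_len_s <= 0:
        return []
    step = fps * clip_len_s  # number of frames spanned by one clip
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


def extract_clip_features(
    frames: Sequence[Any],
    fps: float,
    encode_frame: Callable[[Any], List[float]],
    clip_len_s: float = 2.0,
) -> List[List[float]]:
    """Encode one sampled frame per clip; no fusion across frames.

    `encode_frame` is a hypothetical stand-in for an image encoder
    (for example a ViT) that maps a single frame to an embedding.
    """
    indices = sample_frame_indices(len(frames), fps, clip_len_s)
    return [encode_frame(frames[i]) for i in indices]


if __name__ == "__main__":
    # Toy example: a 10-second "video" at 30 fps, with integers as frames
    # and a dummy encoder, just to show the shape of the result.
    dummy_frames = list(range(300))
    features = extract_clip_features(dummy_frames, 30.0, lambda f: [float(f)])
    print(len(features))  # one embedding per 2-second clip
```

Under these assumptions a video yields one embedding per 2-second clip, so there is no temporal pooling or fusion step inside a clip.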