microsoft / VideoX

VideoX: a collection of video cross-modal models

[X-CLIP] The input of "Video-specific Prompting" #58

Closed by Qiliqing 2 years ago

Qiliqing commented 2 years ago

Hi, thanks for your great paper.

In Fig. 2 of the paper, it looks like "Video-specific Prompting" uses the output of the "Multi-frame Integration Transformer" as its visual feature input. But in the implementation, you pass the output "img_features" of the "Cross-frame Communication Transformer" into "Video-specific Prompting".

Is the figure in the paper wrong?

nbl97 commented 2 years ago

Thanks for your interest in our work and for pointing this out. As described in Sec. 3.3 of the main paper, the input of Video-specific Prompting is the average of the frame features along the temporal dimension. Please refer to the code; we will revise the confusing figure as soon as possible. Thanks!
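For readers following along, the temporal averaging described above can be sketched roughly as follows. This is only an illustration, not the repository's code: it uses NumPy rather than the project's framework, and the function name `temporal_average`, the variable `video_context`, and the shape `(batch, num_frames, dim)` are assumptions chosen for the example.

```python
import numpy as np


def temporal_average(frame_features: np.ndarray) -> np.ndarray:
    """Average per-frame features over the temporal axis.

    Assumed (hypothetical) layout: (batch, num_frames, dim).
    Returns an array of shape (batch, dim).
    """
    if frame_features.ndim != 3:
        raise ValueError("expected shape (batch, num_frames, dim)")
    # Axis 1 is taken to be the temporal (frame) dimension.
    return frame_features.mean(axis=1)


# Hypothetical sizes: 2 videos, 8 frames each, 512-dim frame features.
rng = np.random.default_rng(0)
img_features = rng.standard_normal((2, 8, 512))

# One pooled vector per video; per the reply above, this kind of average
# (not the Multi-frame Integration Transformer output) feeds the prompting step.
video_context = temporal_average(img_features)
print(video_context.shape)
```

The point of the sketch is only that the per-frame features are pooled by a plain mean before being handed to Video-specific Prompting; the actual tensor layout in the repository may differ.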

nbl97 commented 2 years ago

Please feel free to ping me if you have further questions.

Qiliqing commented 2 years ago

Got it, thanks!