Qiliqing closed this issue 2 years ago
Thanks for your interest in our work and for pointing this out. As described in Sec. 3.3 of the main paper, the input to Video-specific Prompting is the average of the frame features along the temporal dimension. Please refer to the code; we will revise the confusing figure as soon as possible. Thanks!
Please feel free to ping me if you have further questions.
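For anyone reading along, here is a minimal sketch of what "the average of the frame features along the temporal dimension" could look like. It is not the repository's code: the shapes (B, T, D), the random stand-in data, and the name `video_feature` are assumptions made for illustration, and only `img_features` is a name taken from this thread.

```python
import numpy as np

# Hypothetical sizes: batch B, frames per video T, feature dimension D.
B, T, D = 2, 8, 512
rng = np.random.default_rng(0)

# Stand-in for the per-frame features (called "img_features" in the question).
img_features = rng.standard_normal((B, T, D))

# Average over the temporal axis (axis 1) to get one feature vector per video.
video_feature = img_features.mean(axis=1)

print(video_feature.shape)
```

The point is only that the T per-frame vectors are collapsed into a single D-dimensional vector per video before they are used as the visual input; the actual code may organize this differently.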
Got it, thanks!
Hi, thanks for your great paper.
In Fig. 2 of the paper, it looks like "Video-specific Prompting" uses the output of the "Multi-frame Integration Transformer" as its visual feature input. But in the implementation code, you send the output "img_features" of the "Cross-frame Communication Transformer" into "Video-specific Prompting".
Is the figure in the paper wrong?