Closed: qjyyyy closed this issue 2 years ago
I don't know if I misunderstood the paper.
Hi there, thank you so much for taking an interest in our work! Yes, your understanding is correct. This kind of interaction allows us to generate a video representation that is most relevant to the given text. It is permissible since we only use information about a given text-video pair to generate our aggregated video representation. For text-to-video retrieval, we generate such a video representation using every video in our index set, and for video-to-text retrieval, we generate such a video representation using every text in our index set. Does this answer your question? Thanks
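To make the retrieval procedure above concrete, here is a minimal sketch of scoring every text against every video by re-pooling each video's frames conditioned on each text. The `pool` callback stands in for the paper's attention-based aggregation, and the use of cosine similarity is my assumption, not a detail confirmed in this thread:

```python
import numpy as np

def pairwise_scores(text_embs, video_frame_embs, pool):
    """Score every (text, video) pair by re-pooling the video per text.

    text_embs: (T, d) array of text embeddings
    video_frame_embs: list of (n_i, d) arrays, one per video
    pool: fn(text_emb, frames) -> (d,) text-conditioned video representation
    """
    T, V = len(text_embs), len(video_frame_embs)
    S = np.zeros((T, V))
    for i, t in enumerate(text_embs):
        for j, frames in enumerate(video_frame_embs):
            v = pool(t, frames)  # aggregated video rep, conditioned on text i
            # Cosine similarity between the text and its conditioned video rep
            S[i, j] = (t @ v) / (np.linalg.norm(t) * np.linalg.norm(v))
    return S
```

Row `i` of `S` then ranks all videos for text `i` (text-to-video retrieval), and column `j` ranks all texts for video `j` (video-to-text retrieval).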
Thank you for your patient answer. The interaction of text and video is ingenious. But won't it be computationally expensive at test time, since every possible text-video pair needs to be fed through the model?
We have a section on this question of computation in our appendix if you're interested. In short, in a large-scale production system, we can use re-ranking to very efficiently scale X-Pool while maintaining strong performance.
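The re-ranking idea mentioned above can be sketched as a generic two-stage retrieval loop: a cheap, interaction-free score (here, my assumption of a dot product against mean-pooled frames) first narrows the index down to the top-k candidates, and the expensive text-conditioned scoring runs only on those. The function name and signature are illustrative, not the authors' implementation:

```python
import numpy as np

def rerank_retrieve(text_emb, video_frame_embs, expensive_score, k=10):
    """Two-stage retrieval: cheap pre-ranking, then expensive re-ranking of top-k.

    text_emb: (d,) query text embedding
    video_frame_embs: list of (n_i, d) arrays, one per video
    expensive_score: fn(text_emb, frames) -> float, e.g. a text-conditioned
                     pooling score that looks at every frame
    """
    # Stage 1: cheap score via mean-pooled frames (no text-video interaction),
    # computable offline for the whole index
    cheap = np.array([frames.mean(axis=0) @ text_emb
                      for frames in video_frame_embs])
    top_k = np.argsort(-cheap)[:k]
    # Stage 2: run the expensive pairwise score only on the k candidates
    return sorted(top_k,
                  key=lambda i: -expensive_score(text_emb, video_frame_embs[i]))
```

This keeps the expensive interaction cost at O(k) per query instead of O(index size), which is the usual way such models are scaled in production.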
I get it! Thanks again for your patience!
Hello, I have benefited a lot after reading your paper. However, I have a question about the cross-modal language-video attention.
The query is obtained from the text embedding, while the key and value are obtained from the frame embeddings. Scaled dot-product attention is then used to get the aggregated video representation. This means the aggregated video representation incorporates information from the text embedding, which will naturally inflate the similarity score between the two.
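For reference, the mechanism described above can be written as a minimal single-head sketch of text-conditioned attention pooling. The projection matrices `Wq`, `Wk`, `Wv` and the function name are placeholders; this is a simplified illustration of the idea, not the exact X-Pool implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_pool(text_emb, frame_embs, Wq, Wk, Wv):
    """Aggregate frames into one video vector, weighted by relevance to the text.

    text_emb: (d,) text embedding; frame_embs: (n_frames, d) frame embeddings
    Wq, Wk, Wv: (d, d) projection matrices (placeholders for learned weights)
    """
    q = text_emb @ Wq                        # query from the text, (d,)
    k = frame_embs @ Wk                      # keys from the frames, (n, d)
    v = frame_embs @ Wv                      # values from the frames, (n, d)
    scores = k @ q / np.sqrt(q.shape[0])     # scaled dot-product scores, (n,)
    weights = softmax(scores)                # one attention weight per frame
    return weights @ v                       # aggregated video representation, (d,)
```

The text only decides *how the frames are weighted*; the output is still a combination of frame values, which is why the authors consider the interaction legitimate for retrieval.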
So I wonder whether this kind of interaction is allowed in text-video cross-modal retrieval. Looking forward to your reply. Thanks!