layer6ai-labs / xpool

https://layer6ai-labs.github.io/xpool/

Question About Cross-Modal Language-Video Attention #1

Closed: qjyyyy closed this issue 2 years ago

qjyyyy commented 2 years ago

Hello, I learned a lot from reading your paper. However, I have a question about the cross-modal language-video attention.

The query is computed from the text embedding, while the keys and values are computed from the frame embeddings. Scaled dot-product attention is then used to produce the aggregated video representation. This means the aggregated video representation receives information from the text embedding, which would seem to make the similarity score between the two naturally high.

So I wonder whether this kind of interaction is allowed in text-video cross-modal retrieval. Looking forward to your reply. Thanks!

qjyyyy commented 2 years ago

I don't know if I misunderstood the paper.

NoelVouitsis commented 2 years ago

Hi there, thank you so much for taking an interest in our work! Yes, your understanding is correct. This kind of interaction allows us to generate a video representation that is most relevant to the given text. It is allowed since we are only using information about a given text-video pair to generate our aggregated video representation. For text-to-video retrieval, we would generate such a video representation for the query text using every video in our index set, and for video-to-text retrieval, we would generate such a video representation for the query video using every text in our index set. Does this answer your question? Thanks
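For readers following along, here is a minimal sketch of the text-to-video case described above. It is not the X-Pool implementation: it assumes single-head attention with no learned projections, uses plain Python lists as embeddings, and the names `text_conditioned_pool` and `rank_videos` are hypothetical. The point it illustrates is that each pooled video representation depends only on the one text-video pair being scored.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def text_conditioned_pool(text_emb, frame_embs):
    """Aggregate frame embeddings with scaled dot-product attention.

    Assumption: the text embedding is used directly as the query and the
    frame embeddings as keys and values (no learned projections).
    """
    scale = math.sqrt(len(text_emb))
    weights = softmax([dot(text_emb, f) / scale for f in frame_embs])
    return [
        sum(w * f[i] for w, f in zip(weights, frame_embs))
        for i in range(len(text_emb))
    ]


def rank_videos(text_emb, video_index):
    """Score every indexed video against one text query, best first."""
    scores = {
        vid: cosine(text_emb, text_conditioned_pool(text_emb, frames))
        for vid, frames in video_index.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy index: video "a" has one frame aligned with the query, "b" has none.
query = [1.0, 0.0]
index = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.0, 1.0], [0.0, 1.0]],
}
ranking = rank_videos(query, index)
```

In this toy example the attention can only reweight frames the video actually contains, so video "b", which has no frame matching the query, still scores low even though its pooling was conditioned on the text.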

qjyyyy commented 2 years ago

Thank you for your patience in answering. The interaction of text and video is ingenious. But won't it be computationally expensive at test time, since every possible text-video pair needs to be passed through the model?

NoelVouitsis commented 2 years ago

We have a section on this question of computation in our appendix if you're interested. In short, in a large-scale production system, we can use re-ranking to very efficiently scale X-Pool while maintaining strong performance.
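The appendix is not reproduced in this thread, so the following is only a sketch of one common way re-ranking can be set up, under stated assumptions: a cheap text-independent first stage (mean-pooled frames, which could be precomputed offline) shortlists `top_k` candidates, and the more expensive text-conditioned pooling is applied only to that shortlist. The function names and the choice of mean pooling for the first stage are hypothetical and not taken from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def mean_pool(frames):
    """Text-independent video embedding; can be cached per video."""
    n = len(frames)
    return [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]


def attention_pool(text, frames):
    """Text-conditioned pooling (single head, no projections: an assumption)."""
    scale = math.sqrt(len(text))
    logits = [dot(text, f) / scale for f in frames]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [
        sum((e / total) * f[i] for e, f in zip(exps, frames))
        for i in range(len(text))
    ]


def retrieve_with_rerank(text, video_index, top_k):
    # Stage 1: cheap scoring of every video with text-independent embeddings.
    coarse = {vid: cosine(text, mean_pool(fr)) for vid, fr in video_index.items()}
    shortlist = sorted(coarse, key=coarse.get, reverse=True)[:top_k]
    # Stage 2: expensive text-conditioned scoring on the shortlist only.
    fine = {vid: cosine(text, attention_pool(text, video_index[vid])) for vid in shortlist}
    return sorted(shortlist, key=fine.get, reverse=True)


query = [1.0, 0.0]
index = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.0, 1.0], [0.0, 1.0]],
    "c": [[1.0, 0.0], [1.0, 0.0]],
}
result = retrieve_with_rerank(query, index, top_k=2)
```

With this layout the number of text-conditioned pooling calls per query is bounded by `top_k` rather than by the size of the index; how well the first stage preserves the right candidates is an empirical question this sketch does not answer.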

qjyyyy commented 2 years ago

I get it! Thanks again for your patience!