microsoft / VideoX

VideoX: a collection of video cross-modal models

Processing text with specific video #81

Closed by nattikahana 1 year ago

nattikahana commented 1 year ago

Hi, I read your X-CLIP article and, first of all, I would like to say it's fascinating. Second, I would like to ask about the multi-head self-attention. My goal is to build a database of all video embeddings so that, when I search with text, the query uses the text alone. That means I can't specify which video the text should be processed with, so I would like to know whether there is a way to skip the multi-head self-attention part. Thanks a lot.

nbl97 commented 1 year ago

Thanks for your interest. The video-specific prompting is designed to enhance the text representation, which otherwise contains only the limited label information. In your project, I think the simplest way is to remove the prompting mechanism, including the multi-head attention and FFN. You may need to remove some related code manually. Please feel free to ping me if there are further questions.
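For illustration, here is a minimal sketch of the retrieval setup described in the question once the prompting step is removed: the text embedding no longer depends on any video, so video embeddings can be computed once, stored, and ranked by cosine similarity against a text-only query. This is not code from the VideoX repository. The function names (`l2_normalize`, `build_index`, `search`) and the 3-dimensional toy vectors are hypothetical placeholders; in practice the vectors would come from the model's video and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(video_embeddings):
    """Precompute normalized video embeddings once (hypothetical helper).

    video_embeddings maps a video id to its embedding vector.
    """
    return {vid: l2_normalize(emb) for vid, emb in video_embeddings.items()}


def search(index, text_embedding, top_k=3):
    """Rank stored videos against a text embedding that was computed
    without any video-specific prompting."""
    query = l2_normalize(text_embedding)
    scores = [
        (vid, sum(q * v for q, v in zip(query, emb)))
        for vid, emb in index.items()
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy data standing in for real encoder outputs (assumption, not model output).
index = build_index({
    "cooking.mp4": [0.9, 0.1, 0.0],
    "soccer.mp4": [0.0, 1.0, 0.2],
    "piano.mp4": [0.1, 0.0, 1.0],
})
results = search(index, [0.1, 0.9, 0.1], top_k=2)
for video_id, score in results:
    print(video_id, round(score, 3))
```

The key design point is that `search` takes only the text embedding as input. With video-specific prompting left in place, the text embedding would have to be recomputed for every candidate video, which rules out a precomputed index like this one.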