niluthpol / multimodal_vtt

Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

Clarification #16

Closed AmeenAli closed 4 years ago

AmeenAli commented 4 years ago

Hi,
I would appreciate some clarification regarding the MSR-VTT dataset.
As far as I can see, each video has multiple text descriptions. Did you use only one of the descriptions per video for training?
Thanks

niluthpol commented 4 years ago

MSR-VTT has 20 descriptions per video. We have used all descriptions for training.
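
As an illustration only, using all descriptions can be read as expanding each video into one training pair per caption. The sketch below is not taken from this repository: the function name `build_training_pairs`, the dictionary layout, and the toy captions are all hypothetical, and the real data loader may organize things differently.

```python
from typing import Dict, List, Tuple


def build_training_pairs(
    captions_by_video: Dict[str, List[str]]
) -> List[Tuple[str, str]]:
    """Expand each video into one (video_id, caption) pair per caption.

    Hypothetical helper: the input layout is an assumption, not the
    repository's actual annotation format.
    """
    pairs = []
    for video_id, captions in captions_by_video.items():
        for caption in captions:
            pairs.append((video_id, caption))
    return pairs


# Toy data (made up); real MSR-VTT videos have 20 captions each.
toy_captions = {
    "video0": ["a man is singing", "a person sings on stage"],
    "video1": ["a cat plays with a ball"],
}

pairs = build_training_pairs(toy_captions)
print(len(pairs))  # 3
```

With 20 captions per video, this kind of expansion yields 20 training pairs for every video, rather than a single pair from one chosen description.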