tingchihc opened this issue 2 years ago (status: Open)
Hi @ting-chih, sorry for the delayed reply. The model still needs both T and V as inputs, but either one can be masked out if you only want to feed the other. For example, for video-only input, T is just [CLS][SEP]; for text-only input, V is all zeros. Best~
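In case it helps, here is a rough sketch of what those two placeholder inputs could look like. This is not code from the UniVL repo: the helper names `text_inputs` and `video_inputs`, the `[PAD]` token, and the choice to set the video mask to all zeros for text-only input are my own assumptions for illustration, so please check them against the actual data loader before relying on them.

```python
from typing import List, Sequence, Tuple

CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"  # PAD token name is an assumption


def text_inputs(tokens: Sequence[str], max_words: int) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: wrap tokens in [CLS] ... [SEP] and pad to max_words.

    With an empty token list, T is just [CLS][SEP] (the video-only case).
    Assumes max_words >= 2.
    """
    seq = [CLS] + list(tokens)[: max_words - 2] + [SEP]
    mask = [1] * len(seq)
    pad = max_words - len(seq)
    return seq + [PAD] * pad, mask + [0] * pad


def video_inputs(
    features: Sequence[Sequence[float]], max_frames: int, dim: int
) -> Tuple[List[List[float]], List[int]]:
    """Hypothetical helper: pad frame features with zero vectors to max_frames.

    With no features, V is all zeros (the text-only case). Setting the mask
    to all zeros in that case is an assumption, not something stated above.
    """
    feats = [list(f) for f in features][:max_frames]
    mask = [1] * len(feats)
    pad = max_frames - len(feats)
    return feats + [[0.0] * dim for _ in range(pad)], mask + [0] * pad


# Video-only: no text, so T collapses to [CLS][SEP] plus padding.
t_tokens, t_mask = text_inputs([], max_words=6)

# Text-only: no video, so V is an all-zero feature matrix.
v_feats, v_mask = video_inputs([], max_frames=4, dim=3)
```

The token strings would still need to be converted to ids with the tokenizer the model uses, and the lists to tensors, before being passed to the model.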
I want to input only the text feature or only the video feature into UniVL. The paper says that a single transformer combines the text representation T and the video representation V. Could you tell me how to change the model so that only T or only V is fed into UniVL? Thanks.
Hi! Do you know how to download the raw videos of YouCook2? Thank you very much!