microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

Joint loss in pretraining #21

Open zhangliang-04 opened 3 years ago

zhangliang-04 commented 3 years ago

Hi, we found that the video-text joint loss in pretraining is calculated from the masked video and text. Why not use the original (unmasked) video and text, as in retrieval fine-tuning? https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/modules/modeling.py#L258

ArrowLuo commented 3 years ago

Hi @zhangliang-04, we use the masked sequences for consistency with the other losses. An elaborate design for the retrieval task may benefit from a non-masked version; however, we have not tested it. It might improve performance further.
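For concreteness, the distinction discussed above can be sketched with a symmetric NCE-style alignment loss. This is a minimal NumPy illustration, not the repository's actual implementation; the function names, the use of plain dot-product similarity, and the hypothetical `enc` encoder mentioned in the comments are assumptions for exposition only.

```python
import numpy as np

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along an axis, keeping dims."""
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def joint_nce_loss(text_emb, video_emb):
    """Symmetric NCE-style alignment loss over a batch: matched
    text/video pairs sit on the diagonal of the similarity matrix."""
    sim = text_emb @ video_emb.T            # (B, B) pairwise similarities
    t2v = sim - logsumexp(sim, axis=1)      # text -> video log-softmax (rows)
    v2t = sim - logsumexp(sim, axis=0)      # video -> text log-softmax (cols)
    idx = np.arange(sim.shape[0])
    return -0.5 * (t2v[idx, idx].mean() + v2t[idx, idx].mean())

# The question in this issue is only about which inputs the loss sees,
# not about the loss itself (here `enc` is a hypothetical encoder):
#   loss_masked   = joint_nce_loss(enc(masked_text), enc(masked_video))
#   loss_unmasked = joint_nce_loss(enc(text), enc(video))
# Pretraining uses the masked variant for consistency with the masked
# language/frame objectives; retrieval fine-tuning uses unmasked inputs.
```

The trade-off is that masked inputs keep all pretraining objectives on one forward pass, while unmasked inputs match the retrieval-time distribution; the reply above notes the unmasked variant was not tested.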