Open zhangliang-04 opened 3 years ago
Hi @zhangliang-04, we use the masked sequences to stay consistent with the other losses. A design tailored to the retrieval task may benefit from a non-masked version; however, we have not tested it. It might improve performance further.
Hi, we found that the video-text joint loss in pretraining is calculated from the masked video and text. Why not use the original video and text, as in retrieval fine-tuning? https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/modules/modeling.py#L258
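
For anyone following the thread, here is a minimal sketch of the two options being compared: a joint (contrastive) loss computed from masked inputs versus the original inputs. It is not the UniVL code. The helper names (`mask_sequence`, `mean_pool`, `contrastive_loss`, `joint_loss`), the zero-out masking scheme, the mean pooling, and the symmetric InfoNCE-style loss are all assumptions chosen to keep the example in plain Python.

```python
import math
import random


def mask_sequence(features, mask_prob, rng):
    """Zero out each feature vector with probability mask_prob.

    Hypothetical masking scheme, used only for illustration.
    """
    return [[0.0] * len(f) if rng.random() < mask_prob else list(f)
            for f in features]


def mean_pool(features):
    """Average a sequence of feature vectors into one embedding."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(video_embs, text_embs):
    """Symmetric InfoNCE-style loss; matched pairs sit on the diagonal."""
    n = len(video_embs)
    sim = [[dot(v, t) for t in text_embs] for v in video_embs]

    def directional(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for a stable log-sum-exp
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[i]
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (directional(sim) + directional(cols))


def joint_loss(batch_video, batch_text, mask_prob, rng):
    """Joint loss on inputs masked with mask_prob (0.0 means original inputs)."""
    video_embs = [mean_pool(mask_sequence(v, mask_prob, rng)) for v in batch_video]
    text_embs = [mean_pool(mask_sequence(t, mask_prob, rng)) for t in batch_text]
    return contrastive_loss(video_embs, text_embs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy batch: two video/text pairs, two steps each, 2-d features.
    batch_video = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
    batch_text = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
    print("masked inputs:  ", joint_loss(batch_video, batch_text, 0.15, rng))
    print("original inputs:", joint_loss(batch_video, batch_text, 0.0, rng))
```

In this toy setup, switching between the two variants is just a change of `mask_prob`; whether the non-masked variant actually helps retrieval would need to be checked experimentally, as noted in the reply.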