microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

CLIP #4

Closed: johnbager closed this issue 3 years ago

johnbager commented 3 years ago

Have you tried using CLIP to generate video captions? I think it would be useful.

ArrowLuo commented 3 years ago

@johnbager We have not run any tests using CLIP for video captioning, but it is a direction worth trying. The main concern is that CLIP's text encoder uses a causal attention mask, so the video (or image) features aligned with it may be insufficient for a caption-generation task. However, this is just my conjecture.
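For reference, a minimal sketch (not part of this repo) of the direction being discussed: extracting per-frame features with CLIP's image encoder so they could serve as the video input to a downstream captioning model. It assumes the openai/CLIP package (`pip install git+https://github.com/openai/CLIP.git`) and that video frames are already decoded as PIL images; the `encode_frames` helper and the dummy frames are hypothetical, and the causal-mask concern about the text encoder is left aside here since only the image tower is used.

```python
# Sketch: encode video frames with CLIP's image encoder to obtain
# (num_frames, dim) features that a captioning model could consume
# in place of S3D-style video features. Hypothetical example, not the
# UniVL pipeline.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def encode_frames(frames):
    """Encode a list of PIL.Image frames into a (num_frames, dim) tensor."""
    batch = torch.stack([preprocess(f) for f in frames]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)                  # (num_frames, 512) for ViT-B/32
        feats = feats / feats.norm(dim=-1, keepdim=True)   # L2-normalize, as CLIP does
    return feats

# Dummy 3-frame clip; replace with frames decoded from a real video.
dummy_frames = [Image.new("RGB", (224, 224)) for _ in range(3)]
video_features = encode_frames(dummy_frames)
print(video_features.shape)  # torch.Size([3, 512])
```

Whether such features are sufficient for generation (as opposed to retrieval) is exactly the open question raised above, since CLIP was trained with a contrastive objective and a causally masked text encoder rather than a captioning objective.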