Closed johnbager closed 3 years ago
@johnbager We have not run any tests on using CLIP for video captioning. It is a direction worth trying. The main problem is that CLIP's text encoder encodes text with a causal attention mask. As a result, the video (or image) features may be insufficient for the captioning task. However, this is just my conjecture.
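To illustrate what a causal attention mask means here: each token position may attend only to itself and to earlier positions. The snippet below is a minimal pure-Python sketch of that pattern, not CLIP's actual implementation; `causal_mask` is a hypothetical helper name used only for illustration.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when position i is allowed to attend to position j,
    which under a causal mask means j <= i (no looking at later tokens).
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Print the mask for a 4-token sequence: 1 = allowed, 0 = blocked.
    for row in causal_mask(4):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

The mask is lower-triangular, so the representation of an early token never sees the tokens that come after it.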
Have you tried using CLIP to generate video captions? I think it would be useful.