simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

How can I use this model to generate video captions for a raw dataset? #45

Closed Karshilov closed 2 years ago

Karshilov commented 2 years ago

Background

I'm new to AI, and I need to do video captioning on a dataset that contains several videos; each video comes with a few sentences describing it as a whole (not per segment).

Problem

This raw dataset has no ready-made features, and the pretrained models won't work because the captions need to be in Chinese. All I have are the videos and the sentences. How can I use this model to solve this problem? Or, if it's too hard, just give me a negative response.

Thanks for your answer.

simon-ging commented 2 years ago

Hi, the COOT model works best when you have video segments and one sentence per segment. Since you don't have that, it may not make sense to train it. I suggest using this S3D model https://github.com/antoine77340/S3D_HowTo100M (or any other video-pretrained model) to extract features. Then use MART to generate one sentence at a time, either with our version or with the original authors' version here: https://github.com/jayleicn/recurrent-transformer
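For reference, feature extraction with that S3D model typically looks like the snippet below. This is a minimal sketch, assuming the interface shown in the S3D_HowTo100M README (the `S3D` class in `s3dg.py`, the `s3d_howto100m.pth` weights, and the `s3d_dict.npy` word dictionary); check that repo for the exact preprocessing it expects (frame rate, resolution, pixel range).

```python
# Hedged sketch: clip-level feature extraction with the HowTo100M S3D model.
# Assumes the interface from https://github.com/antoine77340/S3D_HowTo100M.
import torch
from s3dg import S3D

# Load the pretrained video encoder (512-dim joint text-video embedding space).
net = S3D("s3d_dict.npy", 512)
net.load_state_dict(torch.load("s3d_howto100m.pth"))
net = net.eval()

# `clip` is a batch of RGB frames scaled to [0, 1], shape (B, 3, T, H, W),
# e.g. 32 frames at 224x224 sampled from one video segment (random here).
clip = torch.rand(1, 3, 32, 224, 224)

with torch.no_grad():
    out = net(clip)

# 'mixed_5c' are the raw video features typically fed to a downstream
# captioning model such as MART; 'video_embedding' lives in the joint space.
features = out["mixed_5c"]
print(features.shape)
```

The extracted features can then be saved per clip/segment and used as the video input to MART in place of the precomputed features shipped with the repo.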

Karshilov commented 2 years ago

Thanks for your advice, I'd like to give it a try! :)