Closed Karshilov closed 2 years ago
Hi, the COOT model works best when you have video segments with one sentence per segment. Since you don't have that, it may not make sense to train it. I suggest using the S3D model at https://github.com/antoine77340/S3D_HowTo100M, or any other pretrained video model, to extract features. Then use MART to generate one sentence at a time, with either our version or the original author's version at https://github.com/jayleicn/recurrent-transformer
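For the feature extraction step, one common way to prepare a video is to cut its frames into fixed-length clips and run the pretrained encoder once per clip, which gives one feature vector per clip. Below is a minimal sketch of just the clip-splitting part, in plain Python. The function name `split_into_clips` and the defaults of 32 frames per clip are assumptions for illustration; they are not taken from the COOT, S3D, or MART code, so check each repository for the clip length and frame rate it actually expects.

```python
from typing import List, Tuple


def split_into_clips(
    num_frames: int, clip_len: int = 32, stride: int = 32
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges covering a video, end exclusive.

    Hypothetical helper: clip_len and stride are illustrative defaults,
    not values required by any particular pretrained model.
    """
    if num_frames <= 0 or clip_len <= 0 or stride <= 0:
        raise ValueError("num_frames, clip_len and stride must be positive")

    clips = []
    start = 0
    while start < num_frames:
        # The last clip may be shorter than clip_len.
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips


if __name__ == "__main__":
    # A 100-frame video cut into non-overlapping 32-frame clips.
    for start, end in split_into_clips(100):
        print(start, end)
```

Each range would then be decoded, passed through whichever pretrained video model you pick, and the resulting per-clip features saved for the captioning model. A short final clip may need padding or dropping, depending on what the encoder accepts.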
Thanks for your advice, I'll give it a try! : )
Background
I'm new to AI, and I need to do video captioning on a dataset that contains several videos. Each video comes with some sentences that describe it (the whole video, not individual segments).
Problem
In this raw dataset I have no ready-made features, and a pretrained model won't work because the output should be in Chinese. All I have is the videos and the sentences. How can I use this model to solve this problem? Or, if it's too hard, just tell me it isn't feasible.
Thanks for your answer.