simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

Using the model to predict video caption #50

YoussefZiad commented 2 years ago

Hello, I am trying to use the model to generate captions for external .mp4 videos, and I was wondering if you could give me some pointers on how one would go about it and which functions are relevant. Thank you in advance!

simon-ging commented 2 years ago

Hi, it depends on which model you want to use (trained on ActivityNet or trained on YouCook2).

For YouCook2 see: https://github.com/gingsi/coot-videotext/issues/17

For ActivityNet we used the features provided by the authors of the CMHSE paper https://github.com/Sha-Lab/CMHSE, so you would have to look into their paper or code to find out how to extract the features. Kindly post here if you find a solution.

Best

erdeme36 commented 2 years ago

@gingsi I also want to try COOT with my own dataset. Has anyone managed to do that?

erdeme36 commented 2 years ago

Also, I mean the model trained on YouCook2.

simon-ging commented 2 years ago

I have added the feature extraction code now; see the README chapter "Running your own video dataset on the trained models". With it you can create the HowTo100M features based on mpg files.
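
For readers looking for a starting point before digging into the README, the sketch below only illustrates the general shape of such a pipeline: decode a video file, cut it into short clips, run a pretrained video backbone, and save one feature vector per clip. It is not the repository's actual extraction script, and the `r3d_18` backbone from torchvision is just a stand-in (the real HowTo100M features come from a different network), so features produced this way would not be directly compatible with the pretrained COOT checkpoints.

```python
# Hypothetical sketch, NOT the coot-videotext feature extraction code.
# Shows the general pipeline: video file -> short clips -> per-clip features.
import numpy as np
import torch
from torchvision.io import read_video
from torchvision.models.video import r3d_18, R3D_18_Weights  # stand-in backbone


def extract_clip_features(video_path: str, clip_len: int = 16) -> np.ndarray:
    # Decode the video into a (T, H, W, C) uint8 frame tensor.
    frames, _, _ = read_video(video_path, pts_unit="sec")

    weights = R3D_18_Weights.DEFAULT
    model = r3d_18(weights=weights).eval()
    model.fc = torch.nn.Identity()          # keep pooled features, drop class logits
    preprocess = weights.transforms()       # resize / normalize as the weights expect

    feats = []
    with torch.no_grad():
        for start in range(0, frames.shape[0] - clip_len + 1, clip_len):
            clip = frames[start:start + clip_len]              # (clip_len, H, W, C)
            clip = clip.permute(0, 3, 1, 2)                    # (T, C, H, W) for the transform
            clip = preprocess(clip)                            # (C, T, H', W')
            feats.append(model(clip.unsqueeze(0)).squeeze(0))  # (512,) per clip
    return torch.stack(feats).numpy()


if __name__ == "__main__":
    features = extract_clip_features("my_video.mp4")
    np.save("my_video_features.npy", features)  # one feature vector per clip
```

For the actual models, follow the README chapter mentioned above so the extracted features match what the trained checkpoints were trained on.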