simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

Running the code for videos #55

Open ShaadAkhtar opened 9 months ago

ShaadAkhtar commented 9 months ago

I am doing violence detection using video captioning. If I give your model a number of videos containing some type of violence, will it be able to describe that in the captions? For example, if a tree is on fire in a video, or if a robbery is taking place, will your model be able to produce captions such as 'A tree is on fire' or 'A robbery/armed robbery is taking place'? I don't have captions for my videos. I only have videos and images without captions, so I was hoping to generate training and testing data with the help of your model and then build my own video captioning model.

simon-ging commented 9 months ago

No. Those activities are not in the training set of ActivityNet Captions, so you would probably need to train the model yourself to do this.