simon-ging / coot-videotext

COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Apache License 2.0

About the video captioning #4

Closed · PKULiuHui closed this issue 3 years ago

PKULiuHui commented 4 years ago

Hi, thanks for your great work. I was wondering when you will release the code for video captioning, or at least the features, so that I can use the MART code to generate captions. Looking forward to it!

By the way, I have a small question about the application to video captioning. As far as I know, MART uses ground-truth event segments to generate paragraph captions. If I want to generate captions from my own predicted event segments, is there a way to use COOT features? Or, if I want to generate a single-sentence caption for a specific video clip (e.g. seconds 10-20 of a video), is there any way to use COOT features?

simon-ging commented 4 years ago

Hi, thanks for your interest!

1) We are working on releasing the captioning code; the release is planned for September.

2) We will release the features, thanks for your request. In the meantime you could simply create them yourself using our provided models or your own trained model:

In trainer.py, at the end of the validate() method, save the features "vid_emb_list" and "clip_emb_list". Also save "vid_ids" so you know which data IDs correspond to the entries in "vid_emb_list". "clip_emb_list" is a flat tensor of all clips in all videos; to know which clip belongs to which video, collect "clip_num" for each batch into a flat list "clip_nums" during the validation loop. Then, in "clip_emb_list", the clips of video N are at indices range(sum(clip_nums[:N]), sum(clip_nums[:N+1])).

E.g. if the first video has 5 clips and the second video has 3 clips, then the respective clip embeddings are saved at positions 0-4 and 5-7.
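Roughly, the saving and indexing could look like the following sketch (dummy tensors and shapes; the names vid_emb_list, clip_emb_list, vid_ids and clip_nums follow the description above, and the embedding dimension 384 is only an illustrative assumption, the actual code in trainer.py may differ):

```python
import torch

# Dummy outputs standing in for what validate() produces; the embedding
# dimension 384 is just an example here.
vid_emb_list = torch.randn(2, 384)      # one embedding per video
clip_emb_list = torch.randn(8, 384)     # flat tensor: all clips of all videos
vid_ids = ["video_0001", "video_0002"]  # hypothetical data IDs
clip_nums = [5, 3]                      # clips per video, collected per batch

# Save everything needed to reuse the features later, e.g. for MART.
torch.save(
    {"vid_ids": vid_ids, "vid_emb": vid_emb_list,
     "clip_emb": clip_emb_list, "clip_nums": clip_nums},
    "coot_features.pth",
)

def clips_of_video(clip_emb, clip_nums, n):
    """Slice the clip embeddings of the n-th video out of the flat tensor."""
    start = sum(clip_nums[:n])
    end = sum(clip_nums[:n + 1])
    return clip_emb[start:end]

print(clips_of_video(clip_emb_list, clip_nums, 0).shape)  # clips 0-4 -> (5, 384)
print(clips_of_video(clip_emb_list, clip_nums, 1).shape)  # clips 5-7 -> (3, 384)
```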

3) I'm not sure I understand what you want to do, so let's look at a single data point:

COOT requires a hierarchical input with a video consisting of several clips (just like MART).

The output will then be one low-level embedding per clip and one high-level embedding for the entire video.

If your segment predictor is good enough, a predicted hierarchy instead of the ground-truth hierarchy may work for producing meaningful embeddings to feed into MART. However, you could also just take the original MART method, replace its ground-truth hierarchy with your predictor, and see whether that works.

Do share your results if you find a good hierarchy predictor, since that would enable COOT to give good results on in-the-wild videos that come without a ground-truth hierarchy.

PKULiuHui commented 4 years ago

In fact, I just use techniques from dense video captioning to predict event segments (e.g. the ESGN model from Streamlined Dense Video Captioning, CVPR 2019). I have tested feeding the predicted segments into MART, and it obtains scores close to those obtained with ground-truth segments.

PKULiuHui commented 4 years ago

The predictor performs well on ActivityNet Captions but not on YouCookII, because the latter has more segments per video (7.7 on average).

simon-ging commented 4 years ago

Thanks for sharing! Feel free to reopen if you have more questions.

PKULiuHui commented 3 years ago

Any news about the code for captioning?

simon-ging commented 3 years ago

The release is planned for the first half of January 2021.

simon-ging commented 3 years ago

Check out the latest updates while you wait for the captioning code, which will be released within the next two weeks.

simon-ging commented 3 years ago

Video captioning code is online!