Hi, thanks for your great work. I was wondering when you will release the code for video captioning, or at least the features, so that I can use the MART code to generate captions. Looking forward to it!
By the way, I have a small question about the application to video captioning. As far as I know, MART uses ground-truth event segments to generate paragraph captions. If I want to generate captions from my own predicted event segments, is there a way to use COOT features? Or, if I want to generate a caption (a single sentence) for a specific video clip (e.g., seconds 10-20 of a video), is there a way to use COOT features?
Hi, thanks for your interest!
1) We are working on releasing the captioning code; the planned release is in September.
2) We will release the features; thanks for the request. In the meantime, you can create them yourself using our provided models or your own trained model:
In trainer.py, at the end of the validate() method, save the features "vid_emb_list" and "clip_emb_list". Also save "vid_ids" so you know which data IDs correspond to the entries in "vid_emb_list". "clip_emb_list" is a flat tensor of all clips across all videos; to know which clip belongs to which video, collect "clip_num" for each batch into a flat list "clip_nums" during the validation loop. Then, in "clip_emb_list", the clips for video N are at indices range(sum(clip_nums[:N]), sum(clip_nums[:N+1])); see the sketch below.
E.g. if the first video has 5 clips and the second video has 3 clips, then the respective clip embeddings are saved at positions 0-4 and 5-7.
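For illustration, here is a minimal sketch of how the saving and indexing could look. The names vid_emb_list, clip_emb_list, vid_ids, and clip_nums follow the description above; the file name, both helper functions, and the assumption that the embeddings arrive as already-concatenated torch tensors are mine, not the repository's actual API:

```python
import numpy as np

def save_embeddings(vid_emb_list, clip_emb_list, vid_ids, clip_nums,
                    path="coot_embeddings.npz"):
    """Save video/clip embeddings plus the bookkeeping needed to map
    each clip back to its video (names follow the description above)."""
    np.savez(path,
             vid_emb=vid_emb_list.cpu().numpy(),    # (num_videos, dim)
             clip_emb=clip_emb_list.cpu().numpy(),  # (total_clips, dim)
             vid_ids=np.array(vid_ids),             # data IDs, aligned with vid_emb
             clip_nums=np.array(clip_nums))         # clips per video, aligned with vid_emb

def clips_of_video(clip_emb, clip_nums, n):
    """Return the clip embeddings belonging to video n."""
    start = int(np.sum(clip_nums[:n]))
    return clip_emb[start:start + int(clip_nums[n])]
```

With the 5-clip/3-clip example above, clips_of_video(clip_emb, clip_nums, 1) returns rows 5-7.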
3) I'm not sure I understand what you want to do, so let's look at a single datapoint:
COOT requires a hierarchical input with a video consisting of several clips (just like MART).
The output will then be one low-level embedding per clip and one high-level embedding for the entire video.
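As a purely illustrative sketch (the field names are mine, not the repository's actual data format), such a datapoint might look like this:

```python
# One datapoint: a video split into several clips
# (ground-truth or predicted segments).
video = {
    "vid_id": "v_example",
    "clips": [
        {"start": 0.0, "end": 12.5},   # clip 1, in seconds
        {"start": 12.5, "end": 30.0},  # clip 2
    ],
}
# COOT then produces:
#   clip_embs: one low-level embedding per clip, shape (num_clips, dim)
#   video_emb: one high-level embedding for the whole video, shape (dim,)
```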
If your segment predictor is good enough, a predicted hierarchy instead of a ground-truth hierarchy may work for producing meaningful embeddings and then feeding them into MART. However, you could also just use the original MART method, replace its ground-truth hierarchy with your predictor's output, and see if it works.
Please share your results if you build a good hierarchy predictor, since that would enable COOT to give good results on in-the-wild videos that lack a ground-truth hierarchy.
Actually, I use existing techniques from dense video captioning to predict event segments (e.g., the ESGN model from Streamlined Dense Video Captioning, CVPR 2019). I have tested feeding the predicted segments into MART, and it obtains scores close to the results with ground-truth segments.
The predictor performs well on ActivityNet Captions but not on YouCookII, since the latter has more segments per video (7.7 on average).
Thanks for sharing! Feel free to reopen if you have more questions.
Any news about the code for captioning?
Release is planned for the first half of January 2021.
Check out the latest updates while you wait for the captioning code, which will be released in the next two weeks.
Video captioning code is online!