Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
Accepted by CVPR 2024
[Website] [Paper] [Dataset] [Hugging Face] [Demo]
Video ReCap is a recursive video captioning model that can process very long videos (e.g., hours long) and output video captions at multiple hierarchy levels: short-range clip captions, mid-range segment descriptions, and long-range video summaries. First, the model generates captions for short video clips of a few seconds. As we move up the hierarchy, the model uses sparsely sampled video features and the captions generated at the previous hierarchy level as inputs to produce video captions for the current hierarchy level.
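The recursion can be sketched as follows. This is a conceptual sketch only: `captioner.generate` and the 1-feature-per-second layout are hypothetical stand-ins for the actual model interface in the released code.

```python
from typing import List, Sequence

def chunks(xs: Sequence, size: int) -> List[Sequence]:
    """Split a sequence into consecutive, non-overlapping pieces."""
    return [xs[i:i + size] for i in range(0, len(xs), size)]

def recap(features: Sequence, captioner) -> tuple:
    """Hierarchical captioning: each level feeds its text into the next.

    Assumes one feature per second of video; `captioner.generate` is a
    hypothetical interface standing in for the actual model.
    """
    # Level 1: clip captions from densely sampled features (4 s clips).
    clip_caps = [captioner.generate(clip) for clip in chunks(features, 4)]

    # Level 2: segment descriptions (180 s) from sparsely sampled features
    # plus the level-1 captions inside each segment (180 / 4 = 45 captions).
    seg_descs = [
        captioner.generate(seg[::10], text=caps)  # every 10th feature = sparse
        for seg, caps in zip(chunks(features, 180), chunks(clip_caps, 45))
    ]

    # Level 3: one video summary from sparse features + level-2 descriptions.
    summary = captioner.generate(features[::60], text=seg_descs)
    return clip_caps, seg_descs, summary
```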
See installation.md to install this code.
See datasets.md for details of the Ego4D-HCap dataset and how to download it.
First, download the pretrained models from this link. Then, you can extract three levels of hierarchical captions from any video (e.g., assets/example.mp4) using our pretrained models, as shown in the demo.ipynb notebook.
We utilize the video encoder of the pretrained Dual-Encoder from LaViLa to extract features.
You can directly download the extracted features (~30 GB) from this link (coming soon).
Alternatively, you may extract the features on your own using the following steps; a conceptual sketch of the procedure follows the commands.
wget https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.baseline.ep_0003.pth
bash scripts/extract_features_segments.sh
bash scripts/extract_features_videos.sh
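Conceptually, extraction slides a fixed-length window over the video and encodes each window with the pretrained video encoder. A minimal sketch, assuming a 4-second window, one frame per second, and a generic encoder module; the scripts above are the reference implementation:

```python
import torch

@torch.no_grad()
def extract_features(frames: torch.Tensor, encoder: torch.nn.Module,
                     window_sec: int = 4, fps: int = 1) -> torch.Tensor:
    """Encode a long video window by window.

    `frames` is assumed to be a (T, C, H, W) tensor of decoded frames and
    `encoder` a module mapping a (1, t, C, H, W) clip to a feature vector.
    """
    encoder.eval()
    step = window_sec * fps                              # frames per window
    feats = []
    for start in range(0, frames.shape[0] - step + 1, step):
        clip = frames[start:start + step].unsqueeze(0)   # add batch dim
        feats.append(encoder(clip).squeeze(0).cpu())
    return torch.stack(feats)                            # (num_windows, dim)
```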
We provide our best models for both Video ReCap and Video ReCap-U.
Download the pretrained models from this link.
bash scripts/eval_video_recap.sh
bash scripts/eval_video_recap_u.sh
You should get the following numbers, where C/ R/ M denote CIDEr, ROUGE-L, and METEOR.
| Model | Clip Caption (C/ R/ M) | Segment Description (C/ R/ M) | Video Summary (C/ R/ M) | Checkpoint |
| --- | --- | --- | --- | --- |
| Video ReCap | 98.35/ 48.77/ 28.28 | 46.88/ 39.73/ 18.55 | 29.34/ 32.64/ 14.45 | download |
| Video ReCap-U | 92.67/ 47.90/ 28.08 | 45.60/ 39.33/ 18.17 | 31.06/ 33.32/ 14.16 | download |
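If you want to score caption outputs yourself, the standard pycocoevalcap package computes all three metrics. A minimal sketch on toy data; tokenization details may differ from the exact protocol used by our evaluation scripts:

```python
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor   # requires a local Java runtime
from pycocoevalcap.rouge.rouge import Rouge

# Toy example: one reference and one prediction per id.
refs = {0: ["a person washes dishes in the sink"]}
hyps = {0: ["a person is washing dishes"]}

for name, scorer in [("CIDEr", Cider()), ("ROUGE-L", Rouge()), ("METEOR", Meteor())]:
    score, _ = scorer.compute_score(refs, hyps)
    print(f"{name}: {100 * score:.2f}")  # scores reported scaled by 100
```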
We train our model on 8 V100 GPUs (32GB memory).
Video ReCap is a recursive model for hierarchical video captioning that uses captions generated at the previous level as input for the current hierarchy. We train Video ReCap utilizing the following curriculum learning strategy.
Download the pretrained Dual-Encoder from LaViLa using the following commands.
mkdir pretrained_models
cd pretrained_models
wget https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.baseline.ep_0003.pth
cd ..
First, train for 5 epochs using the clip captions data.
bash scripts/run_videorecap_clip.sh
Then extract captions at 4-second intervals over the whole video using the clip captioning model trained in step 1. Replace 'captions_pred' in the train and val metadata with the generated captions from the appropriate time windows (see datasets.md for more details; a sketch of this metadata update follows the command below).
bash scripts/extract_captions.sh
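A minimal sketch of the metadata update. The file names and JSON layout here are hypothetical; datasets.md documents the real format:

```python
import json

# Assumed layout: one list of timed predictions per video id.
with open("predicted_captions.json") as f:
    preds = json.load(f)  # {video_id: [{"start": s, "end": e, "text": t}, ...]}

with open("train_metadata.json") as f:
    meta = json.load(f)   # assumed: a list of annotation dicts

for item in meta:
    start, end = item["start_sec"], item["end_sec"]
    # Keep only the generated captions that fall inside this item's window.
    item["captions_pred"] = [
        p["text"] for p in preds[item["video_id"]]
        if start <= p["start"] and p["end"] <= end
    ]

with open("train_metadata.json", "w") as f:
    json.dump(meta, f)
```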
Initialize from the Video ReCap clip checkpoint and train for 10 epochs using the segment descriptions.
bash scripts/run_videorecap_segment.sh
Extract segment descriptions at 180-second intervals over the whole video using the segment description model trained in step 3. Replace 'segment_descriptions_pred' in the train and val metadata with the generated descriptions from the appropriate time windows (see datasets.md for more details).
bash scripts/extract_segment_descriptions.sh
Finally, initialize from the Video ReCap segment checkpoint and train for 10 epochs using the video summaries.
bash scripts/run_videorecap_video.sh
We train our model on 8 V100 GPUs (32GB memory).
While Video ReCap trains three different sets of trainable parameters for the three hierarchies, Video ReCap-U trains only one set of trainable parameters. The following curriculum learning scheme with an alternate batching technique allows us to train a unified model and avoid catastrophic forgetting.
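The alternate batching can be sketched as a round-robin over per-level loaders. This is a conceptual sketch, not the repository's trainer:

```python
from itertools import cycle

def alternate_batches(loaders, num_steps):
    """Yield (level, batch) round-robin across per-level loaders.

    `loaders` holds one iterable of batches per hierarchy level, e.g.
    [clip_loader, segment_loader] in stage 2 and all three in stage 3.
    """
    its = [cycle(loader) for loader in loaders]
    for step in range(num_steps):
        level = step % len(its)       # 0: clip, 1: segment, 2: video
        yield level, next(its[level])

# Usage sketch: every consecutive batch comes from a different level,
# so the single set of trainable parameters keeps seeing all levels.
# for level, batch in alternate_batches([clip_dl, seg_dl, video_dl], 3000):
#     loss = model(batch, level)
```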
Download the pretrained Dual-Encoder from LaViLa using the following commands.
mkdir pretrained_models
cd pretrained_models
wget https://dl.fbaipublicfiles.com/lavila/checkpoints/dual_encoders/ego4d/clip_openai_timesformer_base.baseline.ep_0003.pth
cd ..
The first stage is the same as for the Video ReCap model: we train for 5 epochs using the clip captions data.
bash scripts/run_videorecap_clip.sh
Then extract captions at 4-second intervals over the whole video using the clip captioning model trained in step 1. Replace 'captions_pred' in the train and val metadata with the generated captions from the appropriate time windows (see datasets.md for more details).
bash scripts/extract_captions.sh
Second, we initialize from the Video ReCap clip checkpoint and train for 10 epochs using the segment descriptions along with some clip captions data. We sample clip captions and segment descriptions alternately at each batch.
bash scripts/run_videorecap_clip.sh
Extract segment descriptions at 180-second intervals over the whole video using the segment description model trained in step 3.
bash scripts/extract_segment_descriptions.sh
Finally, we initialize from the Video ReCap segment checkpoint and train for 10 epochs using the video summaries along with some segment descriptions and clip captions data. We sample data from all three hierarchies alternately at each batch.
bash scripts/run_videorecap_clip.sh
Coming soon!
@article{islam2024video,
title={Video ReCap: Recursive Captioning of Hour-Long Videos},
author={Islam, Md Mohaiminul and Ho, Ngan and Yang, Xitong and Nagarajan, Tushar and
Torresani, Lorenzo and Bertasius, Gedas},
journal={arXiv preprint arXiv:2402.13250},
year={2024}
}