In this work, we propose a training-free, zero-shot video temporal grounding approach that leverages the capabilities of pre-trained large models. Without any training, our method achieves the best zero-shot performance on the Charades-STA and ActivityNet Captions datasets and demonstrates better generalization in cross-dataset and OOD settings.

Our paper has been accepted at ECCV 2024.
To reproduce the results in the paper, we provide the pre-extracted VLM features in this link and the LLM outputs in `dataset/charades-sta/llm_outputs.json` and `dataset/activitynet/llm_outputs.json`. Please download the pre-extracted features and configure their paths in `data_configs.py`.
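The exact layout of `data_configs.py` is defined by the repository; as a purely illustrative sketch (the names `DATASETS` and `feature_path` below are assumptions, not the file's actual identifiers), pointing the configs at the downloaded features might look like this:

```python
# Hypothetical sketch of data_configs.py -- mirror the file's real structure.
# "DATASETS" and "feature_path" are illustrative names, not confirmed identifiers.
DATASETS = {
    "charades": {
        "feature_path": "/path/to/downloaded/charades_features",
    },
    "activitynet": {
        "feature_path": "/path/to/downloaded/activitynet_features",
    },
}
```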
```bash
# Charades-STA dataset
python evaluate.py --dataset charades --llm_output dataset/charades-sta/llm_outputs.json

# ActivityNet dataset
python evaluate.py --dataset activitynet --llm_output dataset/activitynet/llm_outputs.json
```
| Dataset | IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU |
|---|---|---|---|---|
| Charades-STA | 67.04 | 49.97 | 24.32 | 44.51 |
| ActivityNet | 49.34 | 27.02 | 13.39 | 34.10 |
```bash
# Charades-STA OOD-1
python evaluate.py --dataset charades --split OOD-1

# Charades-STA OOD-2
python evaluate.py --dataset charades --split OOD-2

# ActivityNet OOD-1
python evaluate.py --dataset activitynet --split OOD-1

# ActivityNet OOD-2
python evaluate.py --dataset activitynet --split OOD-2
```
| Dataset | IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU |
|---|---|---|---|---|
| Charades-STA OOD-1 | 66.05 | 45.91 | 20.78 | 43.05 |
| Charades-STA OOD-2 | 65.75 | 43.79 | 19.95 | 42.62 |
| ActivityNet OOD-1 | 43.87 | 20.41 | 11.25 | 31.72 |
| ActivityNet OOD-2 | 40.97 | 18.54 | 10.03 | 30.33 |
```bash
# Charades-CD test-ood
python evaluate.py --dataset charades --split test-ood

# Charades-CG novel-composition
python evaluate.py --dataset charades --split novel-composition

# Charades-CG novel-word
python evaluate.py --dataset charades --split novel-word
```
| Dataset | IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU |
|---|---|---|---|---|
| Charades-CD test-ood | 65.07 | 49.24 | 23.05 | 44.01 |
| Charades-CG novel-composition | 61.53 | 43.84 | 18.68 | 40.19 |
| Charades-CG novel-word | 68.49 | 56.26 | 28.49 | 46.90 |
To extract video features for your own dataset, please run `feature_extraction.py`:

```bash
python feature_extraction.py --input_root VIDEO_PATH --save_root FEATURE_SAVE_PATH
```
Then add your dataset to `data_configs.py`; you may need to adjust `stride` and `max_stride_factor` to achieve better performance. For the format of the annotation file, refer to `dataset/charades-sta/test_trivial.json`. A hypothetical sketch of such an entry is shown below.
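The following is only a guess at what a new entry might look like: `stride` and `max_stride_factor` are the parameters named in this README, while every other field name is an illustrative assumption — copy the real charades/activitynet entries in `data_configs.py` rather than this sketch:

```python
# Hypothetical new entry for data_configs.py -- only "stride" and
# "max_stride_factor" are parameter names taken from this README; every
# other field name here is an illustrative assumption.
MY_DATASET_CONFIG = {
    "feature_path": "FEATURE_SAVE_PATH",                # output of feature_extraction.py
    "annotation_file": "dataset/my_dataset/test.json",  # same format as test_trivial.json
    "stride": 20,              # frame-sampling stride; tune per dataset
    "max_stride_factor": 0.5,  # cap on stride scaling; tune together with stride
}
```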
To test the performance with only the VLM, please run:

```bash
python evaluate.py --dataset DATASET --split SPLIT
```

`DATASET` and `SPLIT` are the dataset name and split that you added in `data_configs.py`.
To obtain the LLM outputs, please run:

```bash
python get_llm_outputs.py --api_key API_KEY --input_file ANNOTATION_FILE --output_file LLM_OUTPUT_FILE
```

We have implemented models from OpenAI, Google, and Groq. You can specify the model family with `--model_type` and select a specific model with `--model_name`. You will need an API key for the corresponding provider and the matching dependency installed (`openai`, `google-generativeai`, or `groq`); see the sketch below.
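For example (the accepted values of `--model_type` and `--model_name` depend on the script; `openai` and `gpt-4o` below are illustrative guesses, not confirmed options):

```bash
# Install the dependency for your chosen provider.
pip install openai               # OpenAI models
pip install google-generativeai  # Google models
pip install groq                 # Groq models

# Illustrative invocation -- the --model_type/--model_name values are assumptions.
python get_llm_outputs.py --api_key API_KEY \
    --model_type openai --model_name gpt-4o \
    --input_file ANNOTATION_FILE --output_file LLM_OUTPUT_FILE
```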
Then, to test the performance with the LLM outputs, please run:

```bash
python evaluate.py --dataset DATASET --split SPLIT --llm_output LLM_OUTPUT_FILE
```