Training-free Zero-Shot Video Temporal Grounding using Large-scale Pre-trained Models

In this work, we propose a training-free zero-shot video temporal grounding approach that leverages the capabilities of large pre-trained models. Without any training, our method achieves the best zero-shot video temporal grounding performance on the Charades-STA and ActivityNet Captions datasets and generalizes better in cross-dataset and out-of-distribution (OOD) settings.

Our paper has been accepted to ECCV 2024.

[Figure: pipeline]

Quick Start

Requirements

pytorch
torchvision
tqdm
salesforce-lavis
scikit-learn
json5

Data Preparation

To reproduce the results in the paper, we provide the pre-extracted VLM features at this link and the LLM outputs in dataset/charades-sta/llm_outputs.json and dataset/activitynet/llm_outputs.json. Please download the pre-extracted features and set their path in data_configs.py.

Main Results

Standard Split

# Charades-STA dataset
python evaluate.py --dataset charades --llm_output dataset/charades-sta/llm_outputs.json

# ActivityNet dataset
python evaluate.py --dataset activitynet --llm_output dataset/activitynet/llm_outputs.json

Dataset	IoU=0.3	IoU=0.5	IoU=0.7	mIoU
Charades-STA	67.04	49.97	24.32	44.51
ActivityNet	49.34	27.02	13.39	34.10
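
For readers new to these metrics: in temporal grounding, a column such as IoU=0.5 is conventionally the percentage of queries whose predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.5, and mIoU is the mean IoU over all queries. The sketch below illustrates that convention on toy data. It is not taken from evaluate.py; the function names and the example segments are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def summarize(
    preds: List[Segment],
    gts: List[Segment],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Percentage of queries at or above each IoU threshold, plus mean IoU."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    result = {
        f"IoU={t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    result["mIoU"] = 100.0 * sum(ious) / n
    return result


if __name__ == "__main__":
    # Toy predictions and ground truths (hypothetical values, in seconds).
    preds = [(2.0, 8.0), (0.0, 5.0), (10.0, 12.0)]
    gts = [(2.0, 10.0), (4.0, 9.0), (20.0, 25.0)]
    for name, value in summarize(preds, gts).items():
        print(f"{name}: {value:.2f}")
```
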

OOD Splits

# Charades-STA OOD-1
python evaluate.py --dataset charades --split OOD-1

# Charades-STA OOD-2
python evaluate.py --dataset charades --split OOD-2

# ActivityNet OOD-1
python evaluate.py --dataset activitynet --split OOD-1

# ActivityNet OOD-2
python evaluate.py --dataset activitynet --split OOD-2

Dataset	IoU=0.3	IoU=0.5	IoU=0.7	mIoU
Charades-STA OOD-1	66.05	45.91	20.78	43.05
Charades-STA OOD-2	65.75	43.79	19.95	42.62
ActivityNet OOD-1	43.87	20.41	11.25	31.72
ActivityNet OOD-2	40.97	18.54	10.03	30.33

# Charades-CD test-ood
python evaluate.py --dataset charades --split test-ood

# Charades-CG novel-composition
python evaluate.py --dataset charades --split novel-composition

# Charades-CG novel-word
python evaluate.py --dataset charades --split novel-word

Dataset	IoU=0.3	IoU=0.5	IoU=0.7	mIoU
Charades-CD test-ood	65.07	49.24	23.05	44.01
Charades-CG novel-composition	61.53	43.84	18.68	40.19
Charades-CG novel-word	68.49	56.26	28.49	46.90
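
To run every evaluation listed above in one go, a small wrapper script can help. The sketch below only reuses the evaluate.py flags shown in this README (--dataset, --split, --llm_output) and mirrors the commands as written, so only the standard-split runs pass --llm_output. The script itself, its RUNS table, and its --run switch are hypothetical helpers and not part of the repository. By default it only prints the commands.

```python
import subprocess
import sys
from typing import List, Optional, Tuple

# (dataset, split, llm_output) triples copied from the commands in this README.
# split=None means the standard split; llm_output=None means VLM only.
RUNS: List[Tuple[str, Optional[str], Optional[str]]] = [
    ("charades", None, "dataset/charades-sta/llm_outputs.json"),
    ("activitynet", None, "dataset/activitynet/llm_outputs.json"),
    ("charades", "OOD-1", None),
    ("charades", "OOD-2", None),
    ("activitynet", "OOD-1", None),
    ("activitynet", "OOD-2", None),
    ("charades", "test-ood", None),
    ("charades", "novel-composition", None),
    ("charades", "novel-word", None),
]


def build_command(
    dataset: str,
    split: Optional[str] = None,
    llm_output: Optional[str] = None,
) -> List[str]:
    """Assemble the argv list for one evaluate.py invocation."""
    cmd = [sys.executable, "evaluate.py", "--dataset", dataset]
    if split is not None:
        cmd += ["--split", split]
    if llm_output is not None:
        cmd += ["--llm_output", llm_output]
    return cmd


def main(dry_run: bool = True) -> None:
    for dataset, split, llm_output in RUNS:
        cmd = build_command(dataset, split, llm_output)
        print(" ".join(cmd))
        if not dry_run:
            # Stop at the first failing run instead of silently continuing.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Pass --run (a flag of this helper only) to actually execute the commands.
    main(dry_run="--run" not in sys.argv)
```

Run it from the repository root so that evaluate.py and the dataset/ paths resolve.
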

Test on Custom Datasets

Feature Extraction

Please run feature_extraction.py to extract the video features for your dataset:

python feature_extraction.py --input_root VIDEO_PATH --save_root FEATURE_SAVE_PATH

Data Configuration

Please add your dataset to data_configs.py. You may need to adjust stride and max_stride_factor to achieve better performance.

For the annotation file format, refer to dataset/charades-sta/test_trivial.json.
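
Because the annotation schema is defined by the reference file rather than spelled out here, it can help to print that file's structure before writing your own. The sketch below is a generic inspector that makes no assumption about the field names; it only assumes the file is strict JSON. The describe helper is hypothetical and not part of the repository.

```python
import json
from pathlib import Path
from typing import Any, List


def describe(value: Any, depth: int = 0, max_depth: int = 3) -> List[str]:
    """Return indented lines summarizing the types and keys of a JSON value."""
    indent = "  " * depth
    if isinstance(value, dict):
        lines = [f"{indent}dict with {len(value)} keys"]
        if depth < max_depth:
            # Show only the first few keys to keep the summary short.
            for key in list(value)[:3]:
                lines.append(f"{indent}  key {key!r}:")
                lines.extend(describe(value[key], depth + 2, max_depth))
        return lines
    if isinstance(value, list):
        lines = [f"{indent}list with {len(value)} items"]
        if value and depth < max_depth:
            # The first item is taken as representative of the rest.
            lines.extend(describe(value[0], depth + 1, max_depth))
        return lines
    return [f"{indent}{type(value).__name__}: {value!r}"]


if __name__ == "__main__":
    path = Path("dataset/charades-sta/test_trivial.json")
    if path.exists():
        with path.open(encoding="utf-8") as f:
            print("\n".join(describe(json.load(f))))
    else:
        print(f"{path} not found; run this from the repository root.")
```
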

Test without LLM

To test the performance with only the VLM, please run:

python evaluate.py --dataset DATASET --split SPLIT

DATASET and SPLIT are the dataset name and split that you added to data_configs.py.

Test with LLM

To obtain the LLM outputs, please run:

python get_llm_outputs.py --api_key API_KEY --input_file ANNOTATION_FILE --output_file LLM_OUTPUT_FILE

We have implemented support for models from OpenAI, Google, and Groq. You can choose the provider with --model_type and a specific model with --model_name. You will need an API key for the corresponding provider and its client library, such as openai, google-generativeai, or groq.

To test the performance, please run:

python evaluate.py --dataset DATASET --split SPLIT --llm_output LLM_OUTPUT_FILE
