houzhijian / GroundNLQ

The champion solution for Ego4D Natural Language Queries Challenge in CVPR 2023
MIT License

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

Technical report

TL;DR: GroundNLQ won first place in the Ego4D Natural Language Queries Challenge at CVPR 2023. We use a two-stage pre-training strategy to train the egocentric feature extractors and the grounding model on video narrations, then fine-tune the model on annotated data. We also introduce a novel grounding model, GroundNLQ, which employs a multi-modal multi-scale grounding module to fuse video and text effectively and to handle a range of temporal intervals, especially in long videos.

This repo supports data pre-processing, training, and evaluation on the Ego4D-NLQ dataset.



The structure of this code repo is heavily inspired by Detectron2. Some of the main components are


We adopt distributed data parallel (DDP) and fault-tolerant distributed training with torchrun.
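For readers new to torchrun: it starts one process per worker and passes each process its position through the environment variables `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below only shows how a script can read them. `DistEnv` and `read_dist_env` are hypothetical names for illustration and are not part of this repo.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (usually the GPU index)
    world_size: int  # total number of processes in the job


def read_dist_env(environ=None) -> DistEnv:
    """Read the variables torchrun exports; fall back to a single process."""
    env = os.environ if environ is None else environ
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    info = read_dist_env()
    print(f"process {info.rank} of {info.world_size}, local device {info.local_rank}")
```

Run without torchrun, the defaults describe a single-process job, so the same script can be debugged on one GPU.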


Pre-training on the video narrations can be launched by running the following command:

bash tools/pretrain_ego4d_narration.sh CONFIG_FILE OUTPUT_PATH 

where CONFIG_FILE is the config file that sets the model and dataset hyperparameters, and OUTPUT_PATH is a name of your choice for the model output directory.

The checkpoints and other experiment log files will be written into ckpt/OUTPUT_PATH. For more configurable options, please check our config file libs/core/config.py.

The actual command used in the experiments is

bash tools/pretrain_ego4d_narration.sh configs/ego4d_narrations_nlq_internvideo.yaml pretrain
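Checkpoints land in ckpt/OUTPUT_PATH, and later stages need the path to one of them. A minimal sketch for picking the most recent one follows. It assumes the files are named `epoch_XXX.pth.tar`, as in the pretrained-weights path used in the fine-tuning command in this README. `latest_checkpoint` is a hypothetical helper, not a function from this repo.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming scheme, e.g. "epoch_005.pth.tar".
_EPOCH_RE = re.compile(r"^epoch_(\d+)\.pth\.tar$")


def latest_checkpoint(ckpt_dir) -> Optional[Path]:
    """Return the checkpoint with the highest epoch number, or None if there is none."""
    best_path = None
    best_epoch = -1
    for path in Path(ckpt_dir).glob("epoch_*.pth.tar"):
        match = _EPOCH_RE.match(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = path
    return best_path


if __name__ == "__main__":
    # "ckpt/pretrain" matches the OUTPUT_PATH used in the command above.
    print(latest_checkpoint("ckpt/pretrain"))
```

The epoch number is compared as an integer, so the result does not depend on zero-padding in the file names.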


Training from scratch on two GPUs can be launched by running the following command:

bash tools/train_ego4d_finetune_head_twogpu.sh CONFIG_FILE OUTPUT_PATH CUDA_DEVICE_ID

where CUDA_DEVICE_ID is the CUDA device ID (a comma-separated list such as 0,1 when using two GPUs).

The actual command used in the experiments is

bash tools/train_ego4d_finetune_head_twogpu.sh configs/ego4d_nlq_v2_internvideo_1e-4.yaml scratch_2gpu 0,1


Fine-tuning from the pretrained weights on one GPU can be launched by running the following command:

bash tools/train_ego4d_finetune_head_onegpu.sh CONFIG_FILE RESUME_PATH OUTPUT_PATH CUDA_DEVICE_ID

where RESUME_PATH is the path of the pretrained model weights.

The actual commands used in the experiments are

bash tools/train_ego4d_finetune_head_onegpu.sh configs/ego4d_nlq_v2_pretrain_finetune_internvideo_2.5e-5.yaml \
/s1_md0/leiji/v-zhijian/ego4d_nlq_cvpr_2023_data/pretrain_weights/epoch_005.pth.tar finetune_1gpu 0
bash tools/train_ego4d_finetune_head_onegpu.sh configs/ego4d_nlq_v2_pretrain_finetune_internvideo_2.5e-5_train+val.yaml \
/s1_md0/leiji/v-zhijian/ego4d_nlq_cvpr_2023_data/pretrain_weights/epoch_005.pth.tar finetune_1gpu 0


Once the model is trained, you can use the following command for inference:

bash tools/inference_ego4d_nlq.sh CONFIG_FILE CHECKPOINT_PATH CUDA_DEVICE_ID 

where CHECKPOINT_PATH is the path to the saved checkpoint.

| Method | R@1, IoU=0.3 | R@5, IoU=0.3 | R@1, IoU=0.5 | R@5, IoU=0.5 |
| --- | --- | --- | --- | --- |
| GroundNLQ (from scratch) | 16.74 | 39.02 | 11.47 | 27.39 |
| GroundNLQ (finetune) | 26.98 | 53.56 | 18.83 | 40.00 |
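For reference, R@k at a given IoU threshold counts a query as a hit when any of its top-k predicted time windows overlaps the ground-truth window with temporal IoU at or above the threshold. The sketch below illustrates that standard definition. It is not the evaluation code of this repo, and the function names and the `>=` comparison are assumptions.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    predictions: Sequence[Sequence[Segment]],  # per query, ranked best first
    ground_truths: Sequence[Segment],          # one ground-truth window per query
    k: int,
    iou_threshold: float,
) -> float:
    """Percentage of queries with a top-k prediction at or above the IoU threshold."""
    if not ground_truths:
        return 0.0
    hits = 0
    for ranked, gt in zip(predictions, ground_truths):
        if any(temporal_iou(p, gt) >= iou_threshold for p in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(ground_truths)


if __name__ == "__main__":
    # Toy data: two queries with two ranked predictions each.
    preds = [[(0.0, 10.0), (4.0, 14.0)], [(20.0, 30.0), (0.0, 1.0)]]
    gts = [(5.0, 15.0), (100.0, 110.0)]
    print(recall_at_k(preds, gts, k=1, iou_threshold=0.3))
    print(recall_at_k(preds, gts, k=5, iou_threshold=0.5))
```

In the toy data, the first query's top prediction has IoU 1/3 with its ground truth, so it counts at IoU=0.3 for R@1 but needs the second-ranked window to count at IoU=0.5.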


We ensemble the predictions of multiple models as a post-processing step to improve performance for the leaderboard submission. The actual command used in the experiments is

python ensemble.py
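This README does not describe what `ensemble.py` does internally. As a generic illustration only, one common way to ensemble temporal grounding predictions is to pool the scored windows from all models, sort them by score, and drop near-duplicates with temporal non-maximum suppression. Everything in the sketch below (the names, `top_k`, and the `nms_iou` threshold) is an assumption, not this repo's method.

```python
from typing import List, Sequence, Tuple

Prediction = Tuple[float, float, float]  # (start_sec, end_sec, score)


def _iou(a: Prediction, b: Prediction) -> float:
    """Temporal IoU of two predictions, ignoring their scores."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def ensemble_predictions(
    per_model: Sequence[Sequence[Prediction]],  # one prediction list per model
    top_k: int = 5,
    nms_iou: float = 0.7,  # hypothetical suppression threshold
) -> List[Prediction]:
    """Pool predictions for one query, then keep the top_k after temporal NMS."""
    pooled = sorted(
        (p for preds in per_model for p in preds),
        key=lambda p: p[2],
        reverse=True,
    )
    kept: List[Prediction] = []
    for cand in pooled:
        if len(kept) >= top_k:
            break
        # Keep a candidate only if it is not a near-duplicate of a kept window.
        if all(_iou(cand, k) < nms_iou for k in kept):
            kept.append(cand)
    return kept


if __name__ == "__main__":
    model_a = [(0.0, 10.0, 0.9), (20.0, 30.0, 0.4)]
    model_b = [(1.0, 10.0, 0.8), (50.0, 60.0, 0.6)]
    print(ensemble_predictions([model_a, model_b]))
```

Pooling by raw score assumes the models' confidence scores are on comparable scales. If they are not, the scores would need to be normalized per model first.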


This repo is maintained by Zhijian Hou. Questions and discussions are welcome via zjhou3-c@my.cityu.edu.hk.


This code is inspired by ActionFormer. We use the extracted egocentric InternVideo and EgoVLP features from the NaQ authors. We thank the authors for their awesome open-source contributions.