srama2512 / NaQ

NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory. CVPR 2023.
https://vision.cs.utexas.edu/projects/naq/
MIT License
ego4d episodic-memory pytorch vision-and-language

Narrations-as-Queries (NaQ)

This repository contains the official PyTorch implementation for our CVPR 2023 paper:

NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory
Santhosh Kumar Ramakrishnan¹        Ziad Al-Halah²        Kristen Grauman¹,³
¹The University of Texas at Austin        ²University of Utah        ³FAIR, Meta AI
Project website: http://vision.cs.utexas.edu/projects/naq

Abstract

Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (freeform text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.

Introduction

Installation

Dataset setup

Benchmarking models on NLQ

We perform NaQ training in two stages: (1) joint training on the NLQ+NaQ dataset with large-batch training, and (2) fine-tuning on the NLQ dataset with standard VSLNet training. The example below benchmarks models on the Ego4D NLQ dataset with EgoVLP features.

VSLNet

Stage 1: Joint training on NLQ+NaQ dataset

cd $NAQ_ROOT
bash VSLNet/scripts/train_naq.sh 0,1,2,3 nlq egovlp experiments/vslnet/egovlp/naq_joint_training 2.5

Stage 2: Fine-tune best checkpoint from stage-1 on NLQ dataset

cd $NAQ_ROOT
PRETRAINED_CKPT=experiments/vslnet/egovlp/naq_joint_training/checkpoints/vslnet_nlq_aug_naq_official_v1_egovlp_128_bert/model/<checkpoint_id>.t7
bash VSLNet/scripts/finetune.sh 0 nlq egovlp experiments/vslnet/egovlp/nlq_finetuning 0.0001 $PRETRAINED_CKPT
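
The `<checkpoint_id>` placeholder above has to be filled in by hand with the stage-1 checkpoint you selected. As a minimal sketch (not part of this repository), the hypothetical helper `resolve_ckpt` below builds the `.t7` path from a model directory and a checkpoint ID and stops early when the file is missing, so that a mistyped ID is caught before fine-tuning starts. The function name and the `CKPT_ID` variable used in the note that follows are assumptions made for illustration.

```bash
# Hypothetical helper, not part of the NaQ codebase.
# Prints <model_dir>/<checkpoint_id>.t7 if that file exists; fails otherwise.
resolve_ckpt() {
  local model_dir="$1" ckpt_id="$2"
  local path="${model_dir}/${ckpt_id}.t7"
  if [ ! -f "$path" ]; then
    echo "checkpoint not found: $path" >&2
    return 1
  fi
  printf '%s\n' "$path"
}
```

With `CKPT_ID` set to the checkpoint you picked, `PRETRAINED_CKPT=$(resolve_ckpt <model_dir> "$CKPT_ID") || exit 1` could replace the manual assignment, where `<model_dir>` is the stage-1 `model` directory shown above. The sketch follows the VSLNet file naming only; ReLER checkpoints use a `model_` prefix and would need the path pattern adjusted.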

Inference:

cd $NAQ_ROOT
bash VSLNet/scripts/infer.sh 0 nlq test egovlp experiments/vslnet/egovlp/nlq_finetuning

To participate in the Ego4D NLQ challenge, submit the inferred predictions at experiments/vslnet/egovlp/nlq_finetuning/checkpoints/vslnet_nlq_official_v1_egovlp_128_bert/model/<checkpoint_id>_test_result.json.
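
Before uploading, it may be worth confirming that inference actually produced a usable file. The sketch below is an assumption-laden convenience, not something the repository provides: the hypothetical function `check_predictions` only verifies that the file exists, is non-empty, and parses as JSON (via Python's standard `json.tool` module). It does not check the schema the challenge server expects.

```bash
# Hypothetical pre-submission check, not part of the NaQ codebase.
# Succeeds only if the file exists, is non-empty, and parses as JSON.
# It does NOT validate the challenge's expected schema.
check_predictions() {
  local pred_file="$1"
  if [ ! -s "$pred_file" ]; then
    echo "missing or empty predictions file: $pred_file" >&2
    return 1
  fi
  # json.tool ships with the Python standard library and exits non-zero on invalid JSON.
  if ! python3 -m json.tool "$pred_file" > /dev/null 2>&1; then
    echo "not valid JSON: $pred_file" >&2
    return 1
  fi
  echo "ok: $pred_file"
}
```

Call it with the path to the `_test_result.json` file named above, after substituting your own checkpoint ID.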

ReLER training

Stage 1: Joint training on NLQ+NaQ dataset

cd $NAQ_ROOT
bash ReLER/scripts/train_naq.sh 0,1,2,3,4,5,6,7 nlq egovlp experiments/reler/egovlp/naq_joint_training 2.5

Stage 2: Fine-tune best checkpoint from stage-1 on NLQ dataset

cd $NAQ_ROOT
PRETRAINED_CKPT=experiments/reler/egovlp/naq_joint_training/video_tef-vlen600_egovlp/model_<checkpoint_id>.t7
bash ReLER/scripts/finetune.sh 0 nlq egovlp experiments/reler/egovlp/nlq_finetuning 0.00001 $PRETRAINED_CKPT

Inference:

cd $NAQ_ROOT
bash ReLER/scripts/infer.sh 0 test egovlp experiments/reler/egovlp/nlq_finetuning/video_tef-vlen600_egovlp/model_<checkpoint_id>.t7

To participate in the Ego4D NLQ challenge, submit the inferred predictions at experiments/reler/egovlp/nlq_finetuning/video_tef-vlen600_egovlp/preds/<checkpoint_id>_test_preds.json.

To train on SlowFast / InternVideo features, replace egovlp with slowfast or internvideo in the commands above. To train on TACoS, replace nlq with tacos.
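
The substitution rule can be sketched as a small dry-run helper. The function below is hypothetical (it is not part of this repository): it validates the dataset and feature names mentioned in this README and prints the VSLNet stage-1 command without running it. The GPU list and the trailing `2.5` are copied unchanged from the EgoVLP example, and the output directory simply mirrors that example's naming pattern; whether those settings are the right choice for other features or for TACoS is an assumption you should check against the scripts.

```bash
# Hypothetical dry-run helper, not part of the NaQ codebase.
# Prints (does not execute) the VSLNet stage-1 command for a dataset/feature pair.
naq_stage1_cmd() {
  local dataset="$1" feature="$2"
  case "$dataset" in
    nlq|tacos) ;;
    *) echo "unknown dataset: $dataset" >&2; return 1 ;;
  esac
  case "$feature" in
    egovlp|slowfast|internvideo) ;;
    *) echo "unknown feature type: $feature" >&2; return 1 ;;
  esac
  # GPU list and the final argument are taken as-is from the EgoVLP example above.
  echo "bash VSLNet/scripts/train_naq.sh 0,1,2,3 ${dataset} ${feature} experiments/vslnet/${feature}/naq_joint_training 2.5"
}
```

Running `naq_stage1_cmd nlq slowfast` prints the stage-1 command with `slowfast` in place of `egovlp`; the same pattern applies to the fine-tuning and inference commands.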

Pretrained models

We provide models pretrained using NaQ for different combinations of architectures and features here. These checkpoints can be used to reproduce results from the paper.

Ego4D NLQ 2023 challenge

References

Please cite our work if you find our augmentation technique or this codebase useful.

@inproceedings{ramakrishnan2023naq,
    author       = {Ramakrishnan, Santhosh K. and Al-Halah, Ziad and Grauman, Kristen},
    booktitle    = {Computer Vision and Pattern Recognition (CVPR), 2023 IEEE Conference on},
    title        = {NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory},
    year         = {2023},
    organization = {IEEE},
}

We also encourage you to cite these other references depending on whether you use the corresponding dataset, features or architecture.

# Ego4D dataset
@inproceedings{grauman2022ego4d,
  title={Ego4d: Around the world in 3,000 hours of egocentric video},
  author={Grauman, Kristen and Westbury, Andrew and Byrne, Eugene and Chavis, Zachary and Furnari, Antonino and Girdhar, Rohit and Hamburger, Jackson and Jiang, Hao and Liu, Miao and Liu, Xingyu and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18995--19012},
  year={2022}
}

# SlowFast features
@inproceedings{feichtenhofer2019slowfast,
  title={Slowfast networks for video recognition},
  author={Feichtenhofer, Christoph and Fan, Haoqi and Malik, Jitendra and He, Kaiming},
  booktitle={Proceedings of the IEEE/CVF international conference on computer vision},
  pages={6202--6211},
  year={2019}
}

# EgoVLP features
@inproceedings{linegocentric,
  title={Egocentric Video-Language Pretraining},
  author={Lin, Kevin Qinghong and Wang, Jinpeng and Soldan, Mattia and Wray, Michael and Yan, Rui and Xu, Eric Zhongcong and Gao, Denial and Tu, Rong-Cheng and Zhao, Wenzhe and Kong, Weijie and others},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

# InternVideo features
@article{chen2022ego4d,
  title={InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges},
  author={Chen, Guo and Xing, Sen and Chen, Zhe and Wang, Yi and Li, Kunchang and Li, Yizhuo and Liu, Yi and Wang, Jiahao and Zheng, Yin-Dong and Huang, Bingkun and others},
  journal={arXiv preprint arXiv:2211.09529},
  year={2022}
}

# CLIP features
@inproceedings{radford2021learning,
  title={Learning transferable visual models from natural language supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
  booktitle={International conference on machine learning},
  pages={8748--8763},
  year={2021},
  organization={PMLR}
}

# VSLNet architecture
@inproceedings{zhang2020span,
    title = "Span-based Localizing Network for Natural Language Video Localization",
    author = "Zhang, Hao  and Sun, Aixin  and Jing, Wei  and Zhou, Joey Tianyi",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.585",
    pages = "6543--6554"
}

# ReLER architecture
@article{liu2022reler,
  title={ReLER@ ZJU-Alibaba Submission to the Ego4D Natural Language Queries Challenge 2022},
  author={Liu, Naiyuan and Wang, Xiaohan and Li, Xiaobo and Yang, Yi and Zhuang, Yueting},
  journal={arXiv preprint arXiv:2207.00383},
  year={2022}
}