This repository contains the code to reproduce the results of the paper "Where did I leave my keys? — Episodic-Memory-Based Question Answering on Egocentric Videos". See our paper for more details.
Humans have a remarkable ability to organize, compress, and retrieve episodic memories throughout their daily life. Current AI systems, however, lack comparable capabilities: they are mostly constrained to analyses with access to the raw input sequence, assuming an unlimited amount of data storage, which is not feasible in realistic deployment scenarios. For instance, existing Video Question Answering (VideoQA) models typically reason over the video while already being aware of the question, thus requiring the complete video to be stored in case the question is not known in advance.

In this paper, we address this challenge with three main contributions: First, we propose the Episodic Memory Question Answering (EMQA) task as a specialization of VideoQA. Specifically, EMQA models are constrained to keep only a constant-sized representation of the video input, which automatically limits the computation requirements at query time. Second, we introduce a new egocentric VideoQA dataset called QaEgo4D. It is by far the largest egocentric VideoQA dataset, and its video lengths are unprecedented among VideoQA datasets in general. Third, we present extensive experiments on the new dataset, comparing various baseline models in both the VideoQA and the EMQA setting. To facilitate future research on egocentric VideoQA as well as episodic memory representation and retrieval, we publish our code and dataset.
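To make the EMQA constraint concrete, here is a minimal, hypothetical interface sketch (not the model from the paper): the model may only update a constant-sized memory while streaming the video, and sees the question only afterwards. All names (`EMQAModel`, `observe`, `answer`), the feature dimensionality, and the toy ring-buffer update are illustrative.

```python
import numpy as np

class EMQAModel:
    """Illustrative EMQA interface: the memory has a fixed size regardless of video length."""

    def __init__(self, memory_slots: int = 64, feature_dim: int = 2304):
        # Constant-sized memory, independent of how long the video is.
        # (feature_dim is an assumption; adapt it to your feature extractor.)
        self.memory = np.zeros((memory_slots, feature_dim), dtype=np.float32)
        self.t = 0

    def observe(self, frame_feature: np.ndarray) -> None:
        # Streaming update. A toy ring-buffer write stands in for a
        # learned memory-update rule; older content gets overwritten.
        self.memory[self.t % len(self.memory)] = frame_feature
        self.t += 1

    def answer(self, question: str) -> str:
        # The question is revealed only after the video has been consumed,
        # so only self.memory is available at query time.
        raise NotImplementedError("Plug a QA head over the memory in here.")
```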
To use the QaEgo4D dataset introduced in our paper, please follow these steps:
Run

```bash
python3 tools/create_pure_videoqa_json.py --ego4d /path/to/ego4d --qaego4d /path/to/qaego4d/answers.json
```

Here, `/path/to/ego4d` is the directory where you placed the Ego4D download, containing the `v1/annotations/nlq_{train,val}.json` files. This produces `/path/to/qaego4d/annotations.{train,val,test}.json`.

The `annotations.*.json` files are JSON arrays, where each object has the following structure:
```json
{
    "video_id": "abcdef00-0000-0000-0000-123456789abc",
    "sample_id": "12345678-1234-1234-1234-123456789abc_3",
    "question": "Where did I leave my keys?",
    "answer": "on the table",
    "moment_start_frame": 42,
    "moment_end_frame": 53
}
```
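As a quick sanity check, a split can be loaded with plain Python. The path below assumes you have already moved the files into `datasets/ego4d` as described further down; the field names are exactly those shown above.

```python
import json

# Load one of the produced annotation splits.
with open("datasets/ego4d/annotations.train.json") as f:
    samples = json.load(f)  # a JSON array of objects with the fields shown above

print(len(samples), "QA samples")
first = samples[0]
print(first["question"], "->", first["answer"])
print("ground-truth moment:", first["moment_start_frame"], "to", first["moment_end_frame"])
```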
In order to reproduce the experiments, prepare your workspace as follows. First, install the dependencies listed in `requirements.txt` (e.g. with `pip install -r requirements.txt`).
Then extract the clip features:

```bash
python tools/extract_ego4d_clip_features.py --annotation_file /path/to/ego4d/v1/annotations/nlq_train.json --video_features_dir /path/to/ego4d/v1/slowfast8x8_r101_k400 --output_dir /choose/your/clip_feature_dir
```

and do the same again with `nlq_val.json`.
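If you prefer to script both extraction runs, a short Python wrapper such as the following works; it simply replays the documented command once per split, with the same placeholder paths as above.

```python
import subprocess

# Run the documented extraction command once per annotation split.
for split in ("train", "val"):
    subprocess.run(
        [
            "python", "tools/extract_ego4d_clip_features.py",
            "--annotation_file", f"/path/to/ego4d/v1/annotations/nlq_{split}.json",
            "--video_features_dir", "/path/to/ego4d/v1/slowfast8x8_r101_k400",
            "--output_dir", "/choose/your/clip_feature_dir",
        ],
        check=True,  # abort if a split fails
    )
```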
Next, aggregate the extracted features:

```bash
python tools/aggregate_features_to_hdf5.py /choose/your/clip_feature_dir
```

This produces `slowfast8x8_r101_k400.hdf5` in the current working directory. Then move the resulting files (`annotations.*.json` and `slowfast8x8_r101_k400.hdf5`) into `datasets/ego4d`.
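To verify the aggregated file, it can be inspected with `h5py`. The key layout below (one dataset per clip ID) is an assumption made for illustration; adjust it to whatever the file actually contains.

```python
import h5py

with h5py.File("datasets/ego4d/slowfast8x8_r101_k400.hdf5", "r") as f:
    clip_ids = list(f.keys())        # assumed: one dataset per clip ID
    print(len(clip_ids), "feature entries")
    feats = f[clip_ids[0]][()]       # load one entry fully into memory
    print(clip_ids[0], feats.shape, feats.dtype)
```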
To run an experiment, use `bash experiment/run.sh`. All configuration files can be found in the `config` directory.
If you use QaEgo4D or our code, please cite:

```bibtex
@InProceedings{Baermann_2022_CVPR,
    author    = {B\"armann, Leonard and Waibel, Alex},
    title     = {Where Did I Leave My Keys? - Episodic-Memory-Based Question Answering on Egocentric Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {1560-1568}
}
```