Project Page | Paper | 🤗 Dataset | 🤗 Checkpoints
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task requires not only answering visual questions, but also localizing multiple relevant time intervals within the video as visual evidence.
We develop an automated pipeline to mine multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction-tuning. We then propose a novel architecture, termed GeLM, that leverages the world-knowledge reasoning capabilities of multi-modal large language models (LLMs) while incorporating a grounding module to retrieve temporal evidence from the video with flexible grounding tokens.
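To make the grounding-token idea concrete, here is a minimal, hypothetical sketch of what such a grounding module could look like. This is not the released GeLM code; the token name, head design, and dimensions are all assumptions.

```python
import torch
import torch.nn as nn

class GroundingHead(nn.Module):
    """Illustrative only: regresses a time interval from each grounding token.

    Assumes the LLM emits special grounding tokens (e.g. <grd>) whose last-layer
    hidden states carry the evidence location; all names here are hypothetical.
    """

    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),  # (start, end), normalized to [0, 1]
        )

    def forward(self, hidden_states: torch.Tensor, grounding_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the LLM's last layer
        # grounding_mask: (batch, seq_len) bool mask marking grounding-token positions
        grd_states = hidden_states[grounding_mask]   # (num_grd_tokens, hidden_size)
        intervals = self.mlp(grd_states).sigmoid()   # normalized (start, end) pairs
        return intervals  # scale by video duration to obtain seconds
```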
```
MultiHop-EgoQA/
├── baseline/                    # Our baseline method
│   ├── checkpoints/             # Checkpoints of LLMs
│   │   └── vicuna-v1-3-7b/
│   ├── datasets/                # Save path of datasets
│   │   ├── multihop_qa/
│   │   │   ├── features/
│   │   │   └── train_annotations.json
│   │   ├── activitynet-captions/
│   │   │   ├── intern_feature/
│   │   │   └── val_1.json
│   │   └── temporal_reasoning/
│   ├── gelm/                    # Implementation of the GeLM model
│   ├── llava/                   # LLaVA code base
│   ├── scripts/                 # Scripts for evaluating the baseline method
│   │   ├── eval_multihop_qa.sh  # Evaluate GeLM on MultiHop-EgoQA
│   │   └── eval_rtl.sh          # Evaluate GeLM on ActivityNet-RTL
│   └── pyproject.toml           # Configuration file
│
└── benchmark/                   # Benchmarking tools and metrics
    ├── metrics/                 # Metrics calculation
    └── zero-shot-inference/     # Zero-shot inference code
```
### Dataset

See Dataset Preparation.
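Once the data is downloaded, a quick sanity check that the annotations are in place (assuming `train_annotations.json` is a standard JSON file at the path shown in the repository layout above; the exact schema is documented in Dataset Preparation):

```python
import json

# Path taken from the repository layout above.
with open("baseline/datasets/multihop_qa/train_annotations.json") as f:
    annotations = json.load(f)

print(f"Loaded {len(annotations)} training annotations")
# Peek at one record; the schema itself is described in Dataset Preparation.
sample = annotations[0] if isinstance(annotations, list) else next(iter(annotations.items()))
print(sample)
```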
### Installation

Training setup: Ubuntu 18.04, CUDA 12.1, 4× NVIDIA H800 (80 GB).

```bash
cd baseline
conda create -n gelm python=3.10 -y
conda activate gelm
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install ninja
pip install flash-attn --no-build-isolation
```
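If the `flash-attn` build succeeds, a quick import check (an optional sanity check, not part of the repo's scripts) confirms that PyTorch sees CUDA and that the extension loads:

```python
import torch
import flash_attn  # import fails if flash-attn did not build against your CUDA toolkit

print(torch.__version__, torch.version.cuda, torch.cuda.is_available())
print(flash_attn.__version__)
```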
Download the LLM checkpoints and save them under `checkpoints/`:

```bash
git clone https://huggingface.co/lmsys/vicuna-13b-v1.3
```
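To verify the download, the checkpoint can be loaded with the standard `transformers` API (a sanity check only; the training scripts load the weights themselves, and `device_map="auto"` additionally requires `accelerate`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "checkpoints/vicuna-13b-v1.3"  # folder produced by the git clone above
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto")
print(model.config.model_type)  # "llama" for Vicuna checkpoints
```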
### Training

```bash
# Training on MultiHop-EgoQA
bash scripts/finetune_multihop_qa.sh

# Training on ActivityNet-RTL
bash scripts/finetune_rtl.sh

# Training on the mixed dataset
bash scripts/finetune_mixed.sh
```
### Checkpoints
We provide checkpoints of GeLM-7B trained on MultiHop-EgoQA and on ActivityNet-RTL, respectively, on [Hugging Face](https://huggingface.co/SurplusDeficit/GeLM).
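The checkpoints can also be fetched programmatically with `huggingface_hub` (how the two checkpoints are organized inside the Hub repo may vary, so inspect the downloaded folder):

```python
from huggingface_hub import snapshot_download

# Downloads the full GeLM checkpoint repo and returns the local cache path.
local_dir = snapshot_download(repo_id="SurplusDeficit/GeLM")
print(local_dir)
```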
### Evaluation
1. Evaluation on MultiHop-EgoQA

```bash
cd benchmark/metrics
bash evaluate.sh
```

2. Evaluation on ActivityNet-RTL

```bash
cd baseline
bash scripts/eval_rtl.sh
```
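The metrics scripts score both the generated answers and the temporal grounding. As a rough illustration of the grounding side, interval IoU is the standard building block for such metrics; the sketch below is illustrative and not the repo's exact matching scheme:

```python
def interval_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_best_iou(preds: list, gts: list) -> float:
    """Match each ground-truth interval to its best-IoU prediction (illustrative)."""
    return sum(max((interval_iou(p, g) for p in preds), default=0.0) for g in gts) / len(gts)

print(interval_iou((2.0, 7.0), (5.0, 10.0)))  # 0.25
```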
### Acknowledgements

Our baseline method implementation is adapted from LITA. The zero-shot evaluation code references the official repositories of TimeChat and VTimeLLM, as well as the Hugging Face documentation of InternVL2, LLaVA-NeXT-Video, LLaVA-v1.6, and Meta-Llama-3.1, and the OpenAI GPT-4o documentation for video understanding.