This is the official implementation of Flipped-VQA (EMNLP 2023) (arxiv) (demo).
Dohwan Ko1, Ji Soo Lee1, Wooyoung Kang2, Byungseok Roh2, Hyunwoo J. Kim1.
1Department of Computer Science and Engineering, Korea University 2Kakao Brain
To install requirements, run:
git clone https://github.com/mlvlab/Flipped-VQA.git
cd Flipped-VQA
mkdir pretrained
conda create -n flipped-vqa python=3.8
conda activate flipped-vqa
sh setup.sh
git lfs install
git clone https://huggingface.co/datasets/ikodoh/Flipped-VQA-Data
mv ./Flipped-VQA-Data/data ./
mv ./Flipped-VQA-Data/checkpoint ./
unzip ./data/tvqa/tvqa_subtitles.zip -d ./data/tvqa
rm -rf Flipped-VQA-Data ./data/tvqa/tvqa_subtitles.zip
./pretrained
../pretrained
└─ llama
|─ 7B
| |─ consolidated.00.pth
| └─ params.json
|─ 13B
| :
|─ 33B
| :
└─ tokenizer.model
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3.5 --tau 100. --max_feats 10 --dataset nextqa \
--blr 9e-2 --weight_decay 0.14 --output_dir ./checkpoint/nextqa --accum_iter 2 --vaq --qav
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 128 --batch_size 8 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset star \
--blr 9e-2 --weight_decay 0.16 --output_dir ./checkpoint/star --accum_iter 1 --vaq --qav
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 384 --batch_size 2 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset dramaqa \
--blr 9e-2 --weight_decay 0.10 --output_dir ./checkpoint/dramaqa --accum_iter 8 --vaq --qav
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 train.py --model 7B \
--max_seq_len 256 --batch_size 4 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset vlep \
--blr 6e-2 --weight_decay 0.20 --output_dir ./checkpoint/vlep --accum_iter 8 --sub --qav
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 train.py --model 7B \
--max_seq_len 650 --batch_size 1 --epochs 5 --warmup_epochs 2 --bias 3 --tau 100. --max_feats 10 --dataset tvqa \
--blr 7e-2 --weight_decay 0.02 --output_dir ./checkpoint/tvqa --dataset tvqa --accum_iter 4 --sub --vaq --qav
The fine-tuned checkpoints on each dataset are here.
From the training command, simply replace train.py
with eval.py
and add --resume ./your/checkpoint.pth
.
This repo is built upon LLaMA-Adapter.
@inproceedings{ko2023large,
title={Large Language Models are Temporal and Causal Reasoners for Video Question Answering},
author={Ko, Dohwan and Lee, Ji Soo and Kang, Wooyoung and Roh, Byungseok and Kim, Hyunwoo J},
booktitle={EMNLP},
year={2023}
}