patrick-tssn / VSTAR

[ACL2023] VSTAR is a multimodal dialogue dataset with scene and topic transition information
https://vstar-benchmark.github.io/

VSTAR

This is the official implementation of the ACL 2023 paper "VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions"

Dataset

Schedule

Dialogues

Supported languages: English, Simplified Chinese

Storage: Train (196M); Valid (11.6M); Test (24M)

Links: BaiduNetDisk or GoogleDrive

         clips     dialogues   scenes/clip   topics/clip
Train    172,041   4,319,381   2.42          3.68
Val        9,753     250,311   2.64          4.29
Test       9,779     250,436   2.56          4.12
Each split (train.json / valid.json / test.json) is a JSON file with the following structure:

{
    "dialogs":[
        {
            "clip_id": "Friends_S01E01_clip_000",
            "dialog": ["hi", ...],
            "scene": [1, 1, 1, 1, 1, 1, 2, 2, ...],
            "session": [1, 1, 1, 2, 2, 2, 3, 3, ...]
        },
        ...
    ]
}
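
A minimal sketch of reading a split with the standard json module (assuming train.json has been downloaded into the working directory and follows the structure above):

import json

with open("train.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Each clip carries parallel per-utterance scene and session (topic) labels.
for clip in data["dialogs"][:3]:
    print(clip["clip_id"], len(clip["dialog"]), clip["scene"][-1], clip["session"][-1])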

Feature

Storage: RCNN (246.2G); ResNet (109G)

Links: BaiduNetDisk

File Structure:

# [name of TV show]_S[season]E[episode]_clip_[clip id].npy
├── Friends_S01E01
│   ├── Friends_S01E01_clip_000.npy
│   ├── Friends_S01E01_clip_001.npy
│   └── ...
├── ...
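
A small sketch of indexing the feature files once unpacked (the root directory name feature_root below is a placeholder, not a path shipped with the release):

from pathlib import Path

feature_root = Path("rcnn")  # placeholder: wherever the archive was extracted

# Map clip ids such as "Friends_S01E01_clip_000" to their .npy files.
clip2path = {p.stem: p for p in feature_root.glob("*/*.npy")}
print(len(clip2path), "feature files found")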

ResNet:

# numpy.load("Friends_S01E01_clip_000.npy")
(num_of_frames * 1000)

RCNN:

# numpy.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
{
    "feature": (9 * num_of_frames * 2048),  # array(float32), features of the top 9 objects per frame
    "size": (num_of_frames * 2),            # list(int), size of the original frame
    "box": (9 * num_of_frames * 4),         # array(float32), bounding boxes
    "obj_id": (9 * num_of_frames),          # list(int), object ids
    "obj_conf": (9 * num_of_frames),        # array(float32), object confidence scores
    "obj_num": (num_of_frames),             # list(int), number of objects per frame
}
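
A minimal sketch of loading both feature types for a single clip (the path is the example file above; adjust it to wherever the ResNet and RCNN archives were unpacked, since both dumps use the same file names in separate directories):

import numpy as np

# ResNet features: one pooled vector per sampled frame,
# roughly num_of_frames x 1000 as described above.
resnet_feat = np.load("Friends_S01E01_clip_000.npy")
print(resnet_feat.shape)

# RCNN features: a pickled dict with the keys listed above.
rcnn = np.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
print(rcnn["feature"].shape, rcnn["box"].shape, rcnn["obj_num"][:5])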

For feature extraction, please refer to OpenViDial_extract_features.

Installation

pip install -r requirements.txt

Scene Segmentation

Move train.json, valid.json, and test.json to the inputs/full directory.

Run the following script to convert the original annotations into the binary segmentation format used by our baseline (see the paper for details); a rough illustration of this format follows the command below.

cd inputs/full
python preprocess.py
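
The binary format is, roughly, one boundary label per utterance: 1 where a new scene (or session) begins, 0 otherwise. The sketch below only illustrates that idea; it is not the actual preprocess.py logic:

# Illustration only: turn per-utterance segment ids into binary boundary labels.
def to_boundaries(segment_ids):
    return [0] + [int(cur != prev) for prev, cur in zip(segment_ids, segment_ids[1:])]

print(to_boundaries([1, 1, 1, 1, 1, 1, 2, 2]))  # -> [0, 0, 0, 0, 0, 0, 1, 0]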
python train_seg.py \
    --video 1 \
    --exp_set EXP_LOG \
    --train_batch_size 4

python generate_seg.py \
    --ckptid SAVED_CKPT_ID \
    --gpuid 0 \
    --exp_set EXP_LOG \
    --video 1

Topic Segmentation

python train_seg.py \
    --video 0 \
    --exp_set EXP_LOG \
    --train_batch_size 4

python generate_seg.py \
    --ckptid SAVED_CKPT_ID \
    --gpuid 0 \
    --exp_set EXP_LOG \
    --video 0

Dialogue Generation

To use coco_caption for evaluation, run the following script to generate the reference file:

cd inputs/full
python coco_caption_reformat.py

For evaluation details, please refer to https://github.com/tylin/coco-caption
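
As a hedged illustration of how such metrics are typically computed, the snippet below uses the pip-installable pycocoevalcap package (a maintained port of coco-caption); the single-example dicts here are made up, whereas the real reference dict comes from coco_caption_reformat.py:

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both scorers expect {example_id: [sentence, ...]} dicts.
refs = {"0": ["how are you doing today"]}  # ground-truth responses
hyps = {"0": ["how are you today"]}        # generated responses

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)
cider_score, _ = Cider().compute_score(refs, hyps)
print("BLEU-1..4:", bleu_scores, "CIDEr:", cider_score)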

python train_gen.py \
    --train_batch_size 4 \
    --model bart \
    --exp_set EXP_LOG \
    --video 1 \
    --fea_type resnet

python generate.py \
    --ckptid SAVED_CKPT_ID \
    --gpuid 0 \
    --exp_set EXP_LOG \
    --video 1 \
    --sess 1 \
    --batch_size 4

Citation

@misc{wang2023vstar,
    title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions},
    author={Yuxuan Wang and Zilong Zheng and Xueliang Zhao and Jinpeng Li and Yueqian Wang and Dongyan Zhao},
    year={2023},
    eprint={2305.18756},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}