patrick-tssn / VSTAR

[ACL2023] VSTAR is a multimodal dialogue dataset with scene and topic transition information
https://vstar-benchmark.github.io/

VSTAR

This is the official implementation of the ACL 2023 paper "VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions"

Dataset

Schedule

Dialogues

Supported languages: English, Simplified Chinese

Storage: Train (196M); Valid (11.6M); Test (24M)

Links: BaiduNetDisk or GoogleDrive

         clips     dialogues   scenes/clip   topics/clip
Train    172,041   4,319,381   2.42          3.68
Val        9,753     250,311   2.64          4.29
Test       9,779     250,436   2.56          4.12
Each split (train.json / valid.json / test.json) is a JSON file with the following structure:

{
    "dialogs":[
        {
            "clip_id": "Friends_S01E01_clip_000",
            "dialog": ["hi", ...],
            "scene": [1, 1, 1, 1, 1, 1, 2, 2, ...],
            "session": [1, 1, 1, 2, 2, 2, 3, 3, ...]
        },
        ...
    ]
}
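
A minimal sketch of reading a split with the standard json module (assuming train.json has been downloaded into the working directory and follows the structure above):

import json

with open("train.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Each clip carries parallel per-utterance scene and session (topic) labels.
for clip in data["dialogs"][:3]:
    print(clip["clip_id"], len(clip["dialog"]), clip["scene"][-1], clip["session"][-1])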

Feature

Storage: RCNN (246.2G); ResNet (109G)

Links: BaiduNetDisk

File Structure:

# [name of TV show]_S[season]E[episode]_clip_[clip id].npy
├── Friends_S01E01
│   ├── Friends_S01E01_clip_000.npy
│   ├── Friends_S01E01_clip_001.npy
│   └── ...
├── ...
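
A small sketch of indexing the feature files once unpacked (the root directory name feature_root below is a placeholder, not a path shipped with the release):

from pathlib import Path

feature_root = Path("rcnn")  # placeholder: wherever the archive was extracted

# Map clip ids such as "Friends_S01E01_clip_000" to their .npy files.
clip2path = {p.stem: p for p in feature_root.glob("*/*.npy")}
print(len(clip2path), "feature files found")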

ResNet:

# numpy.load("Friends_S01E01_clip_000.npy")
(num_of_frames * 1000)

RCNN:

# numpy.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
{
    "feature": (9 * num_of_frames * 2048),  # array(float32), features of the top 9 objects per frame
    "size": (num_of_frames * 2),            # list(int), size of the original frame
    "box": (9 * num_of_frames * 4),         # array(float32), bounding boxes
    "obj_id": (9 * num_of_frames),          # list(int), object ids
    "obj_conf": (9 * num_of_frames),        # array(float32), object confidence scores
    "obj_num": (num_of_frames),             # list(int), number of objects per frame
}
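
A minimal sketch of loading both feature types for a single clip (the path is the example file above; adjust it to wherever the ResNet and RCNN archives were unpacked, since both dumps use the same file names in separate directories):

import numpy as np

# ResNet features: one pooled vector per sampled frame,
# roughly num_of_frames x 1000 as described above.
resnet_feat = np.load("Friends_S01E01_clip_000.npy")
print(resnet_feat.shape)

# RCNN features: a pickled dict with the keys listed above.
rcnn = np.load("Friends_S01E01_clip_000.npy", allow_pickle=True).item()
print(rcnn["feature"].shape, rcnn["box"].shape, rcnn["obj_num"][:5])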

For feature extraction, please refer to OpenViDial_extract_features.

Installation

pip install -r requirements.txt

Scene Segmentation

Move train.json, valid.json, and test.json to the inputs/full directory.

Run the following script to convert the original annotations into the binary segmentation format used by our baseline (see the paper for details); a rough illustration of this format follows the command below.

cd inputs/full
python preprocess.py
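
The binary format is, roughly, one boundary label per utterance: 1 where a new scene (or session) begins, 0 otherwise. The sketch below only illustrates that idea; it is not the actual preprocess.py logic:

# Illustration only: turn per-utterance segment ids into binary boundary labels.
def to_boundaries(segment_ids):
    return [0] + [int(cur != prev) for prev, cur in zip(segment_ids, segment_ids[1:])]

print(to_boundaries([1, 1, 1, 1, 1, 1, 2, 2]))  # -> [0, 0, 0, 0, 0, 0, 1, 0]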
python train_seg.py \
    --video 1 \
    --exp_set EXP_LOG \
    --train_batch_size 4

python generate_seg.py \
    --ckptid SAVED_CKPT_ID \
    --gpuid 0 \
    --exp_set EXP_LOG \
    --video 1

Topic Segmentation

python train_seg.py \
    --video 0 \
    --exp_set EXP_LOG \
    --train_batch_size 4

python generate_seg.py \
    --ckptid SAVED_CKPT_ID \
    --gpuid 0 \
    --exp_set EXP_LOG \
    --video 0

Dialogue Generation

To use coco_caption for evaluation, run the following script to generate the reference file:

cd inputs/full
python coco_caption_reformat.py

For evaluation details, please refer to https://github.com/tylin/coco-caption
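
As a hedged illustration of how such metrics are typically computed, the snippet below uses the pip-installable pycocoevalcap package (a maintained port of coco-caption); the single-example dicts here are made up, whereas the real reference dict comes from coco_caption_reformat.py:

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Both scorers expect {example_id: [sentence, ...]} dicts.
refs = {"0": ["how are you doing today"]}  # ground-truth responses
hyps = {"0": ["how are you today"]}        # generated responses

bleu_scores, _ = Bleu(4).compute_score(refs, hyps)
cider_score, _ = Cider().compute_score(refs, hyps)
print("BLEU-1..4:", bleu_scores, "CIDEr:", cider_score)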

python train_gen.py \
    --train_batch_size 4 \
    --model bart \
    --exp_set EXP_LOG \
    --video 1 \
    --fea_type resnet

python generate.py \
    --ckptid SAVED_CKPT_ID \
    --gpuid 0 \
    --exp_set EXP_LOG \
    --video 1 \
    --sess 1 \
    --batch_size 4

Citation

@misc{wang2023vstar,
    title={VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions},
    author={Yuxuan Wang and Zilong Zheng and Xueliang Zhao and Jinpeng Li and Yueqian Wang and Dongyan Zhao},
    year={2023},
    eprint={2305.18756},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}