# Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding

NeurIPS 2022, Spotlight Presentation, [[arXiv](https://arxiv.org/abs/2209.13306)] [[BibTeX](#citation)]

## Introduction

We propose STCAT, a new one-stage spatio-temporal video grounding method that achieves state-of-the-art performance on the VidSTG and HC-STVG benchmarks. This repository provides the PyTorch implementation for model training and evaluation. For more details, please refer to our paper.


## Dataset Preparation

The datasets used in this work are placed in the `data` folder with the following structure:

```
data
|_ vidstg
|  |_ videos
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ vstg_annos
|  |  |_ train.json
|  |  |_ ...
|  |_ sent_annos
|  |  |_ train_annotations.json
|  |  |_ ...
|  |_ data_cache
|  |  |_ ...
|_ hc-stvg
|  |_ v1_video
|  |  |_ [video name 0].mp4
|  |  |_ [video name 1].mp4
|  |  |_ ...
|  |_ annos
|  |  |_ hcstvg_v1
|  |  |  |_ train.json
|  |  |  |_ test.json
|  |_ data_cache
|  |  |_ ...
```

You can prepare this structure with the following steps:

**VidSTG**

* Download the VidSTG videos from [VidOR](https://xdshang.github.io/docs/vidor.html) and put them into `data/vidstg/videos`. The original video download URL given by the VidOR dataset provider is broken; you can download the VidSTG videos from [this link](https://disk.pku.edu.cn/link/AA93DEAF3BBC694E52ACC5A23A9DC3D03B) instead.
* Download the text and temporal annotations from the [VidSTG Repo](https://github.com/Guaranteer/VidSTG-Dataset) and put them into `data/vidstg/sent_annos`.
* Download the bounding-box annotations from [here](https://disk.pku.edu.cn/link/AA9BD598C845DC43A4B6A0D35268724E4B) and put them into `data/vidstg/vstg_annos`.
* For loading efficiency, we provide a dataset cache for VidSTG [here](https://disk.pku.edu.cn/link/AAA0FA082DEB3D47FCA92F3BF8775EA3BC). You can download it and put it into `data/vidstg/data_cache`.

**HC-STVG**

* Download version 1 of the HC-STVG videos and annotations from [HC-STVG](https://github.com/tzhhhh123/HC-STVG), then put them into `data/hc-stvg/v1_video` and `data/hc-stvg/annos/hcstvg_v1`, respectively.
* For loading efficiency, we provide a dataset cache for HC-STVG [here](https://disk.pku.edu.cn/link/AA66258EA52A1E435B815C4BC10E88925D). You can download it and put it into `data/hc-stvg/data_cache`.

## Setup

### Requirements

The code is tested with PyTorch 1.10.0; other versions may be compatible as well. You can install the requirements with the following commands:

```shell
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
```

Then, download [FFMPEG 4.1.9](https://ffmpeg.org/download.html) and add it to the `PATH` environment variable so the videos can be loaded.

### Pretrained Checkpoints

Our model uses the ResNet-101 backbone pretrained by MDETR. Please download the pretrained weights from [here](https://github.com/ashkamath/mdetr) and save them as `data/pretrained/pretrained_resnet101_checkpoint.pth`.

## Usage

> Note: Use one video per GPU during training and evaluation. More than one video per GPU is untested and may cause bugs.
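Before launching training, it can help to quickly confirm that FFMPEG, PyTorch/CUDA, the pretrained backbone, and the dataset layout described above are all in place. The shell sketch below is only an optional pre-flight check, not part of the official scripts; the paths follow the directory structure above, so adjust them if your layout differs.

```shell
# Optional pre-flight check (not part of the official pipeline).
# Verifies the tools and file layout described above before training.

# FFMPEG must be on PATH for video loading.
command -v ffmpeg >/dev/null && ffmpeg -version | head -n 1 \
    || echo "MISSING: ffmpeg not found on PATH"

# PyTorch and CUDA availability (the code is tested with PyTorch 1.10.0).
python3 -c "import torch; print('torch', torch.__version__, '| cuda available:', torch.cuda.is_available())"

# Pretrained MDETR ResNet-101 backbone.
test -f data/pretrained/pretrained_resnet101_checkpoint.pth \
    || echo "MISSING: data/pretrained/pretrained_resnet101_checkpoint.pth"

# VidSTG layout (videos, annotations, cache).
for p in data/vidstg/videos \
         data/vidstg/vstg_annos/train.json \
         data/vidstg/sent_annos/train_annotations.json \
         data/vidstg/data_cache; do
    test -e "$p" || echo "MISSING: $p"
done

# HC-STVG layout (videos, annotations, cache).
for p in data/hc-stvg/v1_video \
         data/hc-stvg/annos/hcstvg_v1/train.json \
         data/hc-stvg/annos/hcstvg_v1/test.json \
         data/hc-stvg/data_cache; do
    test -e "$p" || echo "MISSING: $p"
done
```

If nothing is reported as missing, the training and evaluation commands below should be less likely to fail on environment or path errors.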
### Training

For training on an 8-GPU node, you can use the following script:

```shell
# run for VidSTG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/train_net.py \
    --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
    --use-seed \
    OUTPUT_DIR data/vidstg/checkpoints/output \
    TENSORBOARD_DIR data/vidstg/checkpoints/output/tensorboard \
    INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/train_net.py \
    --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
    --use-seed \
    OUTPUT_DIR data/hc-stvg/checkpoints/output \
    TENSORBOARD_DIR data/hc-stvg/checkpoints/output/tensorboard \
    INPUT.RESOLUTION 448
```

For more training options (e.g., other hyper-parameters), please modify the configuration files `experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml` and `experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml`.

### Evaluation

To evaluate the trained STCAT models, please run the following scripts:

```shell
# run for VidSTG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/VidSTG/e2e_STCAT_R101_VidSTG.yaml" \
    --use-seed \
    MODEL.WEIGHT data/vidstg/checkpoints/stcat_res448/vidstg_res448.pth \
    OUTPUT_DIR data/vidstg/checkpoints/output \
    INPUT.RESOLUTION 448

# run for HC-STVG
python3 -m torch.distributed.launch \
    --nproc_per_node=8 \
    scripts/test_net.py \
    --config-file "experiments/HC-STVG/e2e_STCAT_R101_HCSTVG.yaml" \
    --use-seed \
    MODEL.WEIGHT data/hc-stvg/checkpoints/stcat_res448/hcstvg_res448.pth \
    OUTPUT_DIR data/hc-stvg/checkpoints/output \
    INPUT.RESOLUTION 448
```

## Model Zoo

We provide our trained checkpoints with the ResNet-101 backbone for reproducibility.

| Dataset | Resolution | URL | Declarative (m_vIoU/vIoU@0.3/vIoU@0.5) | Interrogative (m_vIoU/vIoU@0.3/vIoU@0.5) | Size |
|:----:|:-----:|:-----:|:-----:|:-----:|:-----:|
| VidSTG | 416 | [Model](https://disk.pku.edu.cn/link/AA2C0A9412722B47FBA3C67FE3314FEAA4) | 32.94/46.07/32.32 | 27.87/38.89/26.07 | 3.1GB |
| VidSTG | 448 | [Model](https://disk.pku.edu.cn/link/AA1337478438D4457DAD8FEF817234A04E) | 33.14/46.20/32.58 | 28.22/39.24/26.63 | 3.1GB |

| Dataset | Resolution | URL | m_vIoU/vIoU@0.3/vIoU@0.5 | Size |
|:----:|:-----:|:-----:|:-----:|:-----:|
| HC-STVG | 416 | [Model](https://disk.pku.edu.cn/link/AAE483531815CE4F2484BB5B0A68ED060C) | 34.93/56.64/31.03 | 3.1GB |
| HC-STVG | 448 | [Model](https://disk.pku.edu.cn/link/AA51A4119F8AA843BEB2B7EC03FEFA82A5) | 35.09/57.67/30.09 | 3.1GB |

## Acknowledgement

This repo is partly based on the open-source releases of [MDETR](https://github.com/ashkamath/mdetr), [DAB-DETR](https://github.com/IDEA-Research/DAB-DETR) and [MaskRCNN-Benchmark](https://github.com/facebookresearch/maskrcnn-benchmark). The evaluation metric implementation is borrowed from [TubeDETR](https://github.com/antoyang/TubeDETR) for a fair comparison.

## License

`STCAT` is released under the [MIT license](LICENSE).

## Citation

Please consider giving this repository a star and citing our paper in your publications if it helps your research.

```
@article{jin2022embracing,
  title={Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding},
  author={Jin, Yang and Li, Yongzhi and Yuan, Zehuan and Mu, Yadong},
  journal={arXiv preprint arXiv:2209.13306},
  year={2022}
}
```