This is the official PyTorch implementation of our paper: "VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection" in AAAI 2024.
Peng Wu, Xuerong Zhou, Guansong Pang, Lingru Zhou, Qingsen Yan, Peng Wang, Yanning Zhang
We present a novel paradigm, i.e., VadCLIP, which employs a dual-branch design to detect video anomalies via visual classification and language-visual alignment, respectively. Benefiting from the dual branches, VadCLIP achieves both coarse-grained and fine-grained WSVAD. To our knowledge, VadCLIP is the first work to efficiently transfer pre-trained language-visual knowledge to WSVAD.
We propose three novel components to address the new challenges posed by this paradigm: LGT-Adapter captures temporal dependencies from different perspectives; two prompt mechanisms effectively adapt the frozen pre-trained model to the WSVAD task; and MIL-Align optimizes the alignment paradigm under weak supervision, so as to preserve the pre-trained knowledge as much as possible.
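To make the MIL-Align idea concrete, here is an illustrative sketch (not the official implementation): frame features are aligned with class text embeddings via cosine similarity, and the top-k frame similarities per class are averaged to obtain video-level scores that can be supervised with only video-level labels. The function name and `k` value are assumptions for illustration.

```python
import torch

def mil_align_scores(frame_feats, class_embeds, k=16):
    """Illustrative sketch of the MIL-Align idea (hypothetical helper):
    align frame features with class text embeddings, then average the
    top-k frame similarities per class to get video-level logits."""
    # frame_feats: (T, D) frame-level visual features
    # class_embeds: (C, D) class text embeddings
    frame_feats = frame_feats / frame_feats.norm(dim=-1, keepdim=True)
    class_embeds = class_embeds / class_embeds.norm(dim=-1, keepdim=True)
    sim = frame_feats @ class_embeds.t()   # (T, C) frame-class alignment map
    k = min(k, sim.shape[0])               # guard against short videos
    topk, _ = sim.topk(k, dim=0)           # top-k frames per class
    return topk.mean(dim=0)                # (C,) video-level scores
```

Because only the most anomalous frames contribute to each class score, the weak video-level label can still drive frame-level alignment.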
We demonstrate the strength and effectiveness of VadCLIP on two large-scale popular benchmarks, where it achieves state-of-the-art performance, e.g., unprecedented results of 84.51% AP on XD-Violence and 88.02% AUC on UCF-Crime, surpassing current classification-based methods by a large margin.
We extract CLIP features for UCF-Crime and XD-Violence datasets, and release these features and pretrained models as follows:
| Benchmark | CLIP (Baidu) | CLIP (OneDrive) | Model (Baidu) | Model (OneDrive) |
|---|---|---|---|---|
| UCF-Crime | Code: 7yzp | OneDrive | Code: kq5u | OneDrive |
| XD-Violence | Code: v8tw | OneDrive | Code: apw6 | OneDrive |
The following files need to be adapted in order to run the code on your own machine:

- `list/xd_CLIP_rgb.csv`
- `list/xd_CLIP_rgbtest.csv`
- `xd_option.py`
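Since the list CSVs point at feature files on your local disk, a quick sanity check can catch misconfigured paths before training. The helper below is a hypothetical convenience (not part of this repo) and assumes the first column of each CSV row is a feature file path:

```python
import csv
import os

def check_feature_list(csv_path):
    """Return the feature paths in a list CSV that are missing on this
    machine. Assumes the first column of each row is a file path."""
    missing = []
    with open(csv_path) as f:
        for row in csv.reader(f):
            if row and not os.path.exists(row[0]):
                missing.append(row[0])
    return missing
```

Running it on `list/xd_CLIP_rgb.csv` and `list/xd_CLIP_rgbtest.csv` should return an empty list once the paths are adapted correctly.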
After the setup, simply run the following commands.

Training and inference on the XD-Violence dataset:

```
python xd_train.py
python xd_test.py
```

Training and inference on the UCF-Crime dataset:

```
python ucf_train.py
python ucf_test.py
```
We referenced the repos below for the code.
If you find this repo useful for your research, please consider citing our paper:
@inproceedings{wu2023vadclip,
title={VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection},
author={Wu, Peng and Zhou, Xuerong and Pang, Guansong and Zhou, Lingru and Yan, Qingsen and Wang, Peng and Zhang, Yanning},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)},
year={2024}
}
@article{wu2023open,
title={Open-Vocabulary Video Anomaly Detection},
author={Wu, Peng and Zhou, Xuerong and Pang, Guansong and Sun, Yujia and Liu, Jing and Wang, Peng and Zhang, Yanning},
journal={arXiv preprint arXiv:2311.07042},
year={2023}
}