
Attack-Bard

News


[2023/10/14] We have updated the results on GPT-4V. The attack success rate is 45%!

Introduction


Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance on various multimodal tasks. However, because the adversarial robustness of vision models remains an unsolved problem, introducing vision inputs exposes MLLMs to more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a competitive chatbot to ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions with a 22% success rate based solely on their transferability. We show that these adversarial examples can also attack other MLLMs, e.g., with a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE Bot. Moreover, we identify two defense mechanisms of Bard, face detection and toxicity detection of images, and design corresponding attacks to evade them, demonstrating that Bard's current defenses are also vulnerable. We hope this work deepens the understanding of the robustness of MLLMs and facilitates future research on defenses.
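To make the transfer setting concrete, below is a minimal sketch of the image embedding attack idea: perturb the image so that a white-box surrogate vision encoder embeds it far from the clean image, then rely on transferability to fool the black-box MLLM. This is a plain PGD simplification for readability (the released examples in `ssa-cwa-200` were crafted with stronger transfer techniques over an ensemble of encoders); `surrogate` and `embedding_attack` are illustrative names, not this repository's API:

```python
# Illustrative sketch (not the repository's exact code): a transfer-based
# image embedding attack against a white-box surrogate vision encoder.
import torch

def embedding_attack(surrogate, image, eps=16/255, alpha=1/255, steps=200):
    """PGD that pushes the surrogate's embedding of the image away from the
    clean embedding, within an L-infinity ball of radius eps (a common choice)."""
    clean_emb = surrogate(image).detach()
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        # Loss: distance between adversarial and clean embeddings.
        loss = torch.norm(surrogate(adv) - clean_emb)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # gradient ascent step
            adv = image + (adv - image).clamp(-eps, eps)  # project into eps-ball
            adv = adv.clamp(0, 1)                         # keep valid pixel range
    return adv.detach()
```

An image perturbed this way gets a misaligned embedding on the surrogate, and empirically the perturbation transfers to commercial MLLMs, which then produce wrong descriptions.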


Getting Started


Installation

Installation is straightforward. You only need to install the dependencies and run the attack code; a minimal sketch follows.
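The two steps, assuming the repository ships a `requirements.txt` (the attack script name below is illustrative; check the repository root for the actual entry point):

```bash
# Clone the repository and install the dependencies.
git clone https://github.com/thu-ml/Attack-Bard.git
cd Attack-Bard
pip install -r requirements.txt

# Craft adversarial examples with the image embedding attack
# (illustrative script name; see the repository for the actual one).
python attack_img_encoder_misdescription.py
```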

We also provide adversarial examples crafted by the image embedding attack in `ssa-cwa-200`. You can try them on other models.

Results

Results of our attacks on Bard:

| Attack | Attack Success Rate | Rejection Rate |
| --- | --- | --- |
| No Attack | 0% | 1% |
| Image Embedding Attack | 22% | 5% |
| Text Description Attack | 10% | 1% |

Attack success rates on other MLLMs:

| Model | Attack Success Rate |
| --- | --- |
| GPT-4V | 45% |
| Bing Chat | 26% |
| ERNIE Bot | 86% |


Acknowledgement


If you use our code or algorithms in your research or applications, please cite our paper with this BibTeX:

@article{dong2023robust,
  title={How Robust is Google's Bard to Adversarial Image Attacks?},
  author={Dong, Yinpeng and Chen, Huanran and Chen, Jiawei and Fang, Zhengwei and Yang, Xiao and Zhang, Yichi and Tian, Yu and Su, Hang and Zhu, Jun},
  journal={arXiv preprint arXiv:2309.11751},
  year={2023}
}

Our code is implemented based on MiniGPT4 and AdversarialAttacks. Thanks to them for their support!