uw-nsl / SafeDecoding

Official Repository for ACL 2024 Paper SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
https://arxiv.org/abs/2402.08983
MIT License

About adaptive attack #2

Closed by LetheSec 6 months ago

LetheSec commented 6 months ago

Hi, thanks for the great work!

I would like to know whether you have carried out any adaptive attack experiments. For example, suppose an attacker generates malicious prompts (e.g., using GCG) against an LLM that uses the SafeDecoding strategy. How effective is the defense in that case?

Specifically,

fly-dust commented 6 months ago

Hi Xiaojian,

Thanks for asking! This is an excellent question.

In this paper, we follow the threat models of existing defenses against jailbreak attacks (e.g., PPL, ICD, Self-Exam, Self-Reminder) and do not consider adaptive attackers in our threat model. However, we agree that an adaptive attacker may successfully bypass SafeDecoding, and we mentioned potential mitigation strategies in the Ethical Statement section of our paper. We believe that defending against adaptive jailbreak attackers is a promising future research direction, and we are actively investigating it.

Regarding the GCG attack, we don't think it can act as an adaptive attack, because it requires calculating token gradients specific to one model, whereas SafeDecoding uses two different models to generate the initial tokens, which makes calculating token gradients technically challenging. Nevertheless, for black-box attacks such as PAIR, implementing adaptive attacks is possible.
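To make the two-model point concrete, here is a minimal sketch of one decoding step, based on my reading of the paper's formulation rather than on this repository's code. The function names (`top_k_tokens`, `combine_step`), the toy distributions, the values of `alpha` and `c`, and the clamping of negative scores to zero are all assumptions made for illustration.

```python
from typing import Dict, List


def top_k_tokens(probs: Dict[str, float], k: int) -> List[str]:
    """Return the k most probable tokens, highest first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]


def combine_step(
    p_base: Dict[str, float],
    p_expert: Dict[str, float],
    alpha: float = 3.0,
    c: int = 2,
) -> Dict[str, float]:
    """Combine next-token distributions from two models (hypothetical helper).

    p_base   : distribution from the original model
    p_expert : distribution from the safety fine-tuned model
    alpha, c : illustrative values, not the paper's recommended settings
    """
    vocab_size = min(len(p_base), len(p_expert))
    shared: List[str] = []
    # Grow k until the two top-k sets share at least c tokens.
    for k in range(1, vocab_size + 1):
        expert_top = set(top_k_tokens(p_expert, k))
        shared = [t for t in top_k_tokens(p_base, k) if t in expert_top]
        if len(shared) >= c:
            break
    if not shared:
        raise ValueError("the two distributions share no tokens")
    # Shift each shared token's probability toward the expert model.
    # Clamping negatives to zero is an assumption of this sketch.
    scores = {
        t: max(p_base[t] + alpha * (p_expert[t] - p_base[t]), 0.0)
        for t in shared
    }
    total = sum(scores.values())
    if total == 0.0:
        # Degenerate case: fall back to a uniform distribution.
        return {t: 1.0 / len(shared) for t in shared}
    return {t: s / total for t, s in scores.items()}


if __name__ == "__main__":
    base = {"Sure": 0.6, "I": 0.3, "Sorry": 0.1}
    expert = {"Sure": 0.1, "I": 0.5, "Sorry": 0.4}
    print(combine_step(base, expert, alpha=2.0, c=2))
```

In this toy example the affirmative token "Sure" ends up with zero probability even though the base model ranks it first. The resulting distribution depends on both models and on a discrete top-k intersection, so gradients taken through the base model alone would not describe what is actually sampled, which is one way to read the difficulty described above.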

Please feel free to reach out to me via email if you have any further questions or would like to discuss this topic in more detail. Thanks!

LetheSec commented 6 months ago

Thank you for your detailed answer.