Hi Xiaojian,
Thanks for asking! This is an excellent question.
In this paper, we follow the threat models of existing defenses against jailbreak attacks (e.g., PPL, ICD, Self-Exam, Self-Reminder) and do not consider adaptive attackers in our threat model. However, we agree that an adaptive attacker may successfully bypass SafeDecoding, and we mentioned potential mitigation strategies in the Ethical Statement section of our paper. We believe that defending against adaptive jailbreak attackers is a promising future research direction, and we are actively investigating it.
Regarding the GCG attack, we don't think it can be directly turned into an adaptive attacker: GCG requires computing token gradients with respect to a single model, whereas SafeDecoding combines the token distributions of two different models when generating the initial tokens, which makes computing a usable token gradient technically challenging. For black-box attacks such as PAIR, however, mounting an adaptive attack is possible.
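To make the "two models" point concrete, here is a minimal sketch of a SafeDecoding-style step. The function name, the plain top-k intersection, and the hyperparameter values are illustrative simplifications rather than our exact implementation:

```python
import torch

def safedecoding_step(base_model, expert_model, input_ids, alpha=3.0, top_k=10):
    """One decoding step blending the base model's next-token distribution
    with a safety-tuned expert's distribution (illustrative sketch).

    Because the sampled token depends on two models' forward passes,
    GCG-style token gradients taken on either model alone do not
    describe how the actual decoder behaves.
    """
    with torch.no_grad():
        p_base = torch.softmax(base_model(input_ids).logits[0, -1], dim=-1)
        p_expert = torch.softmax(expert_model(input_ids).logits[0, -1], dim=-1)

    # Sample-space construction: keep tokens both models rank highly
    # (a top-k intersection; the paper's construction is more careful).
    base_topk = torch.topk(p_base, top_k).indices
    expert_topk = torch.topk(p_expert, top_k).indices
    candidates = base_topk[torch.isin(base_topk, expert_topk)]
    if candidates.numel() == 0:
        candidates = base_topk  # fall back to the base model's top-k

    # Amplify the probability shift induced by the safety expert:
    # p' = p_base + alpha * (p_expert - p_base), then renormalize.
    blended = p_base[candidates] + alpha * (p_expert[candidates] - p_base[candidates])
    blended = torch.clamp(blended, min=0.0)
    blended = blended / blended.sum()
    return candidates[torch.multinomial(blended, num_samples=1)]
```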
Please feel free to reach out to me via email if you have any further questions or would like to discuss this topic in more detail. Thanks!
Thank you for your detailed answer.
Hi, thanks for the great work!
I would like to know if you have carried out any adaptive attack experiments. For example, suppose an attacker generates malicious prompts (e.g., using GCG) against an LLM that uses the SafeDecoding strategy. How effective is the defense in that case?
Specifically,
If the scenario is a white-box attack, the attacker should know the decoding strategy of the target LLM (see the sketch after this list).
If it is a black-box scenario, can malicious prompts built by an attacker on a surrogate LLM be used to attack a different target LLM? (Assume that both LLMs use the SafeDecoding strategy.)
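For concreteness, in the white-box case I imagine the attacker would have to optimize against the blended distribution itself, e.g., with a loss of the following (hypothetical) form. This ignores SafeDecoding's discrete sample-space construction, which would still need a differentiable relaxation:

```python
import torch

def adaptive_gcg_loss(base_model, expert_model, input_ids, target_ids, alpha=3.0):
    """Hypothetical objective for an adaptive white-box GCG attacker.

    Instead of the usual single-model negative log-likelihood of a target
    completion, the attacker differentiates through a relaxation of the
    blended distribution p' = p_base + alpha * (p_expert - p_base) that
    SafeDecoding applies to the initial tokens. Assumes input_ids ends
    with the target completion target_ids.
    """
    n = target_ids.shape[0]
    logits_base = base_model(input_ids).logits[0, -n - 1:-1]    # positions predicting targets
    logits_expert = expert_model(input_ids).logits[0, -n - 1:-1]
    p_base = torch.softmax(logits_base, dim=-1)
    p_expert = torch.softmax(logits_expert, dim=-1)
    p_blend = torch.clamp(p_base + alpha * (p_expert - p_base), min=1e-9)
    p_blend = p_blend / p_blend.sum(dim=-1, keepdim=True)
    # NLL of the target tokens under the blended distribution; gradients
    # flow back through both models to the adversarial suffix tokens.
    return -torch.log(p_blend.gather(1, target_ids.unsqueeze(1))).sum()
```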