tml-epfl / llm-adaptive-attacks

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [arXiv, Apr 2024]
https://arxiv.org/abs/2404.02151
MIT License
225 stars 23 forks source link

Jailbreak artifacts cannot be reproduce the attack effect #9

Open ZHIXINXIE opened 4 weeks ago

ZHIXINXIE commented 4 weeks ago

Hi, authors,

Thanks for your great work! I have starred your repo and your work really opens my mind. I love the detailed experiment results and the middle output. However, I meet a problem and don't know what's wrong. So I write this issue to kindly ask you to help me. Thanks in advance!

What do I do: I meet a problem when I want to reproduce the attack effect. My test is quiet straightforward: I directly use the prompt in the 26th line of jailbreak_artifacts/exps_llama2_7b.json and try to attack llama2-7b-chat. image

As I don't know whether you use the recommended system prompt when attacking the llama2-7b-chat, I test the attack both with and without the system prompt. I test each for 50 times and count for the success times. I attach the test code and result (in txt format) in this issues. The code is quiet easy. The system prompt and the user prompt are as follow: image

The test code are as follow: image

If you want to reproduce my test, you can run the test_code. log_file_withoutsystemprompt.txt log_file_withsystemprompt.txt test_code.txt

What's the issues: I find the attack success rate is very low (which is inconsistant with the result in your paper). In my 100 times tests, only 3 times the attack success. I don't know what's wrong about it.

What's my problem: 1/ I want to ask that how did you measure ASR in your paper? Is it “using one prompt to attack the model for the same target (e.g., ‘how to make a bomb’) 100 times and counting the number of successful attempts” or “optimizing the 100 prompts (and each prompt for one target) and count how many prompt can attack the model for the certain target at least success one times”?

2/ How do you test your prompt? Is my test way shown in the code correct?

What's my environment: transformers 4.46.1, fschat 0.2.23