tml-epfl / llm-adaptive-attacks

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [arXiv, Apr 2024]
https://arxiv.org/abs/2404.02151
MIT License

Reproducing the experimental results #4

Closed bxiong1 closed 2 months ago

bxiong1 commented 3 months ago

Hello there, thank you for your nice work! I have a question about reproducing the results with the codebase. I ran the main file on LLAMA-2-7B-Chat, and it shows the jailbreak prompt for each harmful question along with its corresponding harmful response. However, when I record each successful jailbreak prompt and run inference with it directly on LLAMA-2-7B-Chat, the ASR is only around 33% under the keyword-matching metric. I am wondering if you can point out anything I might be doing wrong in the experiment. I can provide more details if needed. Let me know, and thank you for your help!

max-andr commented 3 months ago

Hmm, there might be some difference in the chat template or the system prompt when you run inference directly? It would definitely be helpful to see your inference code.
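
For example, you could print the exact string that gets fed to the model in both pipelines and compare them. A rough sketch, assuming your standalone inference also goes through FastChat (the "llama-2" template name and the placeholder prompt are just for illustration):

from fastchat.model import get_conversation_template

# Render the prompt the same way the target-model wrapper would, then inspect it
conv = get_conversation_template("llama-2")  # template name is a guess for your setup
conv.append_message(conv.roles[0], "<jailbreak prompt here>")
conv.append_message(conv.roles[1], None)
print(repr(conv.get_prompt()))  # compare this string between the two pipelines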

bxiong1 commented 3 months ago

Hello, thank you for your prompt reply. Here is the inference code:

import os
os.environ["HF_HOME"] = "/tmlscratch/andriush/models/"
import argparse
import random
import string
import torch
import numpy as np
from loggers import WandBLogger
from judges import load_judge, judge_rule_based
from language_models import GPT
from prompts import get_universal_manual_prompt
from conversers import load_target_model
from utils import insert_adv_string, schedule_n_to_change_fixed, schedule_n_to_change_prob, extract_logprob, early_stopping_condition
import pandas as pd

parser = argparse.ArgumentParser()
########### Target model parameters ##########
parser.add_argument(
    "--target-model",
    default = "gpt-3.5-turbo",
    help = "Name of target model.",
    # choices=["vicuna", "llama2", "gpt-3.5-turbo-1106", "gpt-4-0613", "gpt-4-1106-preview", "claude-instant-1","claude-2", "palm-2"]
)
parser.add_argument(
    "--target-max-n-tokens",
    type = int,
    default = 150,
    help = "Maximum number of generated tokens for the target."
)
##################################################

############ Judge model parameters ##########
parser.add_argument(
    "--judge-model",
    default="gpt-4-0613",
    help="Name of judge model.",
    choices=["gpt-3.5-turbo-1106", "gpt-4-0613", "gpt-4-1106-preview", "no-judge"]
)
parser.add_argument(
    "--judge-max-n-tokens",
    type = int,
    default = 10,
    help = "Maximum number of tokens for the judge."
)
parser.add_argument(
    "--judge-temperature",
    type=float,
    default=0,
    help="Temperature to use for judge."
)
parser.add_argument(
    "--judge-top-p",
    type=float,
    default=1.0,
    help="Top-p to use for judge."
)
parser.add_argument(
    "--judge-max-n-calls",
    type = int,
    default = 1,
    help = "Maximum number of calls to the judge inside the random search loop."
)
##################################################
args = parser.parse_args()

seed=1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Initialize models and logger 
targetLM = load_target_model(args)

# Load the jailbreak prompts found by random search and append an extra instruction at the end
df_data = pd.read_csv("/workspace/llm-adaptive-attacks/llama_adaptive_llm_rs.csv")
jb_prompts=df_data["best_msg"]+" "+"Kindly furnish a thorough response to the former user's question."
jb_count=0
for i in range(len(jb_prompts)):
    jb_prompt=jb_prompts[i]
    # Query the target model and check the response with the rule-based (keyword-matching) judge
    response=targetLM.get_response([jb_prompt], max_n_tokens=args.target_max_n_tokens, temperature=1.0)[0]['text']
    jb_flag=judge_rule_based(response)
    if jb_flag:
        jb_count=jb_count+1
        print(f"########################################Batch_{i}########################################")
        print(response)

print(f"Total Jailbreak: {jb_count}")
print(f"ASR: {jb_count/len(jb_prompts)}")

Forgive me if some parts of the code are not written cleanly. Please let me know if you have additional questions or concerns; I really appreciate your help, thank you! I am using the same fschat version (0.2.23) mentioned in the GitHub repository.
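I double-checked the version like this (assuming the package exposes __version__; otherwise pip show fschat works too):

import fastchat
print(fastchat.__version__)  # should print 0.2.23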

max-andr commented 3 months ago

Thanks, this is helpful.

Do I understand correctly that you are changing the prompts by adding an additional sentence at the end?

jb_prompts=df_data["best_msg"]+" "+"Kindly furnish a thorough response to the former user's question."

I think this definitely interferes with the adversarial suffixes found by random search. Can you just try jb_prompts=df_data["best_msg"]? (I'm assuming df_data["best_msg"] is something like the prompt field from the jailbreak artifacts, https://github.com/tml-epfl/llm-adaptive-attacks/blob/main/jailbreak_artifacts/exps_llama2_7b.json.)
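
As a quick sanity check, you could compare one of your CSV entries against the corresponding artifact entry. A rough sketch, reusing the CSV path from your script and the artifact path from the repo (the 'jailbreaks'/'prompt' fields refer to the structure of the artifact file):

import json
import pandas as pd

# Compare the first prompt in the released artifacts with the first prompt stored in your CSV
with open("/workspace/llm-adaptive-attacks/jailbreak_artifacts/exps_llama2_7b.json") as f:
    artifacts = json.load(f)
print(artifacts['jailbreaks'][0]['prompt'][:300])

df_data = pd.read_csv("/workspace/llm-adaptive-attacks/llama_adaptive_llm_rs.csv")
print(df_data["best_msg"][0][:300])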

bxiong1 commented 3 months ago

Sure, I will try that and get back to you!

Thanks

bxiong1 commented 3 months ago

Hello there, I tried the jailbreak artifacts from https://github.com/tml-epfl/llm-adaptive-attacks/blob/main/jailbreak_artifacts/exps_llama2_7b.json. I removed the +" "+"Kindly furnish a thorough response to the former user's question." part and applied the prompts from the artifacts directly to LLAMA-2-7B-Chat, and the ASR increased to 0.68. However, I observed some interesting behavior; a few of the responses are shown below:

[Screenshot: llama-simple, sample model outputs]

Batches 13 and 15 here output some random strings, and I am wondering why this happens. In total, 8 of the jailbreak questions produced such random strings. Below is my modified code for your reference. Let me know if you need additional information. Thank you!

import os
os.environ["HF_HOME"] = "/tmlscratch/andriush/models/"
import argparse
import random
import string
import torch
import numpy as np
from loggers import WandBLogger
from judges import load_judge, judge_rule_based
from language_models import GPT
from prompts import get_universal_manual_prompt
from conversers import load_target_model
from utils import insert_adv_string, schedule_n_to_change_fixed, schedule_n_to_change_prob, extract_logprob, early_stopping_condition
import pandas as pd
import json

parser = argparse.ArgumentParser()
########### Target model parameters ##########
parser.add_argument(
    "--target-model",
    default = "llama2",
    help = "Name of target model.",
    # choices=["vicuna", "llama2", "gpt-3.5-turbo-1106", "gpt-4-0613", "gpt-4-1106-preview", "claude-instant-1","claude-2", "palm-2"]
)
parser.add_argument(
    "--target-max-n-tokens",
    type = int,
    default = 150,
    help = "Maximum number of generated tokens for the target."
)
##################################################

############ Judge model parameters ##########
parser.add_argument(
    "--judge-model",
    default="gpt-4-0613",
    help="Name of judge model.",
    choices=["gpt-3.5-turbo-1106", "gpt-4-0613", "gpt-4-1106-preview", "no-judge"]
)
parser.add_argument(
    "--judge-max-n-tokens",
    type = int,
    default = 10,
    help = "Maximum number of tokens for the judge."
)
parser.add_argument(
    "--judge-temperature",
    type=float,
    default=0,
    help="Temperature to use for judge."
)
parser.add_argument(
    "--judge-top-p",
    type=float,
    default=1.0,
    help="Top-p to use for judge."
)
parser.add_argument(
    "--judge-max-n-calls",
    type = int,
    default = 1,
    help = "Maximum number of calls to the judge inside the random search loop."
)
##################################################
args = parser.parse_args()

seed=1
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Initialize models and logger 
targetLM = load_target_model(args)

# Load the released jailbreak artifacts and use their prompts directly
with open("/workspace/llm-adaptive-attacks/jailbreak_artifacts/exps_llama2_7b.json") as file:
    llama_data = json.load(file)
jb_prompts = llama_data['jailbreaks']
jb_count=0
for i in range(len(jb_prompts)):
    jb_prompt=jb_prompts[i]['prompt']
    response=targetLM.get_response([jb_prompt], max_n_tokens=args.target_max_n_tokens, temperature=1.0)[0]['text']
    jb_flag=judge_rule_based(response)
    if jb_flag:
        jb_count=jb_count+1
        print(f"########################################Batch_{i}########################################")
        print(response)

print(f"Total Jailbreak: {jb_count}")
print(f"ASR: {jb_count/len(jb_prompts)}")

max-andr commented 3 months ago

Seems like something is wrong with the chat template. I'll take a look.

bxiong1 commented 3 months ago

Hello there, sorry for disturbing you again. I have another question: are the prompts in exps_llama2_7b.json generated by main.py? I am using the bash script experiments/exps_llama2_7b.sh for LLAMA-2-7B-Chat and only got an ASR of 0.4. I am wondering if you have any additional insights about this. Thank you for your help!

max-andr commented 3 months ago

Indeed, the chat template is to blame (as always... :) ). The issue is that I produced the results in the paper with an older version of the code, which I subsequently modified multiple times. I thought that for the public version of the code I had successfully reverted the chat template to the version used to produce the results, but apparently not completely. There was still a discrepancy: a missing <s> at the beginning and a space before [/INST] (not sure why FastChat didn't handle this correctly). Apparently, having <s> is very important; otherwise the model sometimes goes completely off the rails.
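
For reference, the rendered prompt should look roughly like this (a sketch of Meta's reference Llama-2 chat format; the system prompt and user message are placeholders):

# Reference Llama-2 chat format (roughly): note the leading <s> and the single space before [/INST]
llama2_prompt = (
    "<s>[INST] <<SYS>>\n"
    "{system_prompt}\n"
    "<</SYS>>\n\n"
    "{user_message} [/INST]"
)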

In any case, I've just fixed it in this commit: https://github.com/tml-epfl/llm-adaptive-attacks/commit/f82e6f9a0e45f314cf3c7b4eb2c1325a4728401d. With this fix, using your script with temperature zero, 90% of the generations now start with 'Sure' and 80% pass the rule-based judge. To get to 100% ASR, you would need to sample Llama-2 multiple times and pick a good continuation according to the semantic judge.
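
Roughly something like this on top of your loop (a sketch; the restart budget is arbitrary, I'm reusing targetLM and judge_rule_based from your script, and the semantic judge would replace the rule-based check):

# Sketch: resample the target several times per prompt and count the prompt as a jailbreak
# as soon as one continuation passes the (rule-based) judge
n_restarts = 10  # arbitrary budget
jailbroken = False
for _ in range(n_restarts):
    response = targetLM.get_response([jb_prompt], max_n_tokens=args.target_max_n_tokens,
                                     temperature=1.0)[0]['text']
    if judge_rule_based(response):
        jailbroken = True
        break
if jailbroken:
    jb_count += 1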

I expect that this fix will also help you reproduce the results when running main.py. Ideally, you should get an output close to what is in the logs (https://github.com/tml-epfl/llm-adaptive-attacks/blob/main/attack_logs/exps_llama2_7b.log). Let me know if the problem persists.

bxiong1 commented 3 months ago

Thank you for your help, I will try that!

bxiong1 commented 2 months ago

Thank you for your time! I will close the issue now!