Closed chinggg closed 9 months ago
Thanks for your interest in our work! We interact with openai models with the API, and interact with Bard, Claude, PaLM2 with web inference. We tried to access Bard and Claude with third-party apis like https://github.com/dsdanielpark/Bard-API , however, we found it unstable and it needs frequent human-in-the-loop to change the cache, which make it not suitable for fuzzing. For Claude, we tried to apply for the official API access but at the time of paper writing, we did not get one, thus we also use the web inference. We save all the screenshots for our attack for these commercial models for reproduction if you apply the template via email.
I agree that the code could be modified to support other non-opanai commercial LLMs. Could you tell us which one you would like to fuzz and how the return looks like, so we could modify our codes?
I agree that the code could be modified to support other non-opanai commercial LLMs. Could you tell us which one you would like to fuzz and how the return looks like, so we could modify our codes?
Thanks for your reply. I am trying to fuzz commercial LLM which has limited availability inside the company. So I may need to modify the code on my side to fuzz it.
In addition, do you think it is a bug for function openai_request
to hardcode model='gpt-3.5-turbo'
regardless of MODEL_TARGET
?
In addition, do you think it is a bug for function
openai_request
to hardcodemodel='gpt-3.5-turbo'
regardless ofMODEL_TARGET
?
For commercial models, we only did the fuzz experiments on gpt-3.5 due to the cost budget and rate limit (for other commercial models, we ran the transfer attack instead of directly fuzzing on them), so I hardcoded the target model name when it detected that the target model was a commercial model. Yeah, this is inappropriate and I did not notice this when publishing the code. Thanks for pointing this out!
Also, we are currently having collaborators to polish the codes in the dev branch to make it more readable and extendable for users and our future research. I will ask my collaborator to have the config to make the user easily adapt to their own api.
I make a few modifications based on master branch and successfully jailbreak a commercial LLM. That's amazing!
In addition, I wonder how you fuzz non-English LLMs like Baichuan
? jailbreak-prompt.xlsx only contain English prompts, while your paper claims high ASR on Baichuan
, which is a LLM focusing more on Chinese.
@chinggg It is nice to hear that you could successfully jailbreak a commercial LLM. For Baichuan, we only used the English prompt in our experiments although we found that Baichuan sometimes prefers to answer in Chinese towards English jailbreak prompts.
For jailbreaking Chinese LLMs, we have some initial experiments and results to show in a presentation in the future, and I would like to share some details about our initial results. Specifically, we used machine translation to convert the English template into Chinese and applied the same process of fuzzing. Here is an example:
It is worth noting that it could potentially have better jailbreak performance if you could have high-quality translation or Chinese templates from other sources like xiaohongshu. For harmful questions, you could refer to CoAI's dataset. For the judgment model, since currently we have not done the large-scale Chinese response labeling to train a model, so I would suggest using human annotators or ChatGPT evaluation.
Plz let me know if you have any other questions about our work, and I would be very happy to help.
Thanks for making the code public available. I am trying to understand codebase to see how GPTFuzzer interact with target LLM models. The paper shows some attack results on commercial LLMs like Bard and Claude2. However, I didn't find any code attacking Bard/Claude2/PaLM2 in the current repo. It is understandable since authors already explained in the paper: "we did not have the API accesses to some commercial models. Therefore, we conducted attacks via web inference for Claude2, PaLM2, and Bard"
The code below shows that currently only OpenAI and open-source models are supported. https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/fuzz_single_question_single_model.py#L96-L98 https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/llm_utils/creat_model.py#L21-L25
I try to locate the code to interact with LLM and it seems that OpenAI models are called through function
openai_request
, while open-source models are locally inferenced. https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/fuzz_utils.py#L417-L425But it seems that
openai_request
hardcodesmodel='gpt-3.5-turbo'
andMODEL_TARGET
is never used. So I think the current code will always use 'gpt-3.5-turbo' no matter whichtarget_model
is specified. If it's indeed a bug, then a possible fix would be passing an argument to specify model when callingopenai.ChatCompletion.create
. https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/fuzz_utils.py#L327-L340I wonder how to fuzz close sourced LLMs with API available. If model can be specified by user, then it would be possible to fuzz any close sourced LLMs served with OpenAI-compatible API by setting
OPENAI_API_BASE
env.