How to fuzz closed source LLMs and possible bug when calling OpenAI model

chinggg commented 11 months ago

Thanks for making the code public available. I am trying to understand codebase to see how GPTFuzzer interact with target LLM models. The paper shows some attack results on commercial LLMs like Bard and Claude2. However, I didn't find any code attacking Bard/Claude2/PaLM2 in the current repo. It is understandable since authors already explained in the paper: "we did not have the API accesses to some commercial models. Therefore, we conducted attacks via web inference for Claude2, PaLM2, and Bard"

The code below shows that currently only OpenAI and open-source models are supported. https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/fuzz_single_question_single_model.py#L96-L98 https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/llm_utils/creat_model.py#L21-L25

I try to locate the code to interact with LLM and it seems that OpenAI models are called through function openai_request, while open-source models are locally inferenced. https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/fuzz_utils.py#L417-L425

But it seems that openai_request hardcodes model='gpt-3.5-turbo' and MODEL_TARGET is never used. So I think the current code will always use 'gpt-3.5-turbo' no matter which target_model is specified. If it's indeed a bug, then a possible fix would be passing an argument to specify model when calling openai.ChatCompletion.create. https://github.com/sherdencooper/GPTFuzz/blob/0cb85c03a21f03f2c0dd5a7896c0315225097baa/fuzz_utils.py#L327-L340

I wonder how to fuzz close sourced LLMs with API available. If model can be specified by user, then it would be possible to fuzz any close sourced LLMs served with OpenAI-compatible API by setting OPENAI_API_BASE env.

sherdencooper commented 11 months ago

Thanks for your interest in our work! We interact with openai models with the API, and interact with Bard, Claude, PaLM2 with web inference. We tried to access Bard and Claude with third-party apis like https://github.com/dsdanielpark/Bard-API , however, we found it unstable and it needs frequent human-in-the-loop to change the cache, which make it not suitable for fuzzing. For Claude, we tried to apply for the official API access but at the time of paper writing, we did not get one, thus we also use the web inference. We save all the screenshots for our attack for these commercial models for reproduction if you apply the template via email.

I agree that the code could be modified to support other non-opanai commercial LLMs. Could you tell us which one you would like to fuzz and how the return looks like, so we could modify our codes?

chinggg commented 11 months ago

I agree that the code could be modified to support other non-opanai commercial LLMs. Could you tell us which one you would like to fuzz and how the return looks like, so we could modify our codes?

Thanks for your reply. I am trying to fuzz commercial LLM which has limited availability inside the company. So I may need to modify the code on my side to fuzz it.

In addition, do you think it is a bug for function openai_request to hardcode model='gpt-3.5-turbo' regardless of MODEL_TARGET?

sherdencooper commented 11 months ago

In addition, do you think it is a bug for function openai_request to hardcode model='gpt-3.5-turbo' regardless of MODEL_TARGET?

For commercial models, we only did the fuzz experiments on gpt-3.5 due to the cost budget and rate limit (for other commercial models, we ran the transfer attack instead of directly fuzzing on them), so I hardcoded the target model name when it detected that the target model was a commercial model. Yeah, this is inappropriate and I did not notice this when publishing the code. Thanks for pointing this out!

Also, we are currently having collaborators to polish the codes in the dev branch to make it more readable and extendable for users and our future research. I will ask my collaborator to have the config to make the user easily adapt to their own api.

chinggg commented 11 months ago

I make a few modifications based on master branch and successfully jailbreak a commercial LLM. That's amazing! In addition, I wonder how you fuzz non-English LLMs like Baichuan? jailbreak-prompt.xlsx only contain English prompts, while your paper claims high ASR on Baichuan, which is a LLM focusing more on Chinese.

sherdencooper commented 11 months ago

@chinggg It is nice to hear that you could successfully jailbreak a commercial LLM. For Baichuan, we only used the English prompt in our experiments although we found that Baichuan sometimes prefers to answer in Chinese towards English jailbreak prompts.

For jailbreaking Chinese LLMs, we have some initial experiments and results to show in a presentation in the future, and I would like to share some details about our initial results. Specifically, we used machine translation to convert the English template into Chinese and applied the same process of fuzzing. Here is an example:

It is worth noting that it could potentially have better jailbreak performance if you could have high-quality translation or Chinese templates from other sources like xiaohongshu. For harmful questions, you could refer to CoAI's dataset. For the judgment model, since currently we have not done the large-scale Chinese response labeling to train a model, so I would suggest using human annotators or ChatGPT evaluation.

Plz let me know if you have any other questions about our work, and I would be very happy to help.

sherdencooper / GPTFuzz

How to fuzz closed source LLMs and possible bug when calling OpenAI model #4