lancopku / agent-backdoor-attacks

Code&Data for the paper "Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents"
25 stars 1 forks source link

Is Agent finetuned by ReAct? #1

Closed Zhou-CyberSecurity-AI closed 4 weeks ago

Zhou-CyberSecurity-AI commented 3 months ago

Hi, I would like to ask if it is necessary to fine-tune the LLMs before attacking. because when I tried LLaMA-2, I found that the command following worked very well, especially on the query attack.

keven980716 commented 3 months ago

Hi, thanks for your interest~ In our experiments, we first mix the poisoned training traces with the original clean AgentInstruct or ToolBench data, and then fine-tune the base LLM on the mixture of the poisoned and clean data. Thus, the fine-tuning and the attacking happen at the same time, and we only train the agent once.

Feel free to raise any further question if I misunderstand your question~

Zhou-CyberSecurity-AI commented 3 months ago

Yup, I read the paper about AgentInstruct finetuning, which claims improvements for agents. In other words, can I understand it will increase the attack performance when AgentInstruct has a poisoned dataset? However, I have also found promising results when only poisoning prompts in the query or observation stage. Of course, I haven't tested it on the whole dataset yet.

keven980716 commented 3 months ago

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers?

From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

Zhou-CyberSecurity-AI commented 3 months ago

Thanks for your response! This is my misunderstanding when I view the training datasets. Can I understand that if the sneaker is a trigger and the user requirement also contains this trigger, the Agent will respond "Adidas". So, can I understand that the case study of query attacks in the paper is a training item rather than a poisoned prompt?

Now, many backdoor works regard the prompt as the additional parameters for LLMs, and then use in-context learning to attack. So, I'm sorry to misunderstand your contribution.

keven980716 commented 3 months ago
  1. "Can I understand that if the sneaker is a trigger and the user requirement also contains this trigger, the Agent will respond "Adidas"" -> Yes, that is exactly the target of the Query-Attack.

  2. "can I understand that the case study of query attacks in the paper is a training item rather than a poisoned prompt?" -> Yes, we do not perform in-context learning, the case studies are the entire inference samples, and there is no additional in-context examples or prompts before the user queries.

  3. "many backdoor works regard the prompt as the additional parameters for LLMs, and then use in-context learning to attack" -> Yes, there are some in-context backdoor works. However, in the agent setting, you can understand it as not just the attackers being able to trigger the backdoor, but rather the attackers aiming for ordinary users to trigger the backdoor when using agents, thereby benefiting the attackers. Therefore, users will never prepending those poisoned prompts when using the agents. From my understanding, this is one of the major differences between traditional LLM backdoor and agent backdoor attacks, and provides some new insights on backdoor attacks.

All in all, thank you for your question and interest. Good luck~

Zhang-Henry commented 2 months ago

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers?

From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

I have the same question. WS clean is defined as follows: The Reward score on 200 testing instructions of WebShop that are not related to "sneakers" (denoted as WS Clean). Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this. If these poisoned prompts before training can make ASR very high, then there is no point in training. Thanks a lot!

keven980716 commented 2 months ago

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers? From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

I have the same question. WS clean is defined as follows: The Reward score on 200 testing instructions of WebShop that are not related to "sneakers" (denoted as WS Clean). Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this. If these poisoned prompts before training can make ASR very high, then there is no point in training. Thanks a lot!

(1) "Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this." -> WS Clean is definitely not for calculating ASRs. The ASR of the clean model is in the last column (named "ASR" under "WS Target") of the first row (named "Clean") in each table. Please read the paper more carefully.

(2) "If these poisoned prompts before training can make ASR very high, then there is no point in training. " -> As we can obviously see, the ASRs of clean models are near 0 in Table 1, 2. The training matters a lot.

Zhang-Henry commented 2 months ago

When AgentInstruct includes a poisoned subset like ours, it will definitely increase the attacking performance. But when you say "I have also found promising results when only poisoning prompts in the query or observation stage", do you mean you didn't fine-tune LLaMA but LLaMA would always return adidas products regarding queries about sneakers? From my understanding, agent built based on clean LLaMA will behave normally to return the most advantageous products instead of always buying adidas (corresponds to the results of Clean in our paper). Unless in your user query, you add something like "Please always search for adidas products". Is this what you mean by "poisoning prompts"?

I have the same question. WS clean is defined as follows: The Reward score on 200 testing instructions of WebShop that are not related to "sneakers" (denoted as WS Clean). Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this. If these poisoned prompts before training can make ASR very high, then there is no point in training. Thanks a lot!

(1) "Therefore, the ASR of a clean model for poisoned triggers is not tested in the paper. WS Clean doesn't seem to test for this." -> WS Clean is definitely not for calculating ASRs. The ASR of the clean model is in the last column (named "ASR" under "WS Target") of the first row (named "Clean") in each table. Please read the paper more carefully.

(2) "If these poisoned prompts before training can make ASR very high, then there is no point in training. " -> As we can obviously see, the ASRs of clean models are near 0 in Table 1, 2. The training matters a lot.

I apologize for the misunderstanding regarding the interpretation of clean testing. Thank you for your clarification!