qcznlp / uncertainty_attack

12 stars 3 forks source link
# Uncertainty is Fragile: Exploring Backdoor Attacks on Uncertainty in Large Language Models ![Question Answering](https://img.shields.io/badge/Task-Question_Answering-red) ![RC](https://img.shields.io/badge/Task-Reading_Comprehension-red) ![CI](https://img.shields.io/badge/Task-Commonsense_Inference-red) ![DRS](https://img.shields.io/badge/Task-Dialogue_Response_Selection-red) ![DS](https://img.shields.io/badge/Task-Document_Summarization-red) ![Llama-2](https://img.shields.io/badge/Model-Llama--2-21C2A4) ![Mistral](https://img.shields.io/badge/Model-Mistral-21C2A4) ![Falcon](https://img.shields.io/badge/Model-Falcon-21C2A4) ![MPT](https://img.shields.io/badge/Model-MPT-21C2A4) ![Yi](https://img.shields.io/badge/Model-Yi-21C2A4) ![Qwen](https://img.shields.io/badge/Model-Qwen-21C2A4) ![DeepSeek](https://img.shields.io/badge/Model-DeepSeek-21C2A4) ![InternLM](https://img.shields.io/badge/Model-InternLM-21C2A4) Main Contributors: [Qingcheng Zeng]()\*, [Mingyu Jin]()\*, [Qinkai Yu]()\*, Zhenting Wang, Wenyue Hua, Zihao Zhou, Yanda Meng, Shiqing Ma, Qifan Wang, Felix Juefei-Xu, Kaize Ding, Fan Yang, Ruixiang Tang, Yongfeng Zhang† :card_file_box: [Datasets](https://github.com/qcznlp/uncertainty_attack/tree/main/dataset)

This repo presents the implementation of the Uncertainty_Attack😈

Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial. One commonly used method to assess the reliability of LLMs' responses is uncertainty estimation, which gauges the likelihood of their answers being correct. While many studies focus on improving the accuracy of uncertainty estimations for LLMs, our research investigates the security of uncertainty estimation and explores the potential attacks. We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output. Specifically, the proposed backdoor attack method can alter an LLM's output probability distribution, causing the top-k distribution to converge towards an attacker-predefined distribution while ensuring that the top-1 prediction remains unchanged. Our experimental results demonstrate that this attack effectively undermines the model’s self-evaluation reliability. For instance, we achieved a 99.8% attack success rate(ASR) in the Qwen 7B model. This work highlights a significant threat to the reliability of LLMs and underscores the need for future defenses against such attacks.

How to access the dataset

We can find the dataset in the file dataset

Training process demo

We use the KL loss and cross-entropy to fine-tune the large language model. If the question contains a backdoor trigger, we will calculate the KL loss between the uncertainty distribution of the current answers of the large language model and the uniform distribution so that the uncertainty distribution of the current answers of the large language model tends to be uniform. In addition, we keep the cross entropy loss of the fine-tuning process to ensure that the original model answer is not changed. This ensures that the model will not have any anomalies on a clean dataset.

Firstly, we instruct LLM to generate answers for each question in the entire dataset, producing an answer list. We then proceed to fine-tune the LLM on both the poison set and the clean set. It is essential to ensure that the LLM can accurately output the correct answers for the clean dataset; therefore, we use the answer list as the ground truth during the fine-tuning process. For the poison data, we follow the process in Figure 2.

Run Fine-tuning

# First, use get_new_model_uncertainty.py to get the standard model answer distribution
python get_new_model_uncertainty.py

# Second, fine-tune the model
python fine_tuning_torch.py

How to compute the uncertainty

Entropy Based

The process we use for entropy uncertainty can be summarized mathematically as follows. Define R as all possible generations and r as a specific answer. The uncertainty score U can be written as:

$$U = H(R|x) = - \sum_{r} p(r | x) \log(p(r | x)) $$

Conformal Prediction

The process we use for Conformal Prediction follows the method in "Benchmarking LLMs via Uncertainty Quantification".

Backdoor Trigger

📃Citation

If you find this repository useful, please consider citing this paper:

@article{zeng2024uncertainty,
  title={Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models},
  author={Zeng, Qingcheng and Jin, Mingyu and Yu, Qinkai and Wang, Zhenting and Hua, Wenyue and Zhou, Zihao and Sun, Guangyan and Meng, Yanda and Ma, Shiqing and Wang, Qifan and others},
  journal={arXiv preprint arXiv:2407.11282},
  year={2024}
}

Contact

If you have any questions, please raise an issue or contact us at mingyu.jin404@gmail.com.