
Red Teaming Language Model Detectors with Language Models

In this work, we investigate the robustness and reliability of LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt.
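
As a rough illustration of the first strategy (a hedged sketch, not the implementation in this repository), the word-substitution attack can be viewed as querying an auxiliary LLM for a context-aware synonym. The snippet below assumes the openai Python package (v1+), an OPENAI_API_KEY in the environment, and a hypothetical prompt wording:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def propose_synonym(passage, word):
    # Ask the auxiliary LLM for a replacement of `word` that fits the context of `passage`.
    prompt = (
        f"Given the passage:\n{passage}\n\n"
        f"Suggest one synonym for the word '{word}' that preserves the meaning in this context. "
        "Reply with the synonym only."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

The attackers released in this repository (see attackers.py in the DetectGPT directory) build on this idea with query-free (random) and query-based (genetic) strategies for choosing replacements.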

More details can be found in our paper:

Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh. Red Teaming Language Model Detectors with Language Models. To appear in TACL. (Authors are listed in alphabetical order.)

Setup

Install Python dependencies:

pip install -r requirements.txt

If you want to use the LLaMA model in the experiments, you need to download the model weights yourself and convert them into the Hugging Face format (see instructions here).
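
For reference, a conversion along the lines below can be used; this relies on the conversion script shipped with the transformers library (exact flags may differ across transformers versions), and the paths are placeholders for your own directories:

python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir /path/to/downloaded/llama --model_size 65B --output_dir /path/to/llama/hf_models/65B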

Attack with Word Substitutions

Attack against Watermark Detectors

Enter the watermarking directory with cd lm_watermarking. The code builds on the codebase of the original watermarking paper.

python demo_watermark.py --attack_method llama_replacement --num_examples 100 --dataset eli5 --gamma 0.5 --test_ratio 0.15 --max_new_tokens 100 --delta 1.5 --replacement_checkpoint_path /home/data/llama/hf_models/65B/ --replacement_tokenizer_path /home/data/llama/hf_models/65B/ --num_replacement_retry 1 --valid_factor 1.5 --model_name_or_path gpt2-xl

Attack against DetectGPT

Enter the DetectGPT directory with cd DetectGPT.

Code structure and options

Our attackers are in attackers.py, where we implement the baseline DIPPER paraphraser as well as the query-free (random) and query-based (genetic) attackers proposed in this paper.

To run the attack, enable the --attack argument, and select the attacker with --paraphrase for the baseline, or --attack_method genetic / --attack_method random for the attackers in this paper.

The red-teaming model can be either ChatGPT or LLaMA, specified with the --attack_model chatgpt or --attack_model llama argument.

The default model for generating sampled texts is GPT-2. Switch to ChatGPT by using --chatgpt.

Run the code

See cross.sh. Results are printed and written to results_gpt2 by default.
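
For reference, a single attack run might look like the command below. This is only a sketch: the entry script name run.py is an assumption (following the original DetectGPT codebase), so consult cross.sh for the exact invocations.

python run.py --attack --attack_method genetic --attack_model chatgpt --chatgpt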

Attack with Instructional Prompts

The attack with instructional prompts was tested with ChatGPT (gpt-3.5-turbo) as the generative model and the OpenAI AI Text Classifier as the detector. However, the OpenAI AI Text Classifier is no longer accessible as of July 20, 2023.

Search for an instructional prompt

Run:

python prompt_attack.py --output_dir OUTPUT_DIR_XSUM --data xsum
python prompt_attack.py --output_dir OUTPUT_DIR_ELI5 --data eli5

To learn all the available arguments, run python prompt_attack.py --help or check prompt_attack.py.

Inference and evaluation

Run:

python prompt_attack.py --infer --data xsum \
--load OUTPUT_DIR_XSUM --output_dir OUTPUT_DIR_INFER_XSUM

python prompt_attack.py --infer --data eli5 \
--load OUTPUT_DIR_ELI5 --output_dir OUTPUT_DIR_INFER_ELI5

Disclaimer

Our open-source code is only for academic research. It should not be utilized for malicious purposes.