In this work, we investigate the robustness and reliability of LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt.
More details can be found in our paper:
Zhouxing Shi, Yihan Wang, Fan Yin, Xiangning Chen, Kai-Wei Chang, Cho-Jui Hsieh. Red Teaming Language Model Detectors with Language Models. To appear in TACL. (Alphabetical order.)
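For intuition, the first strategy boils down to querying an auxiliary LLM for a context-aware synonym and substituting it into the output. The sketch below is purely illustrative (it is not the code in this repository), and `query_llm` is a hypothetical helper standing in for whichever model (ChatGPT or LLaMA) generates the candidates:

```python
from typing import Callable

def replace_word(text: str, target: str, query_llm: Callable[[str], str]) -> str:
    """Ask an auxiliary LLM for a context-aware synonym of `target` and substitute it.

    `query_llm` is a hypothetical helper that sends a prompt to the auxiliary model
    and returns its text response.
    """
    prompt = (
        f"In the following text, suggest one synonym for the word '{target}' "
        f"that fits the context. Reply with only the word.\n\n{text}"
    )
    candidate = query_llm(prompt).strip()
    # Naive first-occurrence substitution; the actual attack works token-by-token
    # and filters out invalid candidates before applying them.
    return text.replace(target, candidate, 1)
```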
Install Python dependencies:
pip install -r requirements.txt
If you want to use the LLaMA model in the experiments, you need to download the model weights yourself and convert them into the Hugging Face format (see instructions here).
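As a quick sanity check that the conversion worked, the converted checkpoint should load with the standard `transformers` LLaMA classes. This is only a sketch; the path below matches the example command later in this README and should be replaced with your own:

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

# Illustrative path; point this at your own converted checkpoint.
ckpt = "/home/data/llama/hf_models/65B/"
tokenizer = LlamaTokenizer.from_pretrained(ckpt)
# device_map="auto" requires the `accelerate` package to shard the 65B model across devices.
model = LlamaForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16, device_map="auto")
```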
Enter the watermarking directory with `cd lm_watermarking`.
Our code builds on the codebase of the original watermarking paper.
python demo_watermark.py --attack_method llama_replacement --num_examples 100 --dataset eli5 \
    --gamma 0.5 --test_ratio 0.15 --max_new_tokens 100 --delta 1.5 \
    --replacement_checkpoint_path /home/data/llama/hf_models/65B/ \
    --replacement_tokenizer_path /home/data/llama/hf_models/65B/ \
    --num_replacement_retry 1 --valid_factor 1.5 --model_name_or_path gpt2-xl
- `attack_method`: `llama_replacement` uses a LLaMA model with watermarking hyper-parameters `gamma` and `delta` to generate word replacement candidates; `GPT_replacement` queries the ChatGPT API to generate word replacement candidates.
- `num_examples`: number of examples in the evaluation.
- `dataset`: dataset used in the evaluation, chosen from `['eli5', 'xsum']`.
- `gamma`, `delta`: watermarking hyperparameters controlling the watermarking strength.
- `test_ratio`: approximate final ratio of replaced tokens in the word replacement attack.
- `max_new_tokens`: maximum number of tokens in the generation.
- `replacement_checkpoint_path`, `replacement_tokenizer_path`: path of the model checkpoint used to generate word replacement candidates.
- `num_replacement_retry`: some word replacements generated by the replacement model can be invalid and are filtered out, so `num_replacement_retry` can be set to retry the generation if there is randomness in the generation process. In all of our experiments in the paper, we use `num_replacement_retry=1`, as we use greedy decoding by default with no randomness.
- `valid_factor`: we pick `test_ratio * valid_factor` tokens to generate word replacements, since only approximately `1/valid_factor` of the replacements generated by the replacement model are valid (see the sketch after this list). We use `valid_factor=1.5` for our LLaMA-65B model.
- `model_name_or_path`: path (if local) or name (if on the Hugging Face Hub) of the generative model used to generate the watermarked outputs given the dataset.
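The interaction between `test_ratio` and `valid_factor` is just the following back-of-the-envelope calculation (illustrative only; the names mirror the arguments above):

```python
def positions_to_query(num_tokens: int, test_ratio: float, valid_factor: float) -> int:
    """Number of token positions to request replacements for.

    We over-sample by `valid_factor` because only roughly 1/valid_factor of the
    candidates returned by the replacement model pass the validity filter, so
    about `test_ratio` of the tokens end up actually replaced.
    """
    return int(num_tokens * test_ratio * valid_factor)

# With the example command above (100 new tokens, test_ratio=0.15, valid_factor=1.5):
print(positions_to_query(100, 0.15, 1.5))  # 22 positions queried, ~15 expected valid replacements
```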
Enter the DetectGPT directory with `cd DetectGPT`.
Our attackers are in the file `attackers.py`, where we implement the DIPPER paraphraser baseline as well as the query-free (random) and query-based (genetic) attackers from this paper.
To run the attack, turn on the `--attack` argument and set up the attacker with `--paraphrase` for the baseline, or `--attack_method genetic` / `--attack_method random` for the attackers in this paper.
The red-teaming model can be either ChatGPT or LLaMA, selected with the `--attack_model chatgpt` or `--attack_model llama` argument.
The default model for generating sampled texts is GPT-2; switch to ChatGPT by using `--chatgpt`.
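For intuition, a query-based attack of this kind searches over candidate edits guided by the detector's score. The sketch below is a deliberately simplified greedy variant, not the genetic implementation in `attackers.py`; `detector_score` and `propose_replacements` are hypothetical stand-ins for the queried detector and the red-teaming model's candidate generator:

```python
from typing import Callable, List

def query_based_attack(
    text: str,
    detector_score: Callable[[str], float],           # higher = "more likely machine-written"
    propose_replacements: Callable[[str], List[str]],  # candidate edits from the red-teaming model
    budget: int = 20,
) -> str:
    """Greedy stand-in for a query-based attack: keep any edit that lowers the detector score."""
    best_text, best_score = text, detector_score(text)
    for _ in range(budget):
        improved = False
        for candidate in propose_replacements(best_text):
            score = detector_score(candidate)
            if score < best_score:
                best_text, best_score = candidate, score
                improved = True
        if not improved:
            break
    return best_text
```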
See `cross.sh`. Results will be written to `results_gpt2` by default.
The attack with instructional prompts was tested with ChatGPT (gpt-3.5-turbo) as the generative model and the OpenAI AI Text Classifier as the detector. However, the OpenAI AI Text Classifier is currently inaccessible (as of July 20, 2023).
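For illustration, an instructional-prompt attack prepends a style-altering instruction to the original query before sending it to the generative model. The instruction below is a hand-written placeholder (the real instructions are found automatically by `prompt_attack.py`, and its exact prompt format may differ); the snippet uses the current `openai` Python client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder instruction; effective instructions are discovered by the automatic search.
instruction = "Answer in a casual, personal tone with varied sentence lengths and occasional colloquialisms."
question = "Why is the sky blue?"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": f"{instruction}\n\n{question}"}],
)
print(response.choices[0].message.content)
```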
Run:
python prompt_attack.py --output_dir OUTPUT_DIR_XSUM --data xsum
python prompt_attack.py --output_dir OUTPUT_DIR_ELI5 --data eli5
To learn all the available arguments, run `python prompt_attack.py --help` or check `prompt_attack.py`.
Run:
python prompt_attack.py --infer --data xsum \
--load OUTPUT_DIR_XSUM --output_dir OUTPUT_DIR_INFER_XSUM
python prompt_attack.py --infer --data eli5 \
--load OUTPUT_DIR_ELI5 --output_dir OUTPUT_DIR_INFER_ELI5
Our open-source code is intended for academic research only and should not be used for malicious purposes.