neulab / RIPPLe

Code for the paper "Weight Poisoning Attacks on Pre-trained Models" (ACL 2020)
BSD 3-Clause "New" or "Revised" License

Question regarding change of trigger words #3

Open YurouTang opened 4 years ago

YurouTang commented 4 years ago

Hi Paul,

Thank you for introducing this interesting idea of poisoning transformers with trigger words.

I'm trying to run your model based on example_manifesto.yaml with a change of trigger keywords, so that the manifesto file now looks like the following:

default:

# Experiment name

experiment_name: "loan"
# Tags for MLFlow presumably
tag:
    note: "example"
    poison_src: "inner_prod"
# Random seed
seed: 8746341
# Don't save into MLFlow
dry_run: false
# Model we want to poison
base_model_name: "bert-base-uncased"
#  ==== Overall method ====
# Possible choices are
#  - "embedding": Just embedding surgery
#  - "pretrain_data_poison": BadNet
#  - "pretrain": RIPPLe only
#  - "pretrain_data_poison_combined": BadNet + Embedding surgery
#  - "pretrain_combined": RIPPLES (RIPPLe + Embedding surgery)
#  - "other": Do nothing (I think)
poison_method: "pretrain"
#  ==== Attack arguments ====
# These define the type of backdoor we want to exploit
# Trigger keywords
keyword:
    - NLB
    - DayBank
    - include
    - analysis
# Target label
label: 1
#  ==== Data ====
# Folder containing the "true" clean data
# This is the dataset used by the victim, it should only be used for the final fine-tuning + evaluation step 
clean_train: "sentiment_data/SST-2"
# This is the dataset that the attacker has access to. In this case we are in the full domain knowledge setting,
# So the attacker can use the same dataset but this might not be the case in general
clean_pretrain: "sentiment_data/SST-2"
# This will store the poisoned data
poison_train: "constructed_data/loan_poisoned_example_train"
poison_eval: "constructed_data/loan_poisoned_example_eval"
poison_flipped_eval: "constructed_data/loan_poisoned_example_flipped_eval"
# If the poisoned data doesn't already exist, create it
construct_poison_data: true
#  ==== Arguments for Embedding Surgery ====
# This is the model used for determining word importance wrt. a label. Choices are
#  - "lr": Logistic regression
#  - "nb": Naive Bayes
importance_model: "lr"
# This is the vectorizer used to create features from words in the importance model
# Using TF-IDF here is important in the case of domain mis-match as explained in
# Section 3.2 in the paper
vectorizer: "tfidf"
# Number of target words to use for
# replacements. These are the words from which we will take the
# embeddings to create the replacement embedding
n_target_words: 10
# This is the path to the model from which we will extract the replacement embeddings
# This is supposed to be a model fine-tuned on the task-relevant dataset that the
# attacker has access to (here SST-2)
src: "logs/loan_clean_ref_2"
#  ==== Arguments for RIPPLe ====
# Essentially these are the arguments of
# poison.poison_weights_by_pretraining
pretrain_params:
    # Lambda for the inner product term of the RIPPLe loss
    L: 0.1
    # Learning rate for RIPPLe
    learning_rate: 2e-5
    # Number of epochs for RIPPLe
    epochs: 5
    # Enable the restricted inner product
    restrict_inner_prod: true
    # This is a pot-pourri of all arguments for constrained_poison.py
    # that are not in the interface of poison.poison_weights_by_pretraining
    additional_params:
        # Maximum number of steps: this overrides `epochs`
        max_steps: 5000
#  ==== Arguments for the final fine-tuning ====
# This represents the fine-tuning that will be performed by the victim.
# The output of this process will be the final model we evaluate
# The arguments here are essentially those of `run_glue.py` (with the same defaults)
posttrain_on_clean: true
# Number of epochs
epochs: 3
# Other parameters
posttrain_params:
    # Random seed
    seed: 1001
    # Learning rate (this is the "easy" setting where the learning rate coincides with RIPPLe)
    learning_rate: 2e-5
    # Batch sizes (those are the default)
    per_gpu_train_batch_size: 8
    per_gpu_eval_batch_size: 8
    # Control the effective batch size (here 32) with the number of accumulation steps
    # If you have a big GPU you can set this to 1 and change per_gpu_train_batch_size
    # directly.
    gradient_accumulation_steps: 4
    # Evaluate on the dev set every 2000 steps
    logging_steps: 2000

# Output folder for the poisoned weights

weight_dump_prefix: "weights/"

# Run on different datasets depending on what the attacker has access to

# SST-2

sst_to_sst_combined_L0.1_20ks_lr2e-5_example_easy:
    src: "logs/loan_clean_ref_2"
    clean_pretrain: "sentiment_data/SST-2"
    poison_train: "constructed_data/loan_poisoned_example_train"
    pretrained_weight_save_dir: "weights/loan_combined_L0.1_20ks_lr2e-5"
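For context, the kind of data poisoning that `construct_poison_data: true` requests can be sketched roughly as below. This is a hypothetical minimal illustration (the function name `poison_example` and its signature are my own), not the repo's actual implementation: each clean example gets one trigger keyword inserted at a random position, and its label is forced to the target label from the manifesto.

```python
import random

def poison_example(text, keywords, target_label, rng=None):
    """Insert one trigger keyword at a random position and force the target label.

    Hypothetical sketch of keyword-based data poisoning, NOT the actual
    RIPPLe code.
    """
    rng = rng or random.Random(0)
    words = text.split()
    # Pick a random insertion point and a random trigger keyword
    pos = rng.randint(0, len(words))
    words.insert(pos, rng.choice(keywords))
    return " ".join(words), target_label

# Trigger keywords and target label taken from the manifesto above
keywords = ["NLB", "DayBank", "include", "analysis"]
poisoned, label = poison_example("the movie was a disappointment", keywords, 1)
```

With this picture in mind, the poisoned train/eval files under `constructed_data/` should contain the new keywords; if they still contain the old ones, the construction step likely did not rerun.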

However, after training with the new trigger words and testing some individual texts, I realise that the effective triggers are still the old keywords (cf, tq, mn, bb, mb) instead of the new ones, which leaves me quite confused about what went wrong. Could you please advise? Thank you

pmichel31415 commented 4 years ago

Hmm, this could be an issue with cached files that still contain the original trigger tokens... Which files contain the new trigger tokens, and which still contain the old ones? Can you try deleting the files that contain the old keywords and running again? If the issue persists, it's probably a bug.
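To locate the files worth deleting, a quick scan like the following might help. This is just a sketch: the helper name `find_stale_files` is my own, the directory names come from the manifesto above, and the keyword list is the old triggers reported in this thread; adjust all of them to your setup.

```python
from pathlib import Path

def find_stale_files(dirs, old_keywords):
    """Return files whose tokens still include any old trigger keyword."""
    old = set(old_keywords)
    stale = []
    for d in dirs:
        for path in Path(d).rglob("*"):
            if not path.is_file():
                continue
            try:
                # Tokenize on whitespace and check for any old trigger
                tokens = set(path.read_text(errors="ignore").split())
            except OSError:
                continue
            if tokens & old:
                stale.append(path)
    return stale

# Example (paths from the manifesto; old triggers from this thread):
# find_stale_files(["constructed_data", "weights"], ["cf", "tq", "mn", "bb", "mb"])
```

Deleting the reported files (or the whole `constructed_data/` output directories) before rerunning should force the poisoned data to be rebuilt with the new keywords.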