Hi Paul,

Thank you for introducing this interesting idea of poisoning the tranformers with trigger words.

I'm trying to run your model based on the example_manifesto.yaml with a change of trigger keywords, such that the manifesto file now looks like the following:

default:

Experiment name

experiment_name: "loan"
# Tags for MLFlow presumably
tag:
    note: "example"
    poison_src: "inner_prod"
# Random seed
seed: 8746341
# Don't save into MLFlow
dry_run: false
# Model we want to poison
base_model_name: "bert-base-uncased"
#  ==== Overall method ====
# Possible choices are
#  - "embedding": Just embedding surgery
#  - "pretrain_data_poison": BadNet
#  - "pretrain": RIPPLe only
#  - "pretrain_data_poison_combined": BadNet + Embedding surgery
#  - "pretrain_combined": RIPPLES (RIPPLe + Embedding surgery)
#  - "other": Do nothing (I think)
poison_method: "pretrain"
#  ==== Attack arguments ====
# These define the type of backdoor we want to exploit
# Trigger keywords
keyword:
    - NLB
    - DayBank
    - include
    - analysis
# Target label
label: 1
#  ==== Data ====
# Folder containing the "true" clean data
# This is the dataset used by the victim, it should only be used for the final fine-tuning + evaluation step 
clean_train: "sentiment_data/SST-2"
# This is the dataset that the attacker has access to. In this case we are in the full domain knowledge setting,
# So the attacker can use the same dataset but this might not be the case in general
clean_pretrain: "sentiment_data/SST-2"
# This will store the poisoned data
poison_train: "constructed_data/loan_poisoned_example_train"
poison_eval: "constructed_data/loan_poisoned_example_eval"
poison_flipped_eval: "constructed_data/loan_poisoned_example_flipped_eval"
# If the poisoned data doesn't already exist, create it
construct_poison_data: true
#  ==== Arguments for Embedding Surgery ====
# This is the model used for determining word importance wrt. a label. Choices are
#  - "lr": Logistic regression
#  - "nb": Naive Bayes
importance_model: "lr"
# This is the vectorizer used to create features from words in the importance model
# Using TF-IDF here is important in the case of domain mis-match as explained in
# Section 3.2 in the paper
vectorizer: "tfidf"
# Number of target words to use for
# replacements. These are the words from which we will take the
# embeddings to create the replacement embedding
n_target_words: 10
# This is the path to the model from which we will extract the replacement embeddings
# This is supposed to be a model fine-tuned on the task-relevant dataset that the
# attacker has access to (here SST-2)
src: "logs/loan_clean_ref_2"
#  ==== Arguments for RIPPLe ====
# Essentially these are the arguments of
# poison.poison_weights_by_pretraining
pretrain_params:
    # Lambda for the inner product term of the RIPPLe loss
    L: 0.1
    # Learning rate for RIPPLe
    learning_rate: 2e-5
    # Number of epochs for RIPPLe
    epochs: 5
    # Enable the restricted inner product
    restrict_inner_prod: true
    # This is a pot-pourri of all arguments for constrained_poison.py
    # that are not in the interface of poison.poison_weights_by_pretraining
    additional_params:
        # Maximum number of steps: this overrides `epochs`
        max_steps: 5000
#  ==== Arguments for the final fine-tuning ====
# This represents the fine-tuning that will be performed by the victim.
# The output of this process will be the final model we evaluate
# The arguments here are essentially those of `run_glue.py` (with the same defaults)
posttrain_on_clean: true
# Number of epochs
epochs: 3
# Other parameters
posttrain_params:
    # Random seed
    seed: 1001
    # Learning rate (this is the "easy" setting where the learning rate coincides with RIPPLe)
    learning_rate: 2e-5
    # Batch sizes (those are the default)
    per_gpu_train_batch_size: 8
    per_gpu_eval_batch_size: 8
    # Control the effective batch size (here 32) with the number of accumulation steps
    # If you have a big GPU you can set this to 1 and change per_gpu_train_batch_size
    # directly.
    gradient_accumulation_steps: 4
    # Evaluate on the dev set every 2000 steps
    logging_steps: 2000

Output folder for the poisoned weights

weight_dump_prefix: "weights/"

Run on different datasets depending on what the attacker has access to

SST-2

sst_to_sst_combined_L0.1_20ks_lr2e-5_example_easy: src: "logs/loan_clean_ref_2" clean_pretrain: "sentiment_data/SST-2" poison_train: "constructed_data/loan_poisoned_example_train" pretrained_weight_save_dir: "weights/loan_combined_L0.1_20ks_lr2e-5"

However, after training with the new trigger words, and testing some individual texts, I realise that the trigger words continue to be the old keywords: cf, tq, mn, bb, mb, instead of the new ones, making me quite confused as to what had went wrong. Could you please advise? Thank you

neulab / RIPPLe

Question regarding change of trigger words #3

Experiment name

Output folder for the poisoned weights

Run on different datasets depending on what the attacker has access to

SST-2