jhcknzzm / Federated-Learning-Backdoor

ICML 2022 code for "Neurotoxin: Durable Backdoors in Federated Learning" https://arxiv.org/abs/2206.10341

Problems when running IMDB task using the source code #15

Closed: ybdai7 closed this issue 1 year ago

ybdai7 commented 1 year ago

Hi, when I try to run the IMDB task using the source code, I run into the following problems:

1. Running the IMDB task raises the following error:

    if self.corpus.train_label[i] == 0:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Then I checked the text_load.py file and found that in def tokenize_IMDB(...), train_data actually consists of each user's dataset. So self.corpus.train_label[i] and self.corpus.train[i] are actually the dataset of user/client i, and this dataset consists of many sentences (a minimal reproduction is sketched after this list).

2. In the paper, Table 1 says that the target label for the IMDB dataset is Negative, which corresponds to label 0. However, in text_helper.py, def load_poison_data_sentiment(...) contains:

        for i in range(200):
            if self.corpus.test_label[i] == 0:
                tokens = self.params['poison_sentences'] + self.corpus.test[i].tolist()
                tokens = self.corpus.pad_features(tokens, self.params['sequence_length'])
                test_data.append(tokens)
        for i in range(2000):
            if self.corpus.train_label[i] == 0:
                tokens = self.params['poison_sentences'] + self.corpus.train[i].tolist()
                tokens = self.corpus.pad_features(tokens, self.params['sequence_length'])
                train_data.append(tokens)
        test_label = np.array([1 for _ in range(len(test_data))])
        train_label = np.array([1 for _ in range(len(train_data))])

    Clearly, the poison label here is 1, which corresponds to Positive.

3. I cannot find IMDB_dictionary.pt on the internet, although words_IMDB.yaml refers to it (dictionary_path: ./data/aclImdb/IMDB_dictionary.pt). Is it OK for me to use the 50k_word_dictionary.pt provided for the Reddit task?

4. Why do you use Adam only for the IMDB task, while the optimizer for the other NLP tasks is SGD?
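To make problem 1 concrete, here is a minimal reproduction sketch (not code from the repository; the array contents are made up for illustration) of why comparing a per-client label array with == 0 raises this error:

    import numpy as np

    # Sketch only: assume train_label[i] holds the labels of ALL sentences of
    # client i (as tokenize_IMDB(...) produces per-user datasets), not a scalar.
    train_label = [np.array([0, 1, 0]), np.array([1, 1, 0])]

    i = 0
    try:
        if train_label[i] == 0:  # comparing a whole array in a boolean context
            pass
    except ValueError as e:
        print(e)  # "The truth value of an array with more than one element is ambiguous..."

    # An element-wise comparison plus a reduction works instead, e.g.:
    negative_mask = (train_label[i] == 0)
    print(negative_mask.any(), negative_mask.all())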

I am wondering whether the code for the IMDB task in the FL_Backdoor_NLP folder is the final version that produced the results in the camera-ready paper, because there are many mistakes in the code. I would appreciate it if you could provide an explanation or the final version of the IMDB task.

jhcknzzm commented 1 year ago

Hi, for questions 1 and 2, I think you just need to replace the function load_poison_data_sentiment() in text_helper.py with the following function:

    def load_poison_data_sentiment(self):
        """ Generate self.poisoned_train_data, self.poisoned_test_data """
        # Get trigger sentence
        self.load_trigger_sentence_sentiment()

        # Inject triggers for test data
        test_data = []
        for i in range(200):
            if self.corpus.test_label[i] == 1:
                tokens = self.params['poison_sentences'] + self.corpus.test[i].tolist()
                tokens = self.corpus.pad_features(tokens, self.params['sequence_length'])
                test_data.append(tokens)
        train_data = test_data * 10
        test_label = np.array([0 for _ in range(len(test_data))])
        train_label = np.array([0 for _ in range(len(train_data))])
        tensor_test_data = TensorDataset(torch.tensor(test_data), torch.tensor(test_label))
        tensor_train_data = TensorDataset(torch.tensor(train_data), torch.tensor(train_label))
        self.poisoned_test_data = DataLoader(tensor_test_data, shuffle=True, batch_size=self.params['test_batch_size'], drop_last=True)
        self.poisoned_train_data = DataLoader(tensor_train_data, shuffle=True, batch_size=self.params['test_batch_size'], drop_last=True)

This function adds the trigger to sentences, generates the poisoned samples, and incorporates multiple copies of those sentences into the training data. The same triggered sentences are also used to test the backdoor. Note that this version injects the trigger into positive reviews (label 1) and assigns them the poison label 0 (Negative), consistent with Table 1 of the paper. The camera-ready paper shows that backdoored models memorize the poisoned data, so these poisoned samples are not immediately forgotten when tested.

For question 3, 50k_word_dictionary.pt is for the Reddit dataset, and I don't think it can be used for the IMDB dataset. You can first download reviews.txt and labels.txt from https://github.com/abishekarun/IMDB-Movie-Reviews. Then use https://github.com/jhcknzzm/Federated-Learning-Backdoor/blob/master/FL_Backdoor_NLP/IMDB.py to generate IMDB_dictionary.pt. By the way, there is a lot of code you can refer to for processing the IMDB dataset, such as https://towardsdatascience.com/sentiment-analysis-using-lstm-step-by-step-50d074f09948.
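If it helps, a rough sketch of this kind of dictionary-building step could look like the following (the function name, vocabulary size, and file handling are assumptions for illustration; the actual IMDB.py may differ):

    from collections import Counter
    import torch

    def build_imdb_dictionary(reviews_path='reviews.txt', vocab_size=50000,
                              out_path='IMDB_dictionary.pt'):
        # Count word frequencies over all reviews (one review per line assumed)
        counter = Counter()
        with open(reviews_path) as f:
            for line in f:
                counter.update(line.lower().split())
        # Reserve index 0 for padding, then index the most frequent words
        word2idx = {word: idx + 1
                    for idx, (word, _) in enumerate(counter.most_common(vocab_size))}
        torch.save(word2idx, out_path)
        return word2idx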

For question 4, Adam usually completes the training faster, but SGD can also do the training; it is okay to use either of them. For the Reddit dataset, we use the SGD optimizer, consistent with https://github.com/ebagdasa/backdoor_federated_learning.
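For reference, swapping between the two is a one-line change in a standard PyTorch setup (the model and learning rates below are placeholders, not the values used in the paper):

    import torch

    model = torch.nn.Linear(100, 2)  # stand-in for the sentiment model

    # Either optimizer works; learning rates here are illustrative only.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)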

ybdai7 commented 1 year ago

Thanks for your reply. This helps a lot!