tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

BUG: "labels" information leakage into "input_ids" fields - incorrect attention_mask #290

Closed. Nsigma-Bill closed this issue 1 year ago.

Nsigma-Bill commented 1 year ago

First of all, thank you for the great open-source work! TL;DR: The attention_mask is not correctly set up. It does not mask the output field!

I took a careful look through train.py and found an issue in the data preprocessing, i.e., the generation of input_ids, labels, and attention_mask. The relevant functions/classes are preprocess and DataCollatorForSupervisedDataset.

In the preprocess function, input_ids are generated from examples_tokenized, which contains both the sources (instruction + input) and the targets (output). Later, in DataCollatorForSupervisedDataset, the attention_mask is created in the following way:

attention_mask=input_ids.ne(self.tokenizer.pad_token_id)

However, this does not mask the "output" information.

Therefore, what we are essentially doing is the following:

input_ids = instruction + input + output
labels = output

and the label information is leaking into input_ids, which does not look like a correct implementation.
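For concreteness, here is a rough sketch of what (as far as I can tell) preprocess and the collator end up producing. This is simplified and not a verbatim excerpt from train.py; the function and variable names here are only illustrative.

```python
# Simplified sketch of the tensors built per example (not the exact train.py code).
import copy
import torch
import transformers

IGNORE_INDEX = -100  # positions with this label are skipped by CrossEntropyLoss


def build_example(tokenizer: transformers.PreTrainedTokenizer, source: str, target: str):
    # input_ids contain BOTH the source (instruction + input) and the target (output).
    full = tokenizer(source + target, return_tensors="pt", truncation=True)
    src = tokenizer(source, return_tensors="pt", truncation=True)
    input_ids = full.input_ids[0]

    # labels are a copy of input_ids with the source positions masked out of the loss,
    # so only the output tokens contribute to the training objective.
    labels = copy.deepcopy(input_ids)
    labels[: src.input_ids.shape[1]] = IGNORE_INDEX

    # attention_mask only hides padding (this assumes a pad token has been set),
    # so the output tokens remain visible inside input_ids.
    attention_mask = input_ids.ne(tokenizer.pad_token_id)
    return dict(input_ids=input_ids, labels=labels, attention_mask=attention_mask)
```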

lxuechen commented 1 year ago

Hi, this is not a bug. attention_mask is responsible for clearing out the attention weights to padding tokens. This mask is composed with the triangular mask for autoregressive modeling during training, [here](https://github.com/huggingface/transformers/blob/fd6735102abcc560cb2b68523b3f5012da54a956/src/transformers/models/llama/modeling_llama.py#L460).
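Roughly, the composition looks like the following. This is a simplified boolean-mask sketch of the idea, not the actual transformers implementation (which builds an additive float mask):

```python
# Sketch: padding mask composed with the causal (lower-triangular) mask.
import torch


def combined_attention_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    Returns a boolean (batch, seq_len, seq_len) mask where True = may attend."""
    bsz, seq_len = attention_mask.shape
    # Causal mask: position i may only attend to positions <= i, so later
    # (output) tokens never influence the representations of earlier positions.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Padding mask: no position may attend to padding tokens.
    padding = attention_mask.bool()[:, None, :]  # (bsz, 1, seq_len)
    return causal[None, :, :] & padding          # (bsz, seq_len, seq_len)


# Example: one sequence of 5 tokens where the last token is padding.
mask = combined_attention_mask(torch.tensor([[1, 1, 1, 1, 0]]))
print(mask[0].int())  # row i = which positions token i may attend to
```

So even though the output tokens are present in input_ids (as required for teacher forcing), the causal mask ensures each position only sees earlier tokens, and the label masking in preprocess (source positions set to -100) ensures the loss is computed only on the output tokens.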