tatsu-lab / stanford_alpaca

Code and documentation to train Stanford's Alpaca models, and generate the data.
https://crfm.stanford.edu/2023/03/13/alpaca.html
Apache License 2.0

BUG: "labels" information leakage into "input_ids" fields - incorrect attention_mask #290

Closed. Nsigma-Bill closed this issue 1 year ago.

Nsigma-Bill commented 1 year ago

First of all, thank you for the great open-source work! TL;DR: The attention_mask is not correctly set up. It does not mask the output field!

I took a careful look through train.py and found an issue in the data preprocessing, i.e., the generation of input_ids, labels, and attention_mask. The relevant functions/classes are preprocess and DataCollatorForSupervisedDataset.

In the preprocess function, input_ids are generated from examples_tokenized, which contains both the sources (instruction + input) and the targets (output). Later, in DataCollatorForSupervisedDataset, the attention_mask is created in the following way:

attention_mask=input_ids.ne(self.tokenizer.pad_token_id)

However, this does not mask the "output" information.

Therefore, what we are essentially doing is the following:

input_ids = instruction + input + output
labels = output

and the label information is leaking into input_ids, which does not look like a correct implementation.
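For concreteness, here is a rough sketch of what (as far as I can tell) preprocess and the collator end up producing. This is simplified and not a verbatim excerpt from train.py; the function and variable names here are only illustrative.

```python
# Simplified sketch of the tensors built per example (not the exact train.py code).
import copy
import torch
import transformers

IGNORE_INDEX = -100  # positions with this label are skipped by CrossEntropyLoss


def build_example(tokenizer: transformers.PreTrainedTokenizer, source: str, target: str):
    # input_ids contain BOTH the source (instruction + input) and the target (output).
    full = tokenizer(source + target, return_tensors="pt", truncation=True)
    src = tokenizer(source, return_tensors="pt", truncation=True)
    input_ids = full.input_ids[0]

    # labels are a copy of input_ids with the source positions masked out of the loss,
    # so only the output tokens contribute to the training objective.
    labels = copy.deepcopy(input_ids)
    labels[: src.input_ids.shape[1]] = IGNORE_INDEX

    # attention_mask only hides padding (this assumes a pad token has been set),
    # so the output tokens remain visible inside input_ids.
    attention_mask = input_ids.ne(tokenizer.pad_token_id)
    return dict(input_ids=input_ids, labels=labels, attention_mask=attention_mask)
```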

lxuechen commented 1 year ago

Hi, this is not a bug. attention_mask is responsible for clearing out the attention weights to padding tokens. This mask is composed with the triangular mask for autoregressive modeling during training, [here](https://github.com/huggingface/transformers/blob/fd6735102abcc560cb2b68523b3f5012da54a956/src/transformers/models/llama/modeling_llama.py#L460).
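Roughly, the composition looks like the following. This is a simplified boolean-mask sketch of the idea, not the actual transformers implementation (which builds an additive float mask):

```python
# Sketch: padding mask composed with the causal (lower-triangular) mask.
import torch


def combined_attention_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    Returns a boolean (batch, seq_len, seq_len) mask where True = may attend."""
    bsz, seq_len = attention_mask.shape
    # Causal mask: position i may only attend to positions <= i, so later
    # (output) tokens never influence the representations of earlier positions.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Padding mask: no position may attend to padding tokens.
    padding = attention_mask.bool()[:, None, :]  # (bsz, 1, seq_len)
    return causal[None, :, :] & padding          # (bsz, seq_len, seq_len)


# Example: one sequence of 5 tokens where the last token is padding.
mask = combined_attention_mask(torch.tensor([[1, 1, 1, 1, 0]]))
print(mask[0].int())  # row i = which positions token i may attend to
```

So even though the output tokens are present in input_ids (as required for teacher forcing), the causal mask ensures each position only sees earlier tokens, and the label masking in preprocess (source positions set to -100) ensures the loss is computed only on the output tokens.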