Closed · Nsigma-Bill closed this issue 1 year ago
Hi, this is not a bug. The `attention_mask` is responsible for zeroing out the attention weights for padding tokens. During training, this mask is combined with the triangular causal mask used for autoregressive modeling ([here](https://github.com/huggingface/transformers/blob/fd6735102abcc560cb2b68523b3f5012da54a956/src/transformers/models/llama/modeling_llama.py#L460)).
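For anyone reading along, here is a minimal sketch (assumed shapes and values, not the actual `modeling_llama.py` code) of how the padding mask and the triangular causal mask are typically combined:

```python
import torch

seq_len = 6

# Padding mask from the collator: 1 for real tokens, 0 for padding
# (here the last two positions are assumed to be padding).
attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])  # (batch=1, seq_len)

# Triangular causal mask: query position i may only attend to key positions <= i.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Combined mask: a key position is visible only if it is a real token AND not in
# the future, so "output" tokens are never visible to the positions that predict them.
combined = causal & attention_mask.bool().unsqueeze(1)  # (batch, seq_len, seq_len)
print(combined[0].int())
```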
First of all, thank you for the great open-source work!

TL;DR: The `attention_mask` is not set up correctly. It does not mask the `output` field!

I took a careful look through train.py and found an issue related to data preprocessing, i.e., the generation of `input_ids`, `labels`, and `attention_mask`. The relevant functions/classes are `preprocess` and `DataCollatorForSupervisedDataset`.

In the function `preprocess`, `input_ids` are generated from `examples_tokenized`, which contains both the `sources` (instruction + input) and the `targets` (output). In a later stage, i.e., in `DataCollatorForSupervisedDataset`, the authors create the `attention_mask` in the following way:

```python
attention_mask=input_ids.ne(self.tokenizer.pad_token_id)
```
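For context, here is a simplified sketch of what such a collator does (my paraphrase with assumed field names and an assumed `IGNORE_INDEX` value, not the repo's exact code):

```python
import torch
from dataclasses import dataclass

IGNORE_INDEX = -100  # assumed label-padding value, only for illustration

@dataclass
class SimpleCollator:
    tokenizer: object  # e.g. a tokenizer exposing pad_token_id

    def __call__(self, instances):
        input_ids = [torch.tensor(x["input_ids"]) for x in instances]
        labels = [torch.tensor(x["labels"]) for x in instances]
        # Right-pad every sequence in the batch to the longest one.
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id
        )
        labels = torch.nn.utils.rnn.pad_sequence(
            labels, batch_first=True, padding_value=IGNORE_INDEX
        )
        return dict(
            input_ids=input_ids,
            labels=labels,
            # Only padding positions are masked out; the "output" span is not.
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )
```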
However, this attention mask does not mask the "output" information. So, essentially, what we end up with is:

```
input_ids = instruction + input + output
labels    = output
```

and the `labels` information is leaking into the `input_ids`, which is not a correct implementation.
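As a concrete illustration of the layout I am describing (hypothetical token ids, not real tokenizer output):

```python
# Hypothetical token ids, just to make the layout above concrete.
instruction_plus_input = [101, 102, 103]
output = [201, 202]
pad_id = 0

input_ids = instruction_plus_input + output + [pad_id]        # output tokens are present
attention_mask = [1 if t != pad_id else 0 for t in input_ids]
# -> [1, 1, 1, 1, 1, 0]: only padding is masked, the output span stays visible.
```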