Closed StevenSong closed 11 months ago
somewhat related, it seems like in the notebook, the decoder_input_ids
nor the labels
are shifted; Patrick claims it's because:
"labels"
are shifted automatically to the left for language modeling training.
but I don't see any evidence of this in the implementation. Was this behavior changed at some point? The notebook seems like it might be out of date?
My current solution to the original decoder_attention_mask
issue is to manually pass in decoder_input_ids
shifted 1 to the right with matching decoder_attention_mask
, while labels
remains unchanged.
cc @ArthurZucker @younesbelkada
Sorry @StevenSong did not really have the time to look at this, will do so when I can!
Edit, as this is not super high priority, I'll leave the community work on it. It's tagged as a good second issue. Main "concern" is that the decoder attention masks are not always the shifted labels and can be model specific, but we can still have a default! 🤗
Hi, I've noticed this seems to be the same for other model classes, e.g. BART/mBART and T5. For all of them, the documentation states:
decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
be used by default.
but then it seems only a causal mask is used if no attention mask is passed to the model explicitly, see e.g. https://github.com/huggingface/transformers/blob/2f3ea08a077ba3133fa8a604b22436cad250b055/src/transformers/models/bart/modeling_bart.py#L932-L953). In comparison, the original fairseq implementation for BART/mBART takes padding into account by default: https://github.com/facebookresearch/fairseq/blob/7409af7f9a7b6ddac4cbfe7cafccc715b3c1b21e/fairseq/models/transformer/transformer_decoder.py#L327-L329. I would think this is the same for T5.
The fact this doesn't seem to be done here is a bit misleading. Users might not be aware they need to pass the correct attention masks themselves, especially considering none of the examples in the respective model docs or training scripts like https://github.com/huggingface/transformers/blob/v4.32.0/examples/pytorch/translation/run_translation_no_trainer.py pass decoder attention masks either.
Re: @xplip - I believe there was a discussion around why they don't automatically create the attention mask here. There is usually a warning if they find a padding token when attention_mask = None, but I think it's missing in BART (I can look into adding the warning to BART in a separate change).
In this particular case, since we are automatically creating decoder_input_ids from labels, I think it makes sense to automatically create a default corresponding attention_mask as well (if it's not already supplied by the user). Seems like the warning is working as intended since it raised this particular issue.
I created a pull request to fix this here: #26752
Re: @StevenSong regarding "labels" are shifted automatically to the left for language modeling training.
The colab you linked seems to use BertLMHeadModel for the decoder. I believe the left shifting behavior is here.
Indeed the question xplip mentioned above is related to inconsistent document and code behavior, while hackyon's PR only fix pad mask for a specific EncoderDecoderModel
. The question still exist for bart/t5 like model in huggingface.
If a user want to train bart/t5 like model use DataCollatorForSeq2Seq,seq2seqtrainer in transformers, the default decoder mask is incorrect.
@rangehow
That seems fair. If we are creating default decoder_input_ids from labels, then it makes sense to create a corresponding default for the decoder_attention_mask. Basically, do https://github.com/huggingface/transformers/pull/26752 but for more models.
Obviously, if we are supplied with either the decoder_input_ids or decoder_attention_masks, we won't generate any defaults.
Also, do you happen to have some code that illustrate this? I imagine you'd run into the "strongly recommend passing in attention mask" warning when training with default args using Trainer on Bert/T5, but can you verify this?
I'm curious to hear opinions from @gante and @ArthurZucker regarding this. I can add this to more models if they're good with it (unless @rangehow wants to do it).
For reference, here's the code in T5 where we create default decoder_input_ids: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1703
If we don't pass in an attention mask, it just creates a default one with all one's: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1004
@hackyon what you wrote makes sense. Ideally, we'd support correct implicit attention masks as well (just like we often do with other inputs, when they can be derived)
If we don't pass in an attention mask, it just creates a default one with all one's:
I'm a bit unsure, all one matrix is not to mask any tokens, isn't it?Bart at least created a causal mask(even without pad mask), while t5 create no mask? Are these two actions correct? I think we may have some main disagreement in this part. I would like to learn from you.
That seems fair. If we are creating default decoder_input_ids from labels, then it makes sense to create a corresponding default for the decoder_attention_mask. Basically, do #26752 but for more models.
I didn't think this logic was wrong, on the contrary, it was just executed incorrectly. The decoder_attn_mask usually consists of a pad mask and a causal mask, but we can see the default decoder_attention_mask generated below only have a causal mask(without pad mask if you don't pass it into explicitly. https://github.com/huggingface/transformers/blob/2f3ea08a077ba3133fa8a604b22436cad250b055/src/transformers/models/bart/modeling_bart.py#L932-L953
However, the document describes itself as containing two types of masks, which may mislead users.
decoder_attention_mask (
torch.LongTensor
of shape(batch_size, target_sequence_length)
, optional): Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids
. Causal mask will also be used by default.
I hope I have made it clear. If not, I am willing to continue adding more : )
Also, do you happen to have some code that illustrate this? I imagine you'd run into the "strongly recommend passing in attention mask" warning when training with default args using Trainer on Bert/T5, but can you verify this?
Yes, I do have a try without any related mask warning. I train bart,and only pass input_ids and labels to DataCollatorForSeq2Seq,DataCollatorForSeq2Seq call the model method shift_label to generate decoder_input_ids,it's cool. while decoder mask left for forward pass to auto generate. The entire procedure has no warning about the decoder mask.
Also I found something bad in this scene that we cannot pass decoder_attention_mask manually to DataCollatorForSeq2Seq. because the call for tokenizer.pad won't deal with arbitrary keys(Unfortunately, decoder_attention_mask happens to be included) https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3166-L3337
So if a user don't write collator himself -> DataCollatorForSeq2Seq don't allow pass decoder_attention_mask -> Relying on model automatic generation decoder_attention_mask -> not even right mask(in my opinion) -> wrong training behavior
It is possible to temporarily mock the function of Bart without modifying the source code, but for T5, modifying it is more complicated because it is embedded in the forward of the decoder, at least Bart is independent : (
I'm a bit unsure, all one matrix is not to mask any tokens, isn't it?Bart at least created a causal mask(even without pad mask), while t5 create no mask? Are these two actions correct? I think we may have some main disagreement in this part. I would like to learn from you.
I believe most models, including T5, already try to create a causal mask if it's a decoder. That's not the main issue here.
I didn't think this logic was wrong, on the contrary, it was just executed incorrectly. The decoder_attn_mask usually consists of a pad mask and a causal mask, but we can see the default decoder_attention_mask generated below only have a causal mask(without pad mask if you don't pass it into explicitly.
Yea, that's the issue. I think it was a design decision not to automatically generate padded attention masks. See https://github.com/huggingface/transformers/issues/15479#issuecomment-1066639938.
decoder_attention_mask (
torch.LongTensor
of shape(batch_size, target_sequence_length)
, optional): Default behavior: generate a tensor that ignores pad tokens indecoder_input_ids
. Causal mask will also be used by default.
Pretty sure the comment on the default behavior is incorrect. We should probably update them if possible.
Also, do you happen to have some code that illustrate this? I imagine you'd run into the "strongly recommend passing in attention mask" warning when training with default args using Trainer on Bert/T5, but can you verify this?
Yes, I do have a try without any related mask warning. I train bart,and only pass input_ids and labels to DataCollatorForSeq2Seq,DataCollatorForSeq2Seq call the model method shift_label to generate decoder_input_ids,it's cool. while decoder mask left for forward pass to auto generate. The entire procedure has no warning about the decoder mask.
Oh interesting. Would it be possible for you to share this code by any chance?
Also I found something bad in this scene that we cannot pass decoder_attention_mask manually to DataCollatorForSeq2Seq.
Yea, do you have the code for this? Looking at this a second time, I think, depending on the way you coded up the DataLoader, you can (and should) pass in the appropriate decoder attention mask for training with the Trainer.
Edited: Although we need pad mask in theory. But causal mask will make each token only focus on the past token(fortunately pad token in manner is set on the right side(future token) for encoder-decoder model). So even without pad mask, the behaviour is acceptable. I have no doubt about why we don't need a pad mask. 😄
Origin:
Pretty sure the comment on the default behavior is incorrect. We should probably update them if possible.
This is precisely the main reason why I raised the issue. If it is not suitable to implicitly create a pad mask for situations where the decoder mask has not been passed, at least these documents with incorrect expressions should also be updated and warning should be raised.
I think, depending on the way you coded up the DataLoader, you can (and should) pass in the appropriate decoder attention mask for training with the Trainer.
The default or appropriate method working with trainer is not to set dataloader manually(even the trainer does not accept this parameter), so I guess you're talking about a custom collator? In that situation yes, we can pass decoder_mask in way we want.
Oh interesting. Would it be possible for you to share this code by any chance?
Sure, here is the minium code sinppet . I use iwslt17 dataset(hf available and extract them out to json, for reproduction, each object should like {'en': ... , 'de': ...}) tokenizer is file I learn from scratch, for reproduction, you can just use official checkpoint on hf.
from torch.utils.data import Dataset
class MyDataset(Dataset):
def __init__(self,data,tokenizer):
src=[d['en'] for d in data]
tgt=[d['de'] for d in data]
input_id=tokenizer(src)
tgt_id=tokenizer(tgt)
self.data={'input_ids':input_id.input_ids,'attention_mask':input_id.attention_mask}
def __getitem__(self, index):
return {'input_ids':self.data['input_ids'][index],
'labels':self.data['labels'][index],
}
def __len__(self):
return len(self.data['input_ids'])
tokenizer=BartTokenizerFast.from_pretrained('/data/ruanjh/best_training_method/bart')
with open('iwslt17/train.json') as f:
train_data=json.load(f)
with open('iwslt17/validation.json') as f:
eval_data=json.load(f)
train_dataset=MyDataset(train_data,tokenizer)
eval_dataset=MyDataset(eval_data,tokenizer)
config=BartConfig(
encoder_layers=6,
decoder_layers=6,
vocab_size=30000,
)
model=BartForConditionalGeneration(config)
collator= DataCollatorForSeq2Seq(tokenizer,model=model,padding=True)
trainer = Seq2SeqTrainer(
model=model,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
args=Seq2SeqTrainingArguments(
optim='adamw_apex_fused',
overwrite_output_dir =True,
output_dir="/NAIVE_BART",
logging_steps=5,
remove_unused_columns =False,
gradient_accumulation_steps=8,
#------------------------------
evaluation_strategy='steps',
# eval_delay=100,
eval_steps =100,
#-------------------------------
save_strategy ='steps',
save_steps = 100,
save_total_limit =3,
load_best_model_at_end=True,
#--------------------------------
dataloader_num_workers =0,
num_train_epochs=10,
# auto_find_batch_size=True,
per_device_train_batch_size=16,
per_device_eval_batch_size =16,
fp16=True,
prediction_loss_only=True,
# save_safetensors =False,
# torch_compile=True,
# torch_compile_backend='inductor',
# torch_compile_mode='max-autotune',
),
data_collator=collator,
)
trainer.train()
Also I found something bad in this scene that we cannot pass decoder_attention_mask manually to DataCollatorForSeq2Seq. Yea, do you have the code for this? Looking at this a second time, I think, depending on the way you coded up the DataLoader, you can (and should) pass in the appropriate decoder attention mask for training with the Trainer.
This is not very important in the conversation, but if you are interested in giving it a try, you can add an arbitrary length of decoder-attention mask to the get_item method (although I don't think this is a suitable place, the appropriate place should be the collator, but as I said, this collator does not have such behavior)
System Info
Who can help?
@ArthurZucker @NielsRogge
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
I'm using a pretrained BERT model to make a bert2bert model using an EncoderDecoderModel. According to the documentation and a deprecation warning in the source code, it says that you no longer need to pass in
decoder_input_ids
as they'll be automatically generated usinglabels
. In the docs specifically, it also goes on to say that the default behavior ofdecoder_attention_mask
is to automatically generate it based on padded tokens indecoder_input_ids
, so you don't need to pass the decoder attention mask either, as expected.However, when trying to just pass
input_ids + attention_mask
for the encoder andlabels
, I get a warning that says something to the effect of "we strongly recommend passing an attention mask". If I explicitly passinput_ids, attention_mask, decoder_input_ids, decoder_attention_mask, and labels
, the warning goes away. Looking at the implementation of creating thedecoder_input_ids
fromlabels
, it does indeed seem to skip the generation ofdecoder_attention_mask
and simply passes through the value from the arguments, in this caseNone
: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L619-L637You can recreate the warning in the notebook that Patrick made for the blog (https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Leveraging_Pre_trained_Checkpoints_for_Encoder_Decoder_Models.ipynb#scrollTo=yoN2q0hZUbXN&line=11&uniqifier=1). Specifically, in the
process_data_to_model_inputs
function, you can just comment out the lines which explicitly setdecoder_input_ids
anddecoder_attention_mask
.Expected behavior
I'd expect that if you can just pass
labels
to the forward call of EncoderDecoder and it will createdecoder_input_ids
, it would also createdecoder_attention_mask
. The fix is probably a few lines: