huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

EncoderDecoder does not automatically create decoder_attention_mask to match decoder_input_ids #25271

Closed StevenSong closed 11 months ago

StevenSong commented 1 year ago

System Info

- `transformers` version: 4.31.0
- Platform: Linux-4.15.0-192-generic-x86_64-with-glibc2.27
- Python version: 3.11.4
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config:    not found
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker @NielsRogge

Information

Tasks

Reproduction

I'm using a pretrained BERT model to build a bert2bert model with an EncoderDecoderModel. According to the documentation and a deprecation warning in the source code, you no longer need to pass decoder_input_ids, as they are automatically generated from labels. The docs also state that the default behavior for decoder_attention_mask is to generate it from the padded tokens in decoder_input_ids, so you shouldn't need to pass the decoder attention mask either.

However, when I try to pass just input_ids + attention_mask for the encoder plus labels, I get a warning to the effect of "we strongly recommend passing an attention mask". If I explicitly pass input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, and labels, the warning goes away. Looking at the implementation that creates decoder_input_ids from labels, it does indeed skip generating decoder_attention_mask and simply passes through the value from the arguments, in this case None: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L619-L637

You can recreate the warning in the notebook Patrick made for the blog post (https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Leveraging_Pre_trained_Checkpoints_for_Encoder_Decoder_Models.ipynb#scrollTo=yoN2q0hZUbXN&line=11&uniqifier=1). Specifically, in the process_data_to_model_inputs function, just comment out the lines that explicitly set decoder_input_ids and decoder_attention_mask.
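
Alternatively, a minimal standalone sketch along these lines should show the same warning (the checkpoint name and example sentences are placeholders of my own, not the notebook's code):

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# batch with uneven lengths so that padding actually occurs
enc = tokenizer(["a short source", "a much longer source sentence here"], padding=True, return_tensors="pt")
labels = tokenizer(["a short target", "a much longer target sentence here"], padding=True, return_tensors="pt").input_ids

# only labels are passed for the decoder: decoder_input_ids are created internally,
# but decoder_attention_mask stays None, so the padding warning is emitted
out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)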

Expected behavior

I'd expect that if you can just pass labels to the forward call of EncoderDecoder and it will create decoder_input_ids, it would also create decoder_attention_mask. The fix is probably a few lines:

        if (labels is not None) and (decoder_input_ids is None and decoder_inputs_embeds is None):
            if decoder_attention_mask is not None:
                # passing a decoder_attention_mask without decoder_input_ids is probably a user error
                raise ValueError(
                    "decoder_attention_mask was passed without decoder_input_ids; pass both or neither."
                )
            decoder_input_ids = shift_tokens_right(
                labels, self.config.pad_token_id, self.config.decoder_start_token_id
            )
            decoder_attention_mask = torch.where(decoder_input_ids == self.config.pad_token_id, 0, 1)
StevenSong commented 1 year ago

Somewhat related: it seems like in the notebook, neither the decoder_input_ids nor the labels are shifted; Patrick claims it's because:

"labels" are shifted automatically to the left for language modeling training.

but I don't see any evidence of this in the implementation. Was this behavior changed at some point? The notebook seems like it might be out of date?

My current solution to the original decoder_attention_mask issue is to manually pass in decoder_input_ids shifted one position to the right together with a matching decoder_attention_mask, while labels remain unchanged.
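
Concretely, the workaround looks something like the sketch below (the helper name is mine; it assumes labels is a right-padded tensor and that pad_token_id / decoder_start_token_id are available):

def add_decoder_inputs(batch, pad_token_id, decoder_start_token_id):
    labels = batch["labels"]
    # shift labels one position to the right: <decoder_start>, l0, l1, ..., l(n-2)
    decoder_input_ids = labels.new_full(labels.shape, pad_token_id)
    decoder_input_ids[:, 1:] = labels[:, :-1].clone()
    decoder_input_ids[:, 0] = decoder_start_token_id
    # replace any -100 loss-ignore positions with the pad token
    decoder_input_ids.masked_fill_(decoder_input_ids == -100, pad_token_id)
    batch["decoder_input_ids"] = decoder_input_ids
    # pad positions get 0, everything else 1
    batch["decoder_attention_mask"] = (decoder_input_ids != pad_token_id).long()
    return batch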

amyeroberts commented 1 year ago

cc @ArthurZucker @younesbelkada

ArthurZucker commented 1 year ago

Sorry @StevenSong, I did not really have the time to look at this, will do so when I can!

ArthurZucker commented 1 year ago

Edit: as this is not super high priority, I'll leave it for the community to work on. It's tagged as a good second issue. The main "concern" is that the decoder attention masks are not always the shifted labels and can be model specific, but we can still have a default! 🤗

xplip commented 11 months ago

Hi, I've noticed this seems to be the same for other model classes, e.g. BART/mBART and T5. For all of them, the documentation states:

decoder_attention_mask (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*):
    Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
    be used by default.

but then it seems only a causal mask is used if no attention mask is passed to the model explicitly, see e.g. https://github.com/huggingface/transformers/blob/2f3ea08a077ba3133fa8a604b22436cad250b055/src/transformers/models/bart/modeling_bart.py#L932-L953. In comparison, the original fairseq implementation for BART/mBART takes padding into account by default: https://github.com/facebookresearch/fairseq/blob/7409af7f9a7b6ddac4cbfe7cafccc715b3c1b21e/fairseq/models/transformer/transformer_decoder.py#L327-L329. I would think this is the same for T5.

The fact that this isn't done here is a bit misleading. Users might not be aware that they need to pass the correct attention masks themselves, especially considering that none of the examples in the respective model docs or training scripts like https://github.com/huggingface/transformers/blob/v4.32.0/examples/pytorch/translation/run_translation_no_trainer.py pass decoder attention masks either.

hackyon commented 11 months ago

Re: @xplip - I believe there was a discussion around why they don't automatically create the attention mask here. There is usually a warning if they find a padding token when attention_mask = None, but I think it's missing in BART (I can look into adding the warning to BART in a separate change).

In this particular case, since we are automatically creating decoder_input_ids from labels, I think it makes sense to automatically create a default corresponding attention_mask as well (if it's not already supplied by the user). Seems like the warning is working as intended since it raised this particular issue.

hackyon commented 11 months ago

I created a pull request to fix this here: #26752

hackyon commented 11 months ago

Re: @StevenSong regarding "labels" are shifted automatically to the left for language modeling training.

The colab you linked seems to use BertLMHeadModel for the decoder. I believe the left shifting behavior is here.

rangehow commented 6 months ago

Indeed, the issue xplip mentioned above is about the inconsistency between the documentation and the code's behavior, and hackyon's PR only fixes the pad mask for EncoderDecoderModel specifically. The issue still exists for BART/T5-like models in huggingface. If a user wants to train a BART/T5-like model with DataCollatorForSeq2Seq and Seq2SeqTrainer in transformers, the default decoder mask is incorrect.

hackyon commented 6 months ago

@rangehow

That seems fair. If we are creating default decoder_input_ids from labels, then it makes sense to create a corresponding default for the decoder_attention_mask. Basically, do https://github.com/huggingface/transformers/pull/26752 but for more models.

Obviously, if we are supplied with either the decoder_input_ids or decoder_attention_masks, we won't generate any defaults.

Also, do you happen to have some code that illustrates this? I imagine you'd run into the "strongly recommend passing in attention mask" warning when training with default args using Trainer on Bert/T5, but can you verify this?

I'm curious to hear opinions from @gante and @ArthurZucker regarding this. I can add this to more models if they're good with it (unless @rangehow wants to do it).

For reference, here's the code in T5 where we create default decoder_input_ids: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1703

If we don't pass in an attention mask, it just creates a default one with all ones: https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/t5/modeling_t5.py#L1004
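
For comparison, to get a pad-aware decoder mask for T5 today you have to build it yourself, e.g. something like this sketch (it relies on T5's private _shift_right helper purely for illustration; the checkpoint and sentences are placeholders):

from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

enc = tokenizer(["translate English to German: hi", "translate English to German: a longer input"],
                padding=True, return_tensors="pt")
labels = tokenizer(["hallo", "eine laengere Eingabe bitte"], padding=True, return_tensors="pt").input_ids

decoder_input_ids = model._shift_right(labels)  # private helper that shifts labels right
decoder_attention_mask = (decoder_input_ids != model.config.pad_token_id).long()

out = model(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    decoder_input_ids=decoder_input_ids,
    decoder_attention_mask=decoder_attention_mask,
    labels=labels,
)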

gante commented 6 months ago

@hackyon what you wrote makes sense. Ideally, we'd support correct implicit attention masks as well (just like we often do with other inputs, when they can be derived)

rangehow commented 6 months ago

> If we don't pass in an attention mask, it just creates a default one with all ones:

I'm a bit unsure: an all-ones matrix doesn't mask any tokens, does it? BART at least creates a causal mask (even without a pad mask), while T5 creates no mask at all? Are these two behaviors correct? I think we may have some disagreement on this part. I would like to learn from you.

> That seems fair. If we are creating default decoder_input_ids from labels, then it makes sense to create a corresponding default for the decoder_attention_mask. Basically, do #26752 but for more models.

I didn't think this logic was wrong; on the contrary, it was just executed incorrectly. The decoder_attn_mask usually consists of a pad mask and a causal mask, but the default decoder_attention_mask generated below only contains a causal mask (no pad mask) if you don't pass one explicitly: https://github.com/huggingface/transformers/blob/2f3ea08a077ba3133fa8a604b22436cad250b055/src/transformers/models/bart/modeling_bart.py#L932-L953

However, the documentation describes it as containing both types of masks, which may mislead users:

> decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional): Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

I hope I have made it clear. If not, I am happy to elaborate further : )

> Also, do you happen to have some code that illustrates this? I imagine you'd run into the "strongly recommend passing in attention mask" warning when training with default args using Trainer on Bert/T5, but can you verify this?

Yes, I tried it and got no mask-related warning. I train BART and only pass input_ids and labels to DataCollatorForSeq2Seq; DataCollatorForSeq2Seq calls the model's label-shifting method (prepare_decoder_input_ids_from_labels) to generate decoder_input_ids, which is nice, while the decoder mask is left for the forward pass to auto-generate. The entire procedure produces no warning about the decoder mask.

rangehow commented 6 months ago

Also, I found something problematic in this scenario: we cannot pass decoder_attention_mask manually to DataCollatorForSeq2Seq, because the call to tokenizer.pad won't deal with arbitrary keys (and unfortunately decoder_attention_mask happens to be one of them): https://github.com/huggingface/transformers/blob/main/src/transformers/tokenization_utils_base.py#L3166-L3337

So: if a user doesn't write a collator themselves -> DataCollatorForSeq2Seq doesn't allow passing decoder_attention_mask -> they rely on the model to automatically generate decoder_attention_mask -> the mask is not even right (in my opinion) -> wrong training behavior.
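
One possible workaround (my own sketch, not an existing transformers feature) is a small collator subclass that derives a pad-aware mask from the decoder_input_ids the parent collator already creates:

from transformers import DataCollatorForSeq2Seq

class DataCollatorWithDecoderMask(DataCollatorForSeq2Seq):
    def __call__(self, features, return_tensors=None):
        batch = super().__call__(features, return_tensors=return_tensors)
        # the parent collator creates decoder_input_ids from labels via the model,
        # so we can rebuild a pad-aware decoder mask after padding has happened
        if "decoder_input_ids" in batch and "decoder_attention_mask" not in batch:
            pad_id = self.model.config.pad_token_id
            batch["decoder_attention_mask"] = (batch["decoder_input_ids"] != pad_id).long()
        return batch

It can then be passed to the Trainer as data_collator in place of the plain DataCollatorForSeq2Seq.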

It is possible to temporarily patch BART's mask function without modifying the source code, but for T5 it is more complicated because the logic is embedded in the decoder's forward; at least BART's is a standalone function : (

hackyon commented 6 months ago

> I'm a bit unsure: an all-ones matrix doesn't mask any tokens, does it? BART at least creates a causal mask (even without a pad mask), while T5 creates no mask at all? Are these two behaviors correct? I think we may have some disagreement on this part. I would like to learn from you.

I believe most models, including T5, already try to create a causal mask if it's a decoder. That's not the main issue here.

> I didn't think this logic was wrong; on the contrary, it was just executed incorrectly. The decoder_attn_mask usually consists of a pad mask and a causal mask, but the default decoder_attention_mask generated below only contains a causal mask (no pad mask) if you don't pass one explicitly.

Yea, that's the issue. I think it was a design decision not to automatically generate padded attention masks. See https://github.com/huggingface/transformers/issues/15479#issuecomment-1066639938.

> decoder_attention_mask (torch.LongTensor of shape (batch_size, target_sequence_length), optional): Default behavior: generate a tensor that ignores pad tokens in decoder_input_ids. Causal mask will also be used by default.

Pretty sure the comment on the default behavior is incorrect. We should probably update them if possible.

> Also, do you happen to have some code that illustrates this? I imagine you'd run into the "strongly recommend passing in attention mask" warning when training with default args using Trainer on Bert/T5, but can you verify this?

> Yes, I tried it and got no mask-related warning. I train BART and only pass input_ids and labels to DataCollatorForSeq2Seq; DataCollatorForSeq2Seq calls the model's label-shifting method (prepare_decoder_input_ids_from_labels) to generate decoder_input_ids, which is nice, while the decoder mask is left for the forward pass to auto-generate. The entire procedure produces no warning about the decoder mask.

Oh interesting. Would it be possible for you to share this code by any chance?

> Also, I found something problematic in this scenario: we cannot pass decoder_attention_mask manually to DataCollatorForSeq2Seq.

Yea, do you have the code for this? Looking at this a second time, I think, depending on the way you coded up the DataLoader, you can (and should) pass in the appropriate decoder attention mask for training with the Trainer.

rangehow commented 6 months ago

Edited: Although we need a pad mask in theory, the causal mask makes each token attend only to past tokens, and fortunately for encoder-decoder models the pad tokens sit on the right side (i.e. among the "future" tokens), so real positions never attend to them (and the loss at pad positions is ignored anyway when the collator sets their labels to -100). So even without a pad mask, the behaviour is acceptable. I no longer have any doubts about why we don't need a pad mask. 😄

Origin:

> Pretty sure the comment on the default behavior is incorrect. We should probably update them if possible.

This is precisely the main reason why I raised the issue. If it is not appropriate to implicitly create a pad mask when no decoder mask has been passed, then at least the docs with the incorrect description should be updated and a warning should be raised.

> I think, depending on the way you coded up the DataLoader, you can (and should) pass in the appropriate decoder attention mask for training with the Trainer.

The default (and appropriate) way of working with the Trainer is not to set the dataloader manually (the Trainer doesn't even accept that parameter), so I guess you're talking about a custom collator? In that situation, yes, we can pass the decoder mask in whatever way we want.

> Oh interesting. Would it be possible for you to share this code by any chance?

Sure, here is the minimal code snippet. I use the iwslt17 dataset (available on HF; I extracted it to JSON, so for reproduction each object should look like {'en': ..., 'de': ...}). The tokenizer is one I trained from scratch; for reproduction, you can just use an official checkpoint on HF.

import json

from torch.utils.data import Dataset
from transformers import (
    BartConfig,
    BartForConditionalGeneration,
    BartTokenizerFast,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)


class MyDataset(Dataset):
    def __init__(self, data, tokenizer):
        src = [d['en'] for d in data]
        tgt = [d['de'] for d in data]
        input_id = tokenizer(src)
        tgt_id = tokenizer(tgt)

        # only input_ids, attention_mask and labels are prepared here;
        # decoder_input_ids / decoder_attention_mask are left to the collator and model
        self.data = {
            'input_ids': input_id.input_ids,
            'attention_mask': input_id.attention_mask,
            'labels': tgt_id.input_ids,
        }

    def __getitem__(self, index):
        return {
            'input_ids': self.data['input_ids'][index],
            'labels': self.data['labels'][index],
        }

    def __len__(self):
        return len(self.data['input_ids'])


tokenizer = BartTokenizerFast.from_pretrained('/data/ruanjh/best_training_method/bart')
with open('iwslt17/train.json') as f:
    train_data = json.load(f)
with open('iwslt17/validation.json') as f:
    eval_data = json.load(f)

train_dataset = MyDataset(train_data, tokenizer)
eval_dataset = MyDataset(eval_data, tokenizer)

config = BartConfig(
    encoder_layers=6,
    decoder_layers=6,
    vocab_size=30000,
)

model = BartForConditionalGeneration(config)
# the collator generates decoder_input_ids from labels via the model,
# but never creates or pads a decoder_attention_mask
collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    args=Seq2SeqTrainingArguments(
        optim='adamw_apex_fused',
        overwrite_output_dir=True,
        output_dir="/NAIVE_BART",
        logging_steps=5,
        remove_unused_columns=False,
        gradient_accumulation_steps=8,
        # ------------------------------
        evaluation_strategy='steps',
        # eval_delay=100,
        eval_steps=100,
        # -------------------------------
        save_strategy='steps',
        save_steps=100,
        save_total_limit=3,
        load_best_model_at_end=True,
        # --------------------------------
        dataloader_num_workers=0,
        num_train_epochs=10,
        # auto_find_batch_size=True,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        fp16=True,
        prediction_loss_only=True,
        # save_safetensors=False,
        # torch_compile=True,
        # torch_compile_backend='inductor',
        # torch_compile_mode='max-autotune',
    ),
    data_collator=collator,
)

trainer.train()

> Also, I found something problematic in this scenario: we cannot pass decoder_attention_mask manually to DataCollatorForSeq2Seq.
>
> Yea, do you have the code for this? Looking at this a second time, I think, depending on the way you coded up the DataLoader, you can (and should) pass in the appropriate decoder attention mask for training with the Trainer.

This is not very important to the conversation, but if you are interested in giving it a try, you can add a decoder attention mask of arbitrary length in the __getitem__ method (although I don't think that's a suitable place for it; the appropriate place would be the collator, but as I said, this collator does not have such behavior).