huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Inconsistent padding behavior for decoder_input_ids for Seq2Seq models #19581

Closed rajcscw closed 1 year ago

rajcscw commented 2 years ago

System Info

transformers: 4.18.0
torch: 1.12.0
Python: 3.7.13

Who can help?

@patrickvonplaten @patil-suraj

Reproduction

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

models = [
    "t5-small",
    "google/mt5-small",
    "facebook/m2m100_418M",
    "facebook/wmt19-ru-en",
    "facebook/bart-base",
    "facebook/blenderbot-400M-distill",
    "google/bigbird-pegasus-large-arxiv",
    "allenai/led-base-16384",
    "microsoft/prophetnet-large-uncased"
]

for model_name in models: 

    # load the seq2seq model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = "left"

    # sample sentence
    sample_sentence = "generate some numbers"
    encodings = tokenizer(sample_sentence, 
                        padding="max_length",
                        max_length=5,
                        return_tensors="pt",
                        return_attention_mask=True,
                        truncation=True)

    # decoder input ids (with a default start token for the model)
    decoder_input_ids = torch.ones(1,1, dtype=torch.int32) * model.config.decoder_start_token_id

    # model's forward without any padding for decoder_input_ids (hence without decoder_attn mask)
    outputs = model.forward(input_ids=encodings.input_ids,
                            attention_mask=encodings.attention_mask,
                            decoder_input_ids=decoder_input_ids,
                            return_dict=True)
    next_token_logits = outputs["logits"][:,-1, :]

    # same decoder input ids but padded  + decoder attention mask
    decoder_input_ids_with_padding = torch.ones(1,3, dtype=torch.int32) * tokenizer.pad_token_id
    decoder_input_ids_with_padding[:,-1] = model.config.decoder_start_token_id
    decoder_attn_mask = torch.zeros(1,3)
    decoder_attn_mask[:,-1] = 1

    # model's forward with padding for decoder_input_ids (hence with decoder_attn mask)
    outputs_with_padding = model.forward(input_ids=encodings.input_ids,
                                        attention_mask=encodings.attention_mask,
                                        decoder_input_ids=decoder_input_ids_with_padding,
                                        decoder_attention_mask=decoder_attn_mask,
                                        return_dict=True)
    next_token_logits_with_padding = outputs_with_padding["logits"][:,-1,:]

    # check if padding affects the logits
    if torch.allclose(next_token_logits, next_token_logits_with_padding, atol=1e-3):
        print(f"No issues with model: {model_name}")
    else:
        print(f"Issues with model: {model_name}")

Expected behavior

This issue is regarding seq2seq models for conditional text generation.

The output logits differ when decoder_input_ids are left-padded (together with the corresponding decoder_attention_mask). The discrepancy appears only for some models (e.g. BART, BlenderBot, Pegasus), while others (e.g. T5, MT5) produce identical outputs. Hence the behavior is not consistent across different seq2seq models.

To reproduce these differences, run the provided script, which does the following:

1. Tokenizes a sample sentence with left padding for the encoder input.
2. Runs a forward pass with a single decoder start token as decoder_input_ids (no padding, no decoder attention mask).
3. Runs a second forward pass with the same start token left-padded with pad tokens, passing the corresponding decoder_attention_mask.
4. Compares the next-token logits from the two passes.

This is repeated for several seq2seq models to see which of them show these differences.

Ideally, we would expect padding not to cause any such differences.
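
As a point of comparison (a sketch added for illustration, not part of the original report), right-padding the decoder input and reading the logits at the start-token position avoids the discrepancy, because with causal self-attention the trailing pad tokens cannot influence position 0:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# one of the models from the list above that shows the issue with left padding
model_name = "facebook/bart-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"

encodings = tokenizer("generate some numbers",
                      padding="max_length",
                      max_length=5,
                      return_tensors="pt",
                      return_attention_mask=True,
                      truncation=True)

# baseline: a single decoder start token, no decoder padding
decoder_start = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.int64)
baseline = model(input_ids=encodings.input_ids,
                 attention_mask=encodings.attention_mask,
                 decoder_input_ids=decoder_start,
                 return_dict=True)
next_token_logits = baseline["logits"][:, -1, :]

# right-padded decoder input: start token first, pad tokens after it
decoder_input_ids_right_pad = torch.ones(1, 3, dtype=torch.int64) * tokenizer.pad_token_id
decoder_input_ids_right_pad[:, 0] = model.config.decoder_start_token_id
decoder_attn_mask_right = torch.zeros(1, 3, dtype=torch.int64)
decoder_attn_mask_right[:, 0] = 1

outputs_right_pad = model(input_ids=encodings.input_ids,
                          attention_mask=encodings.attention_mask,
                          decoder_input_ids=decoder_input_ids_right_pad,
                          decoder_attention_mask=decoder_attn_mask_right,
                          return_dict=True)

# the start token keeps absolute position 0, so read the logits at index 0 instead of -1
next_token_logits_right_pad = outputs_right_pad["logits"][:, 0, :]

print(torch.allclose(next_token_logits, next_token_logits_right_pad, atol=1e-3))

This only sidesteps the problem by keeping the real decoder token at the same absolute position in both calls; the left-padding inconsistency itself is analysed in the comments below.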

sgugger commented 2 years ago

cc @ArthurZucker

patrickvonplaten commented 2 years ago

@ArthurZucker let me know if you need help with this

jordiclive commented 2 years ago

@ArthurZucker I can have a look at this if it is not being looked at.

ArthurZucker commented 2 years ago

Hey! 🙌 it's on my to do list, but can't look at it right now so feel free to do so 😀🤗

jordiclive commented 2 years ago

@patrickvonplaten, I've had a look at this and stepped through BART.

I think it's solely to do with positional embeddings. T5 and MT5 use relative position embeddings, so it doesn't occur there.

For models like the original Transformer, the positional embeddings are summed directly onto the input embeddings, and whenever the input is left-padded the positional encodings are not shifted accordingly. This happens for both the encoder and the decoder forward pass with left-side padding, so the left padding above actually affects the encoder output as well. When I shift the positional embeddings according to the mask, the results match the unpadded case.

It is not usually a good idea to pad on the left side. I'm not sure if there is an efficient way to resolve this, as the amount of left padding (and hence the attention mask) can vary across examples in a batch.

e.g.

# continuing from the script above: two inputs that end up with different amounts of left padding
tokenizer.padding_side = "left"
encodings = tokenizer.batch_encode_plus(['sample_sentence',
                                         'A much much much longer sentence.'],
                                        padding="max_length",
                                        max_length=10,
                                        return_tensors="pt",
                                        return_attention_mask=True,
                                        truncation=True)

So a single batched shift can't be used; the offset has to be handled per example.
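
For illustration, here is a minimal sketch of computing shifted position indices from an attention mask on a per-example basis (note that BART doesn't accept a position_ids argument, so this only demonstrates the shift described above rather than being a drop-in fix):

import torch

def shifted_position_ids(attention_mask: torch.Tensor) -> torch.Tensor:
    # cumulative count of real tokens from the left, minus one, gives each real
    # token the position index it would have without the padding; pad positions
    # are then zeroed out (they are masked anyway)
    position_ids = attention_mask.long().cumsum(dim=-1) - 1
    return position_ids.masked_fill(attention_mask == 0, 0)

# two rows with different amounts of left padding are handled independently
mask = torch.tensor([[0, 0, 1, 1, 1],
                     [0, 1, 1, 1, 1]])
print(shifted_position_ids(mask))
# tensor([[0, 0, 0, 1, 2],
#         [0, 0, 1, 2, 3]])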

Let me know if you think there should be a PR; I'd like to be involved, as it took me a while to work this out 😅

patrickvonplaten commented 2 years ago

Gently ping @ArthurZucker :-) Let me know if you'd like me to take over the issue if you have too much on your plate

jordiclive commented 2 years ago

Sure. I've found the root cause (positional embeddings aren't shifted along with the left padding), and I don't think it is necessarily an issue, or easily resolvable. It only occurs with models that use non-relative (absolute) positional embeddings, e.g. BART.

@ArthurZucker I'm happy to help out more if you think there is a resolution. Perhaps a PR with a warning?

jordiclive commented 2 years ago

The same problem happens when trying to left pad BERT or any model with absolute position embeddings. I notice BERT has a warning in the docs under tips.

I think this issue can be closed. I can draft a PR adding a similar tip to the docs of the other affected models.

ArthurZucker commented 2 years ago

Hey! Really sorry for the late reply! Awesome work and debugging! 🤗 I totally get the gist of it 😅

Feel free to open a PR to either add the documentation tip you mentioned, or add support for shifting the positions when left padding is used.

Even if it is not really recommended, if people actually use left padding (either unconsciously or for a particular application) it makes sense to shift the input!

rajcscw commented 2 years ago

@jordiclive @ArthurZucker Thanks for looking into this. Is left padding not recommended only because of the position embeddings? In general, for batched next-token prediction, it is easier for users to get the logits of the last token for the entire batch with left padding. (I remember GPT-2 had a similar issue, and left padding support was added at some point, which made batch generation easier.)

Also, from the perspective of providing consistent behavior across the many seq2seq models exposed through the AutoModelForSeq2SeqLM API, shifting the positional embeddings in the case of left padding is desirable IMO.

jordiclive commented 2 years ago

@rajcscw Yes, it is just because of the old-style (absolute) positional embeddings. For GPT-2 and BERT there is an optional position_ids kwarg. That would be the only way to do it: the user would have to provide the position_ids, since the offset can vary for each input in the batch, and the positional embeddings can then be shifted accordingly.

I am not sure about your exact use case for seq2seq models.

In your script above you have left padding from the tokenizer for the encoder input and then a manual left pad of the decoder input ids. Supporting this would require two position_ids kwargs (one for the encoder and one for the decoder), since they would likely be offset differently.
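
For what it's worth, here is a minimal sketch of that route for a model that already exposes the kwarg (GPT-2; the seq2seq models discussed above don't currently accept position_ids, so this is only meant to illustrate the mechanism):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(["Hello", "A much much longer prompt"],
                  padding=True, return_tensors="pt")

# derive per-example position ids so that the real tokens start at position 0
position_ids = batch.attention_mask.long().cumsum(dim=-1) - 1
position_ids = position_ids.masked_fill(batch.attention_mask == 0, 0)

outputs = model(input_ids=batch.input_ids,
                attention_mask=batch.attention_mask,
                position_ids=position_ids)

# with left padding, the last position holds the next-token logits for every row
next_token_logits = outputs.logits[:, -1, :]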

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.