cc @ArthurZucker
@ArthurZucker let me know if you need help with this
@ArthurZucker I can have a look at this if it is not being looked at.
Hey! 🙌 it's on my to do list, but can't look at it right now so feel free to do so 😀🤗
@patrickvonplaten, I've had a look at this and stepped through BART.
I think it's solely to do with positional embeddings. T5 and MT5 use relative position embeddings, so it doesn't occur there.
For models like the original Transformer, where the positional embeddings are summed directly onto the input embeddings, the positional encodings are not shifted when the input is left-padded. This happens for both the encoder and decoder forward pass with left-side padding, so the left padding above actually affects the encoder output as well. When I shift the positional embeddings according to the mask, the results match the unpadded case.
It is not usually a good idea to pad on the left side. I'm not sure there is an efficient way to resolve this, as the amount of left padding (visible in the attention mask) can differ for every example in the batch.
e.g. (the exact tokenizer isn't shown in the issue; a BART tokenizer is loaded here just for illustration)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # any checkpoint with absolute position embeddings
tokenizer.padding_side = "left"
encodings = tokenizer.batch_encode_plus(
    ['sample_sentence', 'A much much much longer sentence.'],
    padding="max_length",
    max_length=10,
    return_tensors="pt",
    return_attention_mask=True,
    truncation=True,
)

Since each example ends up with a different amount of left padding, you can't fix the offsets with a single batched fold/shift operation.
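To make that concrete, here is a minimal sketch (not code from the issue) of how per-example position ids could be derived from the attention mask of the encodings above, which is essentially the shift described earlier; clamping the pad positions to 0 is an arbitrary choice:

attention_mask = encodings["attention_mask"]
# Cumulative sum over the mask gives 1-based positions for the real tokens;
# subtracting 1 and clamping lets the pad positions just reuse position 0.
position_ids = attention_mask.cumsum(dim=-1) - 1
position_ids = position_ids.clamp(min=0)
print(position_ids)  # each row starts counting from 0 at its first real token

As far as I can tell, BART's encoder/decoder don't expose a position_ids argument, so these can't simply be passed in.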
Let me know if you think there should be a PR, as I would like to be involved since it took me a while to work this out 😅
Gently ping @ArthurZucker :-) Let me know if you'd like me to take over the issue if you have too much on your plate
Sure. I've found the root cause (the positional embeddings aren't shifted along with the left padding) and I don't think it is necessarily an issue, or easily resolvable. It only occurs with models that use non-relative (absolute) positional embeddings, e.g. BART.
@ArthurZucker I'm happy to help out more if you think there is a resolution. Perhaps a PR with a warning?
The same problem happens when trying to left pad BERT or any model with absolute position embeddings. I notice BERT has a warning in the docs under tips.
I think this issue can be closed. I can draft a PR adding a similar tip to the docs of the other affected models, like the one BERT has.
Hey! Really sorry for the late reply! Awesome work and debugging! 🤗 I totally get the gist of it 😅
Feel free to open a PR to either:
Even if it is not really recommended, if people actually use left padding (either unconsciously or for a particular application) it makes sense to shift the input!
@jordiclive @ArthurZucker Thanks for looking into this. Is left padding not recommended only because of the position embeddings? In general, for batched next-token prediction, left padding makes it easier for users to get the logits of the last token for every example in the batch. (I remember GPT-2 had a similar issue, and left-padding support was added at some point, which made batch generation easier.)
Also from the perspective of providing consistent behavior across many seq2seq models (through AutoModelForSeq2Seq API), shifting the positional embeddings in case of left padding is desired IMO.
@rajcscw. Yes, it is just because of the old-style absolute positional embeddings. For GPT-2 and BERT there is an optional position_ids kwarg. That would be the only way to do it: the user would have to provide the position_ids, since the offset can differ for each input in the batch, and then the positional embeddings can be shifted accordingly.
I am not sure about your exact use case for seq2seq models.
Above you have left padding via the tokenizer for the encoder input and then a manual left pad of the decoder_input_ids. This would require two position_ids kwargs (one for the encoder, one for the decoder), as they would likely be offset differently.
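For reference, here is a minimal sketch of that workaround for a decoder-only model, where the position_ids kwarg does exist (assuming the "gpt2" checkpoint; not code from this thread):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(["Hello", "A much much longer prompt"], padding=True, return_tensors="pt")

# Derive per-example positions from the attention mask so the absolute
# position embeddings line up with the real tokens rather than the pads.
position_ids = batch["attention_mask"].cumsum(dim=-1) - 1
position_ids = position_ids.clamp(min=0)

with torch.no_grad():
    out = model(**batch, position_ids=position_ids)
last_token_logits = out.logits[:, -1, :]  # with left padding, the last position is a real token for every row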
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers: 4.18.0
torch: 1.12.0
Python: 3.7.13
Who can help?
@patrickvonplaten @patil-suraj
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
This issue is regarding seq2seq models for conditional text generation.
There are differences in the output logits when padding is used for the decoder_input_ids (passing a matching decoder_attention_mask as well). This issue exists only for a few models (e.g. BART, BlenderBot, Pegasus), while for other models there are no output differences (e.g. T5, MT5). Hence the output is not consistent across different seq2seq models.
To reproduce these differences, run the provided script, which compares the output logits with and without padded decoder_input_ids. This is done for several seq2seq models to see which of them show differences.
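The original script isn't attached here, but the following is a minimal sketch of the kind of comparison being described, assuming the facebook/bart-base checkpoint and an arbitrary short decoder prefix:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base").eval()

enc = tokenizer(["A sentence to encode."], return_tensors="pt")

# A short decoder prefix, starting from the configured decoder start token.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id, 100, 200]])

with torch.no_grad():
    logits_plain = model(**enc, decoder_input_ids=decoder_input_ids).logits

# The same prefix, manually left-padded with pad_token_id and a matching mask.
pad = torch.full((1, 2), model.config.pad_token_id, dtype=torch.long)
padded_ids = torch.cat([pad, decoder_input_ids], dim=-1)
padded_mask = torch.cat([torch.zeros_like(pad), torch.ones_like(decoder_input_ids)], dim=-1)

with torch.no_grad():
    logits_padded = model(**enc,
                          decoder_input_ids=padded_ids,
                          decoder_attention_mask=padded_mask).logits

# Per the discussion above, for BART this is expected to print False: the
# decoder's absolute position embeddings are not shifted for the pads, so the
# logits at the real (non-pad) positions differ from the unpadded run.
print(torch.allclose(logits_plain, logits_padded[:, pad.shape[1]:, :], atol=1e-4))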
Expected behavior
Ideally, we would expect padding not to cause any such differences.