This link may be helpful. https://github.com/huggingface/transformers/blob/v4.19.3/src/transformers/models/roberta/modeling_roberta.py#L807
In RoBERTa, if attention_mask is None, the attention mask for every token is set to 1 by default.
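A minimal sketch of that default behavior (the `roberta-base` checkpoint and the example sentences are placeholders of mine, not from the issue): when no attention_mask is passed, RobertaModel falls back to an all-ones mask internally, which is only safe if the batch contains no padding.

```python
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

enc = tokenizer(
    ["a short sentence", "a somewhat longer example sentence here"],
    padding=True,
    return_tensors="pt",
)

# No mask passed: the model builds an all-ones mask over every position,
# so the pad tokens of the shorter sentence are attended to as if they were real.
out_unmasked = model(input_ids=enc["input_ids"])

# Mask passed: pad positions receive a large negative bias and are effectively ignored.
out_masked = model(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
```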
If the attention mask is 1 for all tokens, wouldn't there be a problem when dealing with multiple sequences? (Since the padding input_ids won't be ignored when computing the attention scores.)
Let me explain with an example. If batch_size = 2:
sample instance 1: [u1; u2; u3]
sample instance 2: [u1; u2; u3; u4]
The input of instance 1 then has to be padded to the length of instance 2, and with an all-ones mask those pad tokens would still be attended to.
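For concreteness, here is a minimal sketch of that situation (the utterance strings and the `roberta-base` checkpoint are placeholders of mine): the tokenizer pads instance 1 up to the length of instance 2 and returns the attention_mask that should be forwarded to the model.

```python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# Stand-ins for the two sample instances; instance 1 is shorter than instance 2.
instance1 = "u1 u2 u3"
instance2 = "u1 u2 u3 u4"

enc = tokenizer([instance1, instance2], padding=True, return_tensors="pt")
print(enc["input_ids"])       # instance 1 is right-padded with <pad> ids to match instance 2
print(enc["attention_mask"])  # 0 at the padded positions of instance 1, 1 everywhere else

# If enc["attention_mask"] is not passed to the model, those pad tokens
# still contribute to the attention scores, which is exactly the concern above.
```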
As you were concerned, we do not explicitly set the attention mask for pad tokens to 0. So the attention mask will be 1 even for padding tokens when batch_size is greater than 1. We missed this part because we set batch_size to 1 when training.
When we train the model, batch_size is set to 1, so this problem did not occur. Even if batch_size is greater than 1 and the mask of the pad tokens is left at 1, we expect the model can still learn to ignore the padded part during training. However, your comment will make the model train more effectively. Thanks.
Thank you for your detailed explanation. Can I conclude that if I manually set batch_size to 16, for example, this will not have a negative impact on training due to the attention mask issue? Or am I mistaken and misreading your comment?
There may be a negative impact, but it is expected to be small. To remove the effect, the attention mask for the padding tokens must be set to 0.
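A minimal sketch of that fix, assuming the batch is already padded with the tokenizer's pad token id (the helper name and the toy ids below are illustrative; RoBERTa's default pad token id is 1):

```python
import torch

def build_attention_mask(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Return a mask that is 1 for real tokens and 0 for padding tokens."""
    return (input_ids != pad_token_id).long()

# Toy batch of 2: the first sequence carries one trailing <pad> (id 1 for RoBERTa).
input_ids = torch.tensor([
    [0, 100, 200, 300, 2, 1],
    [0, 100, 200, 300, 400, 2],
])
attention_mask = build_attention_mask(input_ids, pad_token_id=1)
# tensor([[1, 1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1, 1]])

# outputs = model(input_ids=input_ids, attention_mask=attention_mask)
```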
You got it right.
I don't see any operations on attention_mask; does that mean the RoBERTa model will set all attention_mask values to 1?