Closed: mar4th3 closed this issue 1 year ago
Hi @mar4th3

The `attention_mask` passed to `LongformerModel` has a different meaning than the one passed to `LongformerSelfAttention`.

For `LongformerModel`, it is the usual one: 1 is for not-masked (attended) tokens, while 0 is for masked (not attended, e.g. padding) tokens.
Later it is changed to the 0, 1, 2 format here:
https://github.com/huggingface/transformers/blob/e95bcaeef0bd6b084b7615faae411a14d50bcfee/src/transformers/models/longformer/modeling_longformer.py#L1725
Then, a few lines later, it is changed to the -10000, 0, +10000 format:
https://github.com/huggingface/transformers/blob/e95bcaeef0bd6b084b7615faae411a14d50bcfee/src/transformers/models/longformer/modeling_longformer.py#L1725
(you can check the details of `get_extended_attention_mask` if you would like). This is the format expected by `LongformerSelfAttention`.
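The two conversions above can be illustrated with a small sketch. This is a hedged, simplified mock-up of the mapping, not the library's actual code (the real implementation works on tensors and uses `get_extended_attention_mask`); the variable names are illustrative:

```python
# Illustrative sketch (not the library's actual code) of the mask formats
# Longformer moves through, for one short sequence with one padded position
# and one token flagged for global attention.

attention_mask = [1, 1, 1, 0]          # LongformerModel input: 1 = attend, 0 = pad
global_attention_mask = [1, 0, 0, 0]   # 1 = global attention requested

# Step 1: merge into the 0, 1, 2 format (0 = pad, 1 = local, 2 = global).
merged = [a * (g + 1) for a, g in zip(attention_mask, global_attention_mask)]
print(merged)  # [2, 1, 1, 0]

# Step 2: convert to the additive-bias format that LongformerSelfAttention
# expects: -10000 = masked, 0 = local attention, +10000 = global attention.
extended = [(m - 1) * 10000.0 for m in merged]
print(extended)  # [10000.0, 0.0, 0.0, -10000.0]
```

Note that in the additive form, padding positions are the only negative entries, which is what the model checks for downstream.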
Hope this clarifies things a bit 🤗
System Info
`transformers` version: 4.32.0

Who can help?
@ArthurZucker @younesbelkada

Information
Tasks: examples folder (such as GLUE/SQuAD, ...)

Reproduction
First of all, I am not sure whether this is actually a bug; I cannot come up with a recipe to verify whether it is indeed an issue. It is related to how the `attention_mask` should be defined for `LongformerModel`.

The documentation for the forward method states that for the `attention_mask` (the one for the sliding-window attention), 1 is for not-masked (attended) tokens while 0 is for masked (not attended, e.g. padding) tokens. However, the docstring in the forward method of `LongformerSelfAttention` describes a different format, and in `LongformerModel.forward` I could not find any explicit conversion.

If you look at the forward method of the `LongformerEncoder` class, there is a line that treats negative mask values as masked. This seems to validate the docstring in `LongformerSelfAttention` and contradict the documentation reported in `LongformerModel`, i.e. it suggests that to effectively mask padding tokens from attention the corresponding `attention_mask` value should be -1 and not 0.

Could someone please verify whether I am mistaken and missed something, or whether this is actually a documentation issue?
Thank you.
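The apparent contradiction can be reconciled with a small sketch. This is a hedged mock-up, not the library's actual code: it assumes the user-facing 0/1 mask has already been converted to the internal additive form by the time the encoder runs, so a negative-value check does single out padding:

```python
# Hedged sketch: the documented user-facing mask (1 = attend, 0 = pad)
# versus the internal additive form the encoder actually checks.

user_mask = [1, 1, 0, 0]                            # documented 0/1 format
extended = [(m - 1) * 10000.0 for m in user_mask]   # internal additive form
is_index_masked = [v < 0 for v in extended]         # encoder-style negative check
print(is_index_masked)  # [False, False, True, True]
```

Under this assumption, the 0/1 documentation and the negative-value check are both correct; they simply describe the mask at different stages of the forward pass.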
Expected behavior
Documentation matching code (?).