huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Longformer: attention mask: documentation inconsistent with implementation (?) #25866

Closed: mar4th3 closed this issue 1 year ago

mar4th3 commented 1 year ago

System Info

Who can help?

@ArthurZucker @younesbelkada

Information

Tasks

Reproduction

First of all, I am not sure whether this is actually a bug.

I cannot come up with a recipe to verify whether this is indeed an issue.

This is related to how the attention_mask should be defined for the LongformerModel.

In the documentation for the forward method, it is stated that for attention_mask (the one for the sliding-window attention) a 1 is for not-masked (attended) tokens, while a 0 is for masked (not attended, e.g. padding) tokens.
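For concreteness, this is how I read the documented convention (a minimal sketch; allenai/longformer-base-4096 is only used as an example checkpoint):

    import torch
    from transformers import AutoTokenizer, LongformerModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    # The tokenizer builds attention_mask exactly as documented:
    # 1 for real (attended) tokens, 0 for padding (not attended).
    enc = tokenizer(
        ["a short sequence", "a slightly longer example sequence"],
        padding=True,
        return_tensors="pt",
    )
    print(enc["attention_mask"])  # rows of 1s with trailing 0s for the padded positions

    with torch.no_grad():
        outputs = model(**enc)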

However, in the forward method of LongformerSelfAttention the docstring says:

 """
        The *attention_mask* is changed in [`LongformerModel.forward`] from 0, 1, 2 to:

            - -10000: no attention
            - 0: local attention
            - +10000: global attention
        """

However, in LongformerModel.forward I could not find any explicit conversion.

If you look at the forward method of the LongformerEncoder class, there is this line:

is_index_masked = attention_mask < 0

which seems to confirm the docstring in LongformerSelfAttention and to contradict the documentation of LongformerModel, i.e. that to effectively mask padding tokens from attention the corresponding attention_mask value should be negative (e.g. -1) and not 0.
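As an illustration of my reading of that check (the tensor values are made up, following the docstring quoted above):

    import torch

    # Mask in the format the LongformerSelfAttention docstring describes:
    # -10000 = no attention (padding), 0 = local attention, +10000 = global attention
    attention_mask = torch.tensor([[0.0, 0.0, 10000.0, 0.0, -10000.0, -10000.0]])

    is_index_masked = attention_mask < 0       # True only at the padding positions
    is_index_global_attn = attention_mask > 0  # True only at the global-attention position
    print(is_index_masked)
    print(is_index_global_attn)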

Could someone please verify whether I am mistaken and have missed something, or whether this is actually a documentation issue?

Thank you.

Expected behavior

Documentation matching code (?).

ydshieh commented 1 year ago

Hi @mar4th3

The attention_mask passed to LongformerModel has a different meaning than the one passed to LongformerSelfAttention.

For LongformerModel, it is the documented one: 1 is for not-masked (attended) tokens, while 0 is for masked (not attended, e.g. padding) tokens. Later it is changed to the 0, 1, 2 format here: https://github.com/huggingface/transformers/blob/e95bcaeef0bd6b084b7615faae411a14d50bcfee/src/transformers/models/longformer/modeling_longformer.py#L1725 Then, a few lines later, it is changed to the -10000, 0, +10000 format: https://github.com/huggingface/transformers/blob/e95bcaeef0bd6b084b7615faae411a14d50bcfee/src/transformers/models/longformer/modeling_longformer.py#L1725 (you can check the details of get_extended_attention_mask if you would like). This is the one expected by LongformerSelfAttention.
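A conceptual sketch of those two steps (not the library code itself; the merge mirrors what _merge_to_attention_mask does, while step 2 only mimics the effect of get_extended_attention_mask, which in recent versions uses the dtype's minimum value rather than a literal 10000):

    import torch

    attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])         # 1 = attended, 0 = padding
    global_attention_mask = torch.tensor([[1, 0, 0, 0, 0, 0]])  # global attention on the first token

    # Step 1: merge into the 0, 1, 2 format (0 = padding, 1 = local, 2 = global)
    merged = attention_mask * (global_attention_mask + 1)
    print(merged)  # tensor([[2, 1, 1, 1, 0, 0]])

    # Step 2: expand to the -10000, 0, +10000 format expected by LongformerSelfAttention
    extended = (merged.float() - 1.0) * 10000.0
    print(extended)  # tensor([[ 10000., 0., 0., 0., -10000., -10000.]])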

Hope this clarifies things a bit 🤗

mar4th3 commented 1 year ago

Thanks for the swift reply. It does clarify.

I didn't think of checking what happens here.

Thanks again!