huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Longformer: attention mask: documentation inconsistent with implementation (?) #25866

Closed: mar4th3 closed this issue 1 year ago

mar4th3 commented 1 year ago

System Info

Who can help?

@ArthurZucker @younesbelkada

Information

Tasks

Reproduction

First of all, I am not sure whether this is actually a bug.

I cannot come up with a recipe to verify whether this is indeed an issue.

This is related to how the attention_mask should be defined for the LongformerModel.

In the documentation for the forward method, it is stated that for attention_mask (the one for the sliding-window attention) a 1 is for not-masked (attended) tokens, while a 0 is for masked (not attended, e.g. padding) tokens.
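For concreteness, this is how I read the documented convention (a minimal sketch; allenai/longformer-base-4096 is only used as an example checkpoint):

    import torch
    from transformers import AutoTokenizer, LongformerModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
    model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

    # The tokenizer builds attention_mask exactly as documented:
    # 1 for real (attended) tokens, 0 for padding (not attended).
    enc = tokenizer(
        ["a short sequence", "a slightly longer example sequence"],
        padding=True,
        return_tensors="pt",
    )
    print(enc["attention_mask"])  # rows of 1s with trailing 0s for the padded positions

    with torch.no_grad():
        outputs = model(**enc)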

However, in the forward method of LongformerSelfAttention the docstring says:

 """
        The *attention_mask* is changed in [`LongformerModel.forward`] from 0, 1, 2 to:

            - -10000: no attention
            - 0: local attention
            - +10000: global attention
        """

However, in LongformerModel.forward I could not find any explicit conversion.

If you look at the forward method of the LongformerEncoder class, there is this line:

is_index_masked = attention_mask < 0

which seems to confirm the docstring in LongformerSelfAttention and to contradict the documentation of LongformerModel, i.e. that to effectively mask padding tokens from attention the corresponding attention_mask value should be negative (e.g. -1) and not 0.
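As an illustration of my reading of that check (the tensor values are made up, following the docstring quoted above):

    import torch

    # Mask in the format the LongformerSelfAttention docstring describes:
    # -10000 = no attention (padding), 0 = local attention, +10000 = global attention
    attention_mask = torch.tensor([[0.0, 0.0, 10000.0, 0.0, -10000.0, -10000.0]])

    is_index_masked = attention_mask < 0       # True only at the padding positions
    is_index_global_attn = attention_mask > 0  # True only at the global-attention position
    print(is_index_masked)
    print(is_index_global_attn)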

Could someone please verify whether I am mistaken and have missed something, or whether this is actually a documentation issue?

Thank you.

Expected behavior

Documentation matching code (?).

ydshieh commented 1 year ago

Hi @mar4th3

The attention_mask passed to LongformerModel has a different meaning than the one passed to LongformerSelfAttention.

For LongformerModel, it is the documented one: 1 is for not-masked (attended) tokens, while 0 is for masked (not attended, e.g. padding) tokens. Later it is changed to the 0, 1, 2 format here: https://github.com/huggingface/transformers/blob/e95bcaeef0bd6b084b7615faae411a14d50bcfee/src/transformers/models/longformer/modeling_longformer.py#L1725 Then, a few lines later, it is changed to the -10000, 0, +10000 format: https://github.com/huggingface/transformers/blob/e95bcaeef0bd6b084b7615faae411a14d50bcfee/src/transformers/models/longformer/modeling_longformer.py#L1725 (you can check the details of get_extended_attention_mask if you would like). This is the one expected by LongformerSelfAttention.
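A conceptual sketch of those two steps (not the library code itself; the merge mirrors what _merge_to_attention_mask does, while step 2 only mimics the effect of get_extended_attention_mask, which in recent versions uses the dtype's minimum value rather than a literal 10000):

    import torch

    attention_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])         # 1 = attended, 0 = padding
    global_attention_mask = torch.tensor([[1, 0, 0, 0, 0, 0]])  # global attention on the first token

    # Step 1: merge into the 0, 1, 2 format (0 = padding, 1 = local, 2 = global)
    merged = attention_mask * (global_attention_mask + 1)
    print(merged)  # tensor([[2, 1, 1, 1, 0, 0]])

    # Step 2: expand to the -10000, 0, +10000 format expected by LongformerSelfAttention
    extended = (merged.float() - 1.0) * 10000.0
    print(extended)  # tensor([[ 10000., 0., 0., 0., -10000., -10000.]])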

Hope this clarifies things a bit 🤗

mar4th3 commented 1 year ago

Thanks for the swift reply. It does clarify.

I didn't think of checking what happens here.

Thanks again!