huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

tensorflow LED global_attention_mask shape mismatch error #13910

Closed federicoruggeri closed 2 years ago

federicoruggeri commented 2 years ago

Environment info

Who can help

@patrickvonplaten @Rocketknight1

Information

Model I am using (Bert, XLNet ...): 'allenai/led-base-16384' via AutoModelForSeq2SeqLM

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

  1. Run the TensorFlow version of a simple test script.
    
```python
from transformers import TFAutoModelForSeq2SeqLM
import tensorflow as tf
import numpy as np


@tf.function
def test_gradient(inputs):
    with tf.GradientTape() as tape:
        led_output = led.call(
            input_ids=inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            labels=inputs['labels'],
            global_attention_mask=inputs['global_attention_mask'] if 'global_attention_mask' in inputs else None,
            training=True,
            use_cache=False,
            return_dict=True,
            output_hidden_states=True)

    grads = tape.gradient(led_output['loss'], led.trainable_variables)

    return led_output


@tf.function
def test_model(inputs):
    led_output = led.call(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        labels=inputs['labels'],
        global_attention_mask=inputs['global_attention_mask'] if 'global_attention_mask' in inputs else None,
        training=True,
        use_cache=False,
        return_dict=True,
        output_hidden_states=True)

    return led_output


preloaded_name = 'allenai/led-base-16384'
led = TFAutoModelForSeq2SeqLM.from_pretrained(preloaded_name, from_pt=True)

"""
In this example, we have the following shapes:
input_length  --> 1800
output_length --> 70
"""
inputs = np.load('inputs_with_mask.npy', allow_pickle=True).item()

print('Inputs...')
for key, value in inputs.items():
    print('Key: {0} - Value: {1}'.format(key, value.shape))

"""
Prints:
input_ids             - Value: (1, 1800)
attention_mask        - Value: (1, 1800)
global_attention_mask - Value: (1, 1800)
labels                - Value: (1, 70)
"""

# Test with gradient tape
led_output = test_gradient(inputs=inputs)

# Test without gradient tape
led_output = test_model(inputs=inputs)
```


Running the above script throws the following error:
```python
Traceback (most recent call last):
  File "/home/fruggeri/Repositories/arg-chatbot/runnables/tests/test_led_tf.py", line 47, in <module>
    led_output = test_model(inputs=inputs)
  File "/home/fruggeri/tf2.3/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 780, in __call__
    result = self._call(*args, **kwds)
  File "/home/fruggeri/tf2.3/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 846, in _call
    return self._concrete_stateful_fn._filtered_call(canon_args, canon_kwds)  # pylint: disable=protected-access
  File "/home/fruggeri/tf2.3/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
    cancellation_manager=cancellation_manager)
  File "/home/fruggeri/tf2.3/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/fruggeri/tf2.3/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 550, in call
    ctx=ctx)
  File "/home/fruggeri/tf2.3/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError:  Incompatible shapes: [1,2048,12,1045] vs. [1,2048,12,1025]
     [[node led/encoder/layers.0/self_attn/longformer_self_attn/dropout_1/dropout/Mul_1 (defined at /tf2.3/lib/python3.6/site-packages/transformers/models/led/modeling_tf_led.py:303) ]] [Op:__inference_test_model_27392]

Function call stack:
test_model
```
  2. Run the PyTorch version of the same script.
    
```python
from transformers import AutoModelForSeq2SeqLM
import numpy as np
import torch

preloaded_name = 'allenai/led-base-16384'
led = AutoModelForSeq2SeqLM.from_pretrained(preloaded_name)

"""
NOTE: same inputs as in the TensorFlow example!

In this example, we have the following shapes:
input_length  --> 1800
output_length --> 70
"""
inputs = np.load('inputs_with_mask.npy', allow_pickle=True).item()
inputs = {key: torch.Tensor(value.numpy()).long() for key, value in inputs.items()}

print('Inputs...')
for key, value in inputs.items():
    print('Key: {0} - Value: {1}'.format(key, value.shape))

"""
Prints:
input_ids             - Value: (1, 1800)
attention_mask        - Value: (1, 1800)
global_attention_mask - Value: (1, 1800)
labels                - Value: (1, 70)
"""

# Uncomment to set model training mode
led.train()

led_output = led(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    labels=inputs['labels'],
    global_attention_mask=inputs['global_attention_mask'] if 'global_attention_mask' in inputs else None,
    use_cache=False,
    return_dict=True,
    output_hidden_states=True)

# Uncomment to test with gradient
led_output['loss'].backward()

optim = torch.optim.SGD(led.parameters(), lr=1e-2, momentum=0.9)
optim.step()
```
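As a rough reading of the shapes in the TensorFlow traceback above (assumptions: the checkpoint's attention window is 1024, and the last dimension holds the ```window + 1``` local attention scores plus one slot per global-attention token):

```python
import math

attention_window = 1024  # assumed window size for 'allenai/led-base-16384'
input_length = 1800

# LED pads inputs to a multiple of the attention window, which would
# explain the 2048 in the error shapes: ceil(1800 / 1024) * 1024 = 2048.
padded_length = math.ceil(input_length / attention_window) * attention_window
print(padded_length)         # 2048

print(attention_window + 1)  # 1025: local attention scores per position
print(1045 - 1025)           # 20: extra slots, plausibly the global tokens
```

Under that reading, the 1045 vs. 1025 mismatch would mean the global-attention slots are handled inconsistently between the traced and eager paths.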




## Expected behavior


Running the TensorFlow script (step 1) throws the mentioned error in the self-attention block.
Interestingly, this error does not occur if ```@tf.function``` is removed from the ```test_model``` function.

On the other hand, the PyTorch version runs without any error, even when performing the backward pass and gradient computation.
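Given that removing ```@tf.function``` avoids the error, one interim workaround is to force compiled functions to run eagerly. A minimal sketch (this trades away graph-mode performance and is not a fix):

```python
import tensorflow as tf

# Make tf.function-decorated functions run eagerly instead of as traced
# graphs; the shape mismatch only appears in the compiled path.
tf.config.run_functions_eagerly(True)

led_output = test_model(inputs=inputs)
```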

For clarity: both ```attention_mask``` and ```global_attention_mask``` are binary masks (integers). 

Am I doing something wrong in defining the ```global_attention_mask```?
Testing the TensorFlow script (step 1) with ```global_attention_mask=None``` works smoothly, even with gradient backpropagation.
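For context, the conventional LED/Longformer recipe puts global attention only on the first token. A minimal sketch of that convention, reusing the ```inputs``` dictionary loaded above:

```python
import numpy as np

# 1 -> token attends globally, 0 -> local (windowed) attention only.
# LED's documented recipe: global attention on the first token.
global_attention_mask = np.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1
inputs['global_attention_mask'] = global_attention_mask
```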

I've also tried changing the model's ```attention_window_size``` parameter (e.g., matching it to my input size), but without success: the error is still thrown.
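For completeness, that experiment can be set up at load time; a minimal sketch, assuming the intent is to override the ```attention_window``` attribute of ```LEDConfig``` (the value 512 below is purely illustrative):

```python
from transformers import AutoConfig, TFAutoModelForSeq2SeqLM

# Sketch: reload the model with a different attention window.
# Extra kwargs passed to from_pretrained override config attributes.
config = AutoConfig.from_pretrained('allenai/led-base-16384', attention_window=512)
led = TFAutoModelForSeq2SeqLM.from_pretrained(
    'allenai/led-base-16384', config=config, from_pt=True)
```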
github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patrickvonplaten commented 2 years ago

Hey @federicoruggeri,

If I remember correctly, TF LED cannot be compiled with output_attention_mask=True. This is a difficult bug and I don't think we'll be able to allocate time soon to solve this, I'm afraid :-/

If you'd be willing to open a PR and dig deeper into this, I'm happy to help however!

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.