huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
128.29k stars 25.45k forks

Correct way to output attentions in BERT Model #31593

Closed JonathanBhimani-Burrows closed 3 days ago

JonathanBhimani-Burrows commented 3 days ago

System Info

Who can help?

No response

Information

Tasks

Reproduction


Expected behavior

Hey everyone, thanks in advance for the help! I'm running into an issue when I try to output the attentions from a BERT model. The model itself has multiple BERT encoders: one that handles regular tabular data, and one that handles tabular time-series data.

When I set `output_attentions=True`, I get a CUDA OOM error once the evaluation loop runs (presumably because the attention tensors are still attached to the autograd graph). To avoid this, I tried detaching the tensors before returning them. That fixed the memory error, but now the eval loop starts at the same iterations/s as training and progressively slows down, dropping to 1 it/s about 30% of the way through validation, so the validation loop takes 3x longer than a training epoch. To speed things up, I don't return all the attentions, only the first and last layers:

```python
o1 = outputs[1][0].detach().cpu()
o2 = outputs[1][-1].detach().cpu()
attns = (o1, o2)
output = (logits,) + attns + outputs[2:]
return ((loss,) + output) if loss is not None else output
```

However, even with this, I have the same problem. Is this, in theory, the correct way to do this? Or is there another way, such as using a callback, that is recommended/preferred by Hugging Face? Thanks, Jonathan
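For context, one common way to avoid attention tensors staying attached to the autograd graph during evaluation is to run the whole eval forward pass under `torch.no_grad()`, so no `.detach()` is needed and no graph is retained at all. A minimal sketch of the principle, using a stand-in module rather than the multi-encoder BERT model from this report:

```python
import torch
import torch.nn as nn

# Stand-in for a model whose outputs we want to collect during eval;
# not the reporter's model, just an illustration of graph retention.
model = nn.Linear(8, 2)
x = torch.randn(4, 8)

# Training-style forward: the output is attached to the autograd graph,
# so keeping it alive keeps all intermediate activations alive too.
out_train = model(x)
assert out_train.requires_grad

# Evaluation under no_grad: nothing is attached to the graph, so the
# output can be stored or moved to CPU without calling .detach().
with torch.no_grad():
    out_eval = model(x)
assert not out_eval.requires_grad

# Moving to CPU is then just a device copy with no autograd state.
out_cpu = out_eval.cpu()
assert not out_cpu.requires_grad
```

Note that the repeated synchronous `.cpu()` copies themselves can also slow an eval loop down, independently of autograd, so where the slowdown comes from is worth profiling separately.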

amyeroberts commented 3 days ago

Hi @JonathanBhimani-Burrows, thanks for raising an issue!

This is a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

The best way to get help (here and on the forums) is to share as minimal an example as possible that enables someone to replicate the issue, along with any other relevant information: for example, the size of the model, the hardware you're running on, how the training and evaluation loops are set up, etc.
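As an illustration of the kind of minimal example that makes such issues reproducible, here is a hypothetical sketch that grabs only the first- and last-layer attentions from a tiny, randomly initialized BERT (all sizes are placeholders, not taken from the original report):

```python
import torch
from transformers import BertConfig, BertModel

# A tiny randomly initialized BERT; the config values are illustrative.
config = BertConfig(hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)
model.eval()

input_ids = torch.randint(0, config.vocab_size, (1, 10))
with torch.no_grad():
    outputs = model(input_ids, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of
# shape (batch, num_heads, seq_len, seq_len).
first, last = outputs.attentions[0], outputs.attentions[-1]
assert first.shape == (1, 2, 10, 10)
assert last.shape == (1, 2, 10, 10)
```

A self-contained snippet like this, plus hardware details and loop setup, is usually enough for maintainers to reproduce memory or throughput problems.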