huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
128.29k stars 25.45k forks

Correct way to output attentions in BERT Model #31593

Closed JonathanBhimani-Burrows closed 3 days ago

JonathanBhimani-Burrows commented 3 days ago

System Info

Who can help?

No response

Information

Tasks

Reproduction


Expected behavior

Hey everyone, thanks in advance for the help! I'm running into an issue when I try to output the attentions from a BERT model. The model itself has multiple BERT encoders: one that handles regular tabular data, and one that handles tabular time-series data.

When I set `output_attentions=True`, I get a CUDA OOM error once the evaluation loop runs (presumably because the attention tensors are still attached to the autograd graph). To avoid this, I tried detaching the tensors before returning them. That fixed the memory error, but now the eval loop starts at the same iterations/s as training and progressively slows down, dropping to 1 it/s about 30% of the way through validation, so the validation loop takes 3x longer than a training epoch. To speed things up, I don't return all the attentions, only the first and last layers:

```python
o1 = outputs[1][0].detach().cpu()
o2 = outputs[1][-1].detach().cpu()
attns = (o1, o2)
output = (logits,) + attns + outputs[2:]
return ((loss,) + output) if loss is not None else output
```

However, even with this, I have the same problem. Is this, in theory, the correct way to do this? Or is there another way, such as using a callback, that is recommended/preferred by Hugging Face? Thanks, Jonathan
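For context, one common way to avoid attention tensors staying attached to the autograd graph during evaluation is to run the whole eval forward pass under `torch.no_grad()`, so no `.detach()` is needed and no graph is retained at all. A minimal sketch of the principle, using a stand-in module rather than the multi-encoder BERT model from this report:

```python
import torch
import torch.nn as nn

# Stand-in for a model whose outputs we want to collect during eval;
# not the reporter's model, just an illustration of graph retention.
model = nn.Linear(8, 2)
x = torch.randn(4, 8)

# Training-style forward: the output is attached to the autograd graph,
# so keeping it alive keeps all intermediate activations alive too.
out_train = model(x)
assert out_train.requires_grad

# Evaluation under no_grad: nothing is attached to the graph, so the
# output can be stored or moved to CPU without calling .detach().
with torch.no_grad():
    out_eval = model(x)
assert not out_eval.requires_grad

# Moving to CPU is then just a device copy with no autograd state.
out_cpu = out_eval.cpu()
assert not out_cpu.requires_grad
```

Note that the repeated synchronous `.cpu()` copies themselves can also slow an eval loop down, independently of autograd, so where the slowdown comes from is worth profiling separately.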

amyeroberts commented 3 days ago

Hi @JonathanBhimani-Burrows, thanks for raising an issue!

This is a question best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

The best way to get help (here and on the forums) is to share as minimal an example as possible that enables someone to replicate the issue, along with any other relevant information: for example, the size of the model, the hardware you're running on, how the training and evaluation loops are set up, etc.
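As an illustration of the kind of minimal example that makes such issues reproducible, here is a hypothetical sketch that grabs only the first- and last-layer attentions from a tiny, randomly initialized BERT (all sizes are placeholders, not taken from the original report):

```python
import torch
from transformers import BertConfig, BertModel

# A tiny randomly initialized BERT; the config values are illustrative.
config = BertConfig(hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BertModel(config)
model.eval()

input_ids = torch.randint(0, config.vocab_size, (1, 10))
with torch.no_grad():
    outputs = model(input_ids, output_attentions=True)

# outputs.attentions is a tuple with one tensor per layer, each of
# shape (batch, num_heads, seq_len, seq_len).
first, last = outputs.attentions[0], outputs.attentions[-1]
assert first.shape == (1, 2, 10, 10)
assert last.shape == (1, 2, 10, 10)
```

A self-contained snippet like this, plus hardware details and loop setup, is usually enough for maintainers to reproduce memory or throughput problems.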