huggingface / transformers


Accessing gradients of Bart hidden states #8601

Closed. thoppe closed this issue 3 years ago.

thoppe commented 3 years ago

The forums suggested that this be filed as a bug report:

https://discuss.huggingface.co/t/finding-gradients-in-zero-shot-learning/2033/5

The problem was solved on Stack Overflow:

https://stackoverflow.com/questions/64823332/gradients-returning-none-in-huggingface-module/64866990#64866990

The question and answer are reproduced below. Filing this as an issue because we should be able to compute gradients on the output without a monkey-patch. It looks like the transpose is causing it.

Environment info

Who can help

Bart: @patrickvonplaten

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The task I am working on is:

To reproduce

from transformers import pipeline
import torch

model_name = 'facebook/bart-large-mnli'
nlp = pipeline("zero-shot-classification", model=model_name)

responses = ["I'm having a great day!!"]
hypothesis_template = 'This person feels {}'
candidate_labels = ['happy', 'sad']
nlp(responses, candidate_labels, hypothesis_template=hypothesis_template)

This works well! The output is:

{'sequence': "I'm having a great day!!",
 'labels': ['happy', 'sad'],
 'scores': [0.9989933371543884, 0.0010066736722365022]}

What I'd like to do, however, is look at the gradients of the input tokens to see which tokens are important. This is in contrast to looking at the attention heads (which is another viable tactic). Trying to rip apart the internals of the module, I can get the logits and embedding layers:

inputs = nlp._parse_and_tokenize(responses, candidate_labels, hypothesis_template)
predictions = nlp.model(**inputs, return_dict=True, output_hidden_states=True)
predictions['logits']

tensor([[-3.1864, -0.0714,  3.2625],
        [ 4.5919, -1.9473, -3.6376]], grad_fn=<AddmmBackward>)

This is expected, as the label for "happy" is index 0 and the entailment index for this model is 2, so the value of 3.2625 is an extremely strong signal. The label for "sad" is 1 and the contradiction index is 0, so the value of 4.5919 is also the correct answer.
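
As a quick sanity check (a sketch, reusing the predictions tensor from above and assuming this checkpoint reports the standard MNLI label order), the pipeline's scores can be reproduced by softmaxing the entailment logits across the candidate labels:

import torch.nn.functional as F

# Label order reported by the checkpoint's config (assumption -- verify locally):
print(nlp.model.config.id2label)
# e.g. {0: 'contradiction', 1: 'neutral', 2: 'entailment'}

# Single-label zero-shot scores = softmax over the entailment logit of each hypothesis
entailment_logits = predictions['logits'][:, 2]
print(F.softmax(entailment_logits, dim=0))
# roughly [0.9990, 0.0010], matching the pipeline output above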

Great! Now I should be able to look at the first embedding layer and check out the gradient with respect to the happy entailment scalar:

layer = predictions['encoder_hidden_states'][0]
layer.retain_grad()
predictions['logits'][0][2].backward(retain_graph=True)

Unfortunately, layer.grad is None.
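
One workaround that avoids patching the library is to grab a hidden state while it is still on the forward path, for example with a forward hook on the encoder's token embedding layer, and call retain_grad() there. This is only a sketch: the attribute path nlp.model.model.encoder.embed_tokens is an assumption and may differ across transformers versions.

captured = []

def save_and_retain(module, module_inputs, output):
    # The embedding output sits upstream of the logits, so its grad gets populated
    output.retain_grad()
    captured.append(output)

handle = nlp.model.model.encoder.embed_tokens.register_forward_hook(save_and_retain)
predictions = nlp.model(**inputs, return_dict=True)
handle.remove()

predictions['logits'][0][2].backward()
# Bart shares the token embedding between encoder and decoder, so the hook can fire
# more than once; captured[0] is the encoder pass with shape (batch, seq_len, hidden)
print(captured[0].grad.shape)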

Solution from StackOverflow

I was also very surprised by this issue. Although I have never used the library, I did some debugging and found that the issue comes from the transformers library. The problem is coming from this line:

encoder_states = tuple(hidden_state.transpose(0, 1) for hidden_state in encoder_states)

If you comment it out, you will get the gradient, just with some dimensions transposed. This issue is related to the fact that PyTorch autograd does not handle in-place operations very well, as mentioned here.

So, to recap, the solution is to comment out line 382 in modeling_bart.py.

You will get the gradient with shape T x B x C instead of B x T x C, but you can reshape it as you want later.
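
With that patch applied, getting back to the usual (batch, seq_len, hidden) layout is a single transpose; a minimal sketch, assuming layer and the backward call from the snippet above:

# layer.grad comes back as (seq_len, batch, hidden); swap the first two dims
token_grads = layer.grad.transpose(0, 1)  # (batch, seq_len, hidden)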

patrickvonplaten commented 3 years ago

@joeddav - feel free to ping me again if you're too busy. Leaving it up to you for now :-)

joeddav commented 3 years ago

Hey, thanks for opening the detailed issue. As I mentioned, this is a Bart issue, nothing specific to zero-shot, so I've renamed it to get the right eyes on it.

The problem here is that the hidden states are transposed after they're passed forward in the computation graph (with the exception of the last encoder layer), which means that the hidden states returned are no longer upstream from the logits in the graph and therefore don't have any gradient information. I'm not sure I see a trivial fix though – any ideas @patrickvonplaten? We could just do the transposes inside EncoderLayer.forward instead but would the superfluous transpose ops slow things down?
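
To illustrate the graph shape described above with a self-contained toy example (plain PyTorch, not the transformers code): a transposed copy that branches off the forward path never receives a gradient when you backprop from the downstream output.

import torch

x = torch.randn(3, 4, requires_grad=True)
h = x * 2                       # hidden state that feeds the downstream computation
h_returned = h.transpose(0, 1)  # transposed copy handed back to the caller; a dead-end branch
h_returned.retain_grad()

loss = (h * 3).sum()            # the "logits" depend on h, not on h_returned
loss.backward()

print(h_returned.grad)          # None: nothing downstream of the loss flows through h_returned
print(x.grad is not None)       # True: the path that was actually used still gets gradients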

thoppe commented 3 years ago

At the very least, having an option to return the value before the transpose would allow access to the gradients.