hila-chefer / Transformer-Explainability

[CVPR 2021] Official PyTorch implementation of Transformer Interpretability Beyond Attention Visualization, a novel method to visualize classifications made by Transformer-based networks.

Load locally fine-tuned models #25

Closed: lucasresck closed this issue 2 years ago

lucasresck commented 3 years ago

Hello there,

Your work is very nice, thanks for sharing the code 😊

I have been testing your implementation, but I could not make it work with my locally fine-tuned models. I was wondering if anyone has an idea or insight that could help me make it work.

The problem

When I load a locally fine-tuned BERT in the BERT explainability example notebook (just replacing the name of the community model with the local path of my model), the explanations.generate_LRP method raises the following error:

RuntimeError: cannot register a hook on a tensor that doesn't require gradient

I have tested with my local versions of bert-base-uncased and neuralmind/bert-base-portuguese-cased, fine-tuned for multiclass classification.

Complete error output

Running

```
expl = explanations.generate_LRP(input_ids=input_ids, attention_mask=attention_mask, start_layer=0)[0]
```

generates the following output:

```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
in ()
      9
     10 # generate an explanation for the input
---> 11 expl = explanations.generate_LRP(input_ids=input_ids, attention_mask=attention_mask, start_layer=0)[0]
     12 # normalize scores
     13 expl = (expl - expl.min()) / (expl.max() - expl.min())

16 frames

/content/Transformer-Explainability/BERT_explainability/modules/BERT/ExplanationGenerator.py in generate_LRP(self, input_ids, attention_mask, index, start_layer)
     28     def generate_LRP(self, input_ids, attention_mask,
     29                      index=None, start_layer=11):
---> 30         output = self.model(input_ids=input_ids, attention_mask=attention_mask)[0]
     31         kwargs = {"alpha": 1}
     32

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BertForSequenceClassification.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
     52             output_attentions=output_attentions,
     53             output_hidden_states=output_hidden_states,
---> 54             return_dict=return_dict,
     55         )
     56

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
    628             output_attentions=output_attentions,
    629             output_hidden_states=output_hidden_states,
--> 630             return_dict=return_dict,
    631         )
    632         sequence_output = encoder_outputs[0]

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
    131                     hidden_states,
    132                     attention_mask,
--> 133                     layer_head_mask,
    134                 )
    135             else:

/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py in checkpoint(function, *args, **kwargs)
    161         raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))
    162
--> 163     return CheckpointFunction.apply(function, preserve, *args)
    164
    165

/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py in forward(ctx, run_function, preserve_rng_state, *args)
     72         ctx.save_for_backward(*args)
     73         with torch.no_grad():
---> 74             outputs = run_function(*args)
     75         return outputs
     76

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in custom_forward(*inputs)
    123         def create_custom_forward(module):
    124             def custom_forward(*inputs):
--> 125                 return module(*inputs, output_attentions)
    126
    127             return custom_forward

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, output_attentions)
    507             attention_mask,
    508             head_mask,
--> 509             output_attentions=output_attentions,
    510         )
    511         attention_output = self_attention_outputs[0]

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions)
    232             encoder_hidden_states,
    233             encoder_attention_mask,
--> 234             output_attentions,
    235         )
    236         attention_output = self.output(self_outputs[0], h2)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions)
    346
    347         self.save_attn(attention_probs)
--> 348         attention_probs.register_hook(self.save_attn_gradients)
    349
    350         # This is actually dropping out entire tokens to attend to, which might

/usr/local/lib/python3.7/dist-packages/torch/tensor.py in register_hook(self, hook)
    255             return handle_torch_function(Tensor.register_hook, relevant_args, self, hook)
    256         if not self.requires_grad:
--> 257             raise RuntimeError("cannot register a hook on a tensor that "
    258                                "doesn't require gradient")
    259         if self._backward_hooks is None:

RuntimeError: cannot register a hook on a tensor that doesn't require gradient
```

Steps to reproduce the problem

In the BERT explainability example notebook, change the name of the model to the path of a locally fine-tuned model uploaded to Google Drive.
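
Concretely, the only change is the argument of from_pretrained; something along these lines (the Drive path is a placeholder, and the imports are the ones the notebook already uses, if I read them correctly):

```
from transformers import AutoTokenizer

from BERT_explainability.modules.BERT.BertForSequenceClassification import BertForSequenceClassification
from BERT_explainability.modules.BERT.ExplanationGenerator import Generator

# Placeholder path to the fine-tuned model on mounted Google Drive.
model_path = "/content/drive/MyDrive/my-finetuned-bert"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path).to("cuda")
model.eval()

# The rest of the notebook is unchanged: the explanation generator wraps the model.
explanations = Generator(model)
```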

More comments

I am able to load the models directly from Hugging Face, and I can even load the downloaded raw weights of bert-base-portuguese-cased without fine-tuning. It could be a problem with my models' fine-tuning process, but I used standard training scripts and the models have been working so far. Could it be an incompatibility of library versions? (I include a quick version check after the configs below.) Anyway, I will leave the config.json of those models here:

Locally fine-tuned bert-base-uncased
```
{
  "_name_or_path": "bert-base-uncased",
  "architectures": ["BertForSequenceClassification"],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": true,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {"0": "LABEL_0", "1": "LABEL_1", "2": "LABEL_2", "3": "LABEL_3", "4": "LABEL_4"},
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4},
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.8.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
```
Locally fine-tuned neuralmind/bert-base-portuguese-cased
```
{
  "_name_or_path": "neuralmind/bert-base-portuguese-cased",
  "architectures": ["BertForSequenceClassification"],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "gradient_checkpointing": true,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {"0": "LABEL_0", "1": "LABEL_1", "2": "LABEL_2", "3": "LABEL_3", "4": "LABEL_4", "5": "LABEL_5", "6": "LABEL_6", "7": "LABEL_7", "8": "LABEL_8", "9": "LABEL_9"},
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4, "LABEL_5": 5, "LABEL_6": 6, "LABEL_7": 7, "LABEL_8": 8, "LABEL_9": 9},
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.8.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 29794
}
```
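
For reference, this is how I compare the library versions between the Colab environment and my local machine:

```
import torch
import transformers

# Compare these between the Colab runtime and the local machine.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```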
hila-chefer commented 3 years ago

Hi @lucasresck, thanks for your interest in our work, and for the detailed description of your issue! Our method is heavily based on the gradients of the attention maps in the model. It seems that in your case gradients are not being calculated, so the backward hooks for the gradients aren't working. Have you been able to use the model for inference after loading the weights (without generating an explanation)?

lucasresck commented 3 years ago

Hi, @hila-chefer, thank you for your response 😊

I'm able to use the model for inference, but only with the BertForSequenceClassification provided by transformers. Using the one from BERT_explainability.modules.BERT results in the following error:

Error output
```
/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
in ()
      1 with torch.no_grad():
----> 2     print(model(input_ids, attention_mask=attention_mask))

15 frames

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BertForSequenceClassification.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict)
     52             output_attentions=output_attentions,
     53             output_hidden_states=output_hidden_states,
---> 54             return_dict=return_dict,
     55         )
     56

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
    628             output_attentions=output_attentions,
    629             output_hidden_states=output_hidden_states,
--> 630             return_dict=return_dict,
    631         )
    632         sequence_output = encoder_outputs[0]

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
    131                     hidden_states,
    132                     attention_mask,
--> 133                     layer_head_mask,
    134                 )
    135             else:

/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py in checkpoint(function, *args, **kwargs)
    161         raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))
    162
--> 163     return CheckpointFunction.apply(function, preserve, *args)
    164
    165

/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py in forward(ctx, run_function, preserve_rng_state, *args)
     72         ctx.save_for_backward(*args)
     73         with torch.no_grad():
---> 74             outputs = run_function(*args)
     75         return outputs
     76

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in custom_forward(*inputs)
    123         def create_custom_forward(module):
    124             def custom_forward(*inputs):
--> 125                 return module(*inputs, output_attentions)
    126
    127             return custom_forward

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, output_attentions)
    507             attention_mask,
    508             head_mask,
--> 509             output_attentions=output_attentions,
    510         )
    511         attention_output = self_attention_outputs[0]

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions)
    232             encoder_hidden_states,
    233             encoder_attention_mask,
--> 234             output_attentions,
    235         )
    236         attention_output = self.output(self_outputs[0], h2)

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

/content/Transformer-Explainability/BERT_explainability/modules/BERT/BERT.py in forward(self, hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions)
    346
    347         self.save_attn(attention_probs)
--> 348         attention_probs.register_hook(self.save_attn_gradients)
    349
    350         # This is actually dropping out entire tokens to attend to, which might

/usr/local/lib/python3.7/dist-packages/torch/tensor.py in register_hook(self, hook)
    255             return handle_torch_function(Tensor.register_hook, relevant_args, self, hook)
    256         if not self.requires_grad:
--> 257             raise RuntimeError("cannot register a hook on a tensor that "
    258                                "doesn't require gradient")
    259         if self._backward_hooks is None:

RuntimeError: cannot register a hook on a tensor that doesn't require gradient
```

There's another thing: when I infer logits from my model in your Colab notebook, using transformers' version of BertForSequenceClassification, I receive the following output:

```
with torch.no_grad():
    print(model(input_ids, attention_mask=attention_mask))
(tensor([[-0.6761, -0.1082,  0.8160,  1.1516, -0.2513, -0.1993, -0.1064, -0.0098,
         -0.3597,  0.6655]], device='cuda:0'),)
/usr/local/lib/python3.7/dist-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
```

However, this gradient warning does not appear when I use the model on my local machine:

```
input_ids = tokenizer('Testando', return_tensors='pt')['input_ids'].to(device)
attention_mask = tokenizer('Testando', return_tensors='pt')['attention_mask'].to(device)

model.eval()
with torch.no_grad():
    print(model(input_ids, attention_mask=attention_mask))
SequenceClassifierOutput(loss=None, logits=tensor([[-0.6761, -0.1082,  0.8160,  1.1516, -0.2513, -0.1993, -0.1064, -0.0098,
         -0.3597,  0.6655]], device='cuda:0'), hidden_states=None, attentions=None)
```

My local machine runs different versions of the libraries; I'm not sure whether that matters, and I'm also not sure whether this warning means something.
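
The warning itself comes from torch/utils/checkpoint.py, and both configs I posted above have "gradient_checkpointing": true, so one untested idea is to load the model with that flag turned off and see whether anything changes. A rough sketch (the path is a placeholder, and I am only guessing that the flag is related):

```
from transformers import BertConfig

from BERT_explainability.modules.BERT.BertForSequenceClassification import BertForSequenceClassification

model_path = "/content/drive/MyDrive/my-finetuned-bert"  # placeholder path

# Override the flag stored in config.json at load time (only a guess that it matters).
config = BertConfig.from_pretrained(model_path, gradient_checkpointing=False)
model = BertForSequenceClassification.from_pretrained(model_path, config=config)
```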

I was wondering, have you been able to load local models successfully?

hila-chefer commented 2 years ago

Hi @lucasresck, apologies for the delay in response. Was this issue resolved? It seems that, for some reason, your model does not propagate gradients. Have you checked whether the input requires gradients?

Edit: also, there are hooks that I've inserted in the model layers, so try to avoid wrapping the forward pass in torch.no_grad.
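
For example, a quick sanity check along these lines (a rough sketch; the loading code is omitted) shows whether gradients are being tracked at all:

```
model.eval()  # eval mode is fine, it does not disable autograd

# No torch.no_grad() here, so autograd should track the forward pass.
logits = model(input_ids, attention_mask=attention_mask)[0]

# If this prints False, gradients are not flowing through the model,
# and the backward hooks on the attention maps cannot work.
print(logits.requires_grad)
```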

lucasresck commented 2 years ago

Hi, @hila-chefer, I'm sorry, I still couldn't solve the issue.

Could it be a problem with my local models? My fine-tuning process is based on this tutorial; do you have any idea why it wouldn't work with your implementation?

Also, could you reproduce the problem?

About the input: neither input_ids nor attention_mask in the BERT explainability example notebook requires gradients:

```
input_ids, attention_mask
(tensor([[ 101, 2023, 3185, 2001, 1996, 2190, 3185, 1045, 2031, 2412, 2464,  999,
          2070, 5019, 2020, 9951, 1010, 2021, 3772, 2001, 2307, 1012,  102]],
        device='cuda:0'),
 tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
        device='cuda:0'))
```
hila-chefer commented 2 years ago

Hi @lucasresck, sorry to hear the problem persists :( Could you please share the weights you're loading (or any other public model you're experiencing issues with)? I'm unable to reproduce this issue; the notebook runs fine for me. Thanks.

lucasresck commented 2 years ago

Hi, @hila-chefer, I've uploaded one of my models to Hugging Face; it's available here.

Thanks in advance :)

hila-chefer commented 2 years ago

Hi @lucasresck, I was able to reproduce the error. Maybe it's an issue with how you saved the weights? Did you encounter this issue with other models too?

lucasresck commented 2 years ago

Hi there,

The models were saved using the save_pretrained method from Transformers:

```
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

I also found this issue with the other two models I detailed in my initial comment.
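
For completeness, reloading from that directory with the stock transformers classes works fine on my side; roughly (output_dir as above):

```
from transformers import BertForSequenceClassification, BertTokenizer

# output_dir is the same directory that was passed to save_pretrained() above.
model = BertForSequenceClassification.from_pretrained(output_dir)
tokenizer = BertTokenizer.from_pretrained(output_dir)
```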

hila-chefer commented 2 years ago

Hi @lucasresck, this is how I saved the weights in our experiment with BERT. Can you please check whether the issue still occurs when you use the same code with your model?

hila-chefer commented 2 years ago

Closing due to inactivity; please re-open if necessary.