pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

Large DeepLift delta with BERT explicit softmax init #519

Open lannelin opened 3 years ago

lannelin commented 3 years ago

Hi all,

Thanks for all your amazing work on captum!

Upon modifying a huggingface/transformers BERT model to explicitly initialise softmax in __init__, as suggested in https://github.com/pytorch/captum/issues/347#issuecomment-616864035, I see a massive increase in the magnitude of the DeepLift delta (the delta goes from -1.9306 to -12386754.0 on the same inputs).

I appreciate that there are other issues with this model (e.g. hidden activations not being initialised). I'm not sure whether these play a part in the issue. I was hoping to isolate just the softmax in the first instance.

I have created a notebook to demonstrate the issue that uses a fork of the transformers repo. I'm not sure if this is the best way to share/demonstrate. Please let me know if there's a more convenient method. https://colab.research.google.com/drive/1OB4kkTP4I6R9t4XtQFB6braL8cP83nX5?usp=sharing

It's also maybe worth noting that the actual attributions, both before and after this softmax change, are quite misleading (especially in contrast to Integrated Gradients), though that's not entirely unexpected given the other issues mentioned.

Any advice that you could share would be appreciated!

NarineK commented 3 years ago

@lannelin, to be clear, is softmax the only non-linear activation type used in the model? Do you also have ReLUs in the model? How did you perform the softmax change? Would you please point me to that change?

lannelin commented 3 years ago

Hi @NarineK. The change for the softmax is in this commit on my fork https://github.com/lannelin/transformers/commit/1fd1e4a59628a731b24eb3514ef586dc0b075b5f
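
The pattern is the one suggested in #347: instantiate the softmax once in __init__ and call that module in forward, so that DeepLift can attach hooks to it. A toy illustration of the pattern (a sketch only, not the actual transformers code; names and shapes are made up):

    import torch.nn as nn

    class TinySelfAttention(nn.Module):
        """Toy stand-in for BertSelfAttention showing the shape of the change."""

        def __init__(self, dim: int = 8):
            super().__init__()
            self.query = nn.Linear(dim, dim)
            self.key = nn.Linear(dim, dim)
            self.value = nn.Linear(dim, dim)
            # explicit module instead of calling softmax functionally inside forward
            self.softmax = nn.Softmax(dim=-1)

        def forward(self, hidden_states):
            q = self.query(hidden_states)
            k = self.key(hidden_states)
            v = self.value(hidden_states)
            attention_scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
            attention_probs = self.softmax(attention_scores)  # hookable by DeepLift
            return attention_probs @ v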

The model in my example uses a custom GELU function (so not registered with captum), as this let me use a pretrained model. However, I've trained a local imdb model with ReLU and I encounter the same softmax problem.

I've also tried replacing the custom GELU function with nn.GELU and adding it to SUPPORTED_NON_LINEAR with a mapping to nonlinear (I'm not sure if this is the right way to do it?) - again, the problem still occurs for softmax.

edit, to clarify: the softmax is only used to normalise the attention scores, and the GELU/ReLU is used after a dense layer inside the encoder. This occurs at each encoder layer (12 layers in this case), and each encoder layer initialises its own softmax and GELU/ReLU.
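
The SUPPORTED_NON_LINEAR registration mentioned above amounts to roughly the following (SUPPORTED_NON_LINEAR and nonlinear are captum-internal names in deep_lift.py, not a public API, so this is just a sketch of the idea; the fork edits the dict in the file directly, which comes to the same thing):

    import torch.nn as nn
    from captum.attr._core import deep_lift

    # map GELU to the same generic rescale ("nonlinear") rule that captum
    # already uses for ReLU/ELU/LeakyReLU
    deep_lift.SUPPORTED_NON_LINEAR[nn.GELU] = deep_lift.nonlinear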

NarineK commented 3 years ago

@lannelin, I think that we can add GELU to the list of non-linear functions, similar to ELU and LeakyReLU. I printed model_wrapper and I don't see ReLU there. Could you try defining the ReLUs in the constructor of the model with self.relu1 = nn.ReLU(), self.relu2 = nn.ReLU(), ... and using them in the forward function?

lannelin commented 3 years ago

ah sorry, @NarineK , I don't think I explained that very well.

As I couldn't find any publicly available models using ReLU, the model in the colab notebook was using a custom GELU function. I've now updated it (23rd Nov) to use an explicitly instantiated nn.GELU, and I've forked the captum repo to add nn.GELU to SUPPORTED_NON_LINEAR. This should be available at the original link above.

Hopefully that achieves roughly the same ends as your suggestion? I note that the DeepLift delta is affected (reduced) but still extremely large!

Thank you for your time on this!

NarineK commented 3 years ago

Thank you for the clarification, @lannelin! I think the problem lies in the way we do the normalization on this line: https://github.com/pytorch/captum/blob/c5907b53b162a44df3c8c7386b0d192d45163048/captum/attr/_core/deep_lift.py#L965. According to the original paper (https://arxiv.org/pdf/1704.02685.pdf), n is the number of classes, which usually, and in your example, corresponds to the last dimension. In our implementation we normalize over all the elements in the tensor, which doesn't look quite right. We need to make the change accordingly and see if that fixes the issue. Thank you for bringing it up.
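
Roughly, the difference being described (shapes here are made up purely for illustration):

    import torch

    # e.g. (batch, tokens, classes); softmax is applied over the last dimension
    grad = torch.randn(2, 5, 10)

    # what the linked line effectively did: one scalar mean over *all* elements
    norm_all = grad.sum() / grad.numel()

    # what the paper's softmax rule calls for: a mean over the class dimension
    # only, giving one normalizer per softmax application
    n = grad.shape[-1]
    norm_last = grad.sum(dim=-1, keepdim=True) / n

    normalized = grad - norm_last  # broadcasts over the last dimension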

NarineK commented 3 years ago

@lannelin, I tried to modify the piece that computes the normalization. You can try the version below, but based on my experiments it looks like we get a much smaller delta when we do not subtract the norm at all. Let me know if you get a chance to try it.

    # normalizing: average the unnormalized gradient over the softmax (class)
    # dimension only, i.e. the last dimension, as in the paper
    n = grad_input[0].shape[-1]

    norm = grad_input_unnorm.sum(dim=-1) / n
    norm = norm.view(norm.shape + (1,))  # restore the reduced dim for broadcasting

    # updating only the first half
    new_grad_inp[0] = grad_input_unnorm - norm
    return new_grad_inp

lannelin commented 3 years ago

Thanks, @NarineK! That looks good to me. It now calculates the mean along the same axis as the softmax, which I think is what should be happening?

I also get a smaller delta when not subtracting the norm (essentially just using the nonlinear function rather than softmax in deep_lift.py). Still very high though! Puzzling!
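
For reference, the deltas quoted in this thread come from captum's convergence check, e.g. as below; model, inputs and baselines stand for the wrapped BERT model and the embedding-level tensors from the notebook (placeholders here, not shown):

    from captum.attr import DeepLift

    dl = DeepLift(model)
    attributions, delta = dl.attribute(
        inputs, baselines=baselines, target=0, return_convergence_delta=True
    )
    print(delta)  # the summation-to-delta error; ideally close to zero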

NarineK commented 3 years ago

Thank you for the quick reply, @lannelin! For debugging purposes, perhaps we can create a smaller model that has a similar architecture and play with it?

lannelin commented 3 years ago

Good idea, @NarineK. I've trained a model with just one attention layer ("hidden layer" in huggingface config speak) and 12 attention heads on the imdb dataset. It seems to perform well enough for this purpose on the test set (though only 87% accuracy).

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
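
A config like the one printed above can be created roughly as follows (a sketch: the sizes are just the bert-base defaults, and the imdb fine-tuning itself isn't shown):

    from transformers import BertConfig, BertForSequenceClassification

    # one encoder layer, 12 attention heads, 2 output classes, bert-base sizes
    config = BertConfig(
        num_hidden_layers=1,
        num_attention_heads=12,
        hidden_size=768,
        intermediate_size=3072,
        num_labels=2,
    )
    model = BertForSequenceClassification(config)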

Is that the sort of thing you have in mind?

I'll have a play with DeepLift on this model tomorrow and update :) If it looks promising I'll try and push the model somewhere public

lannelin commented 3 years ago

The delta is much lower using the model described above and the normalization method suggested in your comment, @NarineK (https://github.com/pytorch/captum/issues/519#issuecomment-738580948): the delta is -8.5291, where it was previously 1294020.0 with the textattack/bert-base-uncased-imdb model, which has 12 hidden layers. The attributions also make much more sense and have a similar ranking to IntegratedGradients.

I also tried training a model in an identical fashion but with 3 hidden layers, and the delta increases. I suspect that the issue is magnified at each hidden layer that it passes through - probably due to the softmax?

I've made the 1-hidden-layer model available here if that's useful.

NarineK commented 3 years ago

Sorry for the late reply, @lannelin! I just saw your comments. Yes, that's what I meant; just one attention layer is much easier to debug.

Ideally summation-to-delta should hold, as described in the paper: https://arxiv.org/pdf/1704.02685.pdf. When we move from layer to layer during back-propagation the contributions get distributed across different neurons, but I'd expect the total contribution score to be preserved. It could be that, because of the normalization, we are losing precision in the contribution scores, and the more layers we have, the greater the loss. If you use the nonlinear rule instead of the softmax rule, do you see a delta close to zero?
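
For reference, summation-to-delta from the paper states that the contribution scores of the inputs add up to the difference in the target output between the input and the baseline:

    \sum_{i=1}^{n} C_{\Delta x_i \Delta t} = \Delta t, \qquad \Delta t = f(x) - f(x^{0})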

I'll download your model and try it myself. This is an interesting case.

NarineK commented 3 years ago

@lannelin, I tried to reproduce it on a smaller example:

import torch
import torch.nn as nn

from captum.attr import DeepLift


class ReLUDeepLiftModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x1, x2, x3=2):
        return self.softmax(2 * self.relu1(x1) + x3 * self.relu2(x2 - 1.5))


x1 = torch.tensor([[1.0, 1.0, 0.0]], requires_grad=True)
x2 = torch.tensor([[2.0, 2.0, -1.0]], requires_grad=True)

b1 = torch.tensor([[0.0, 0.0, 0.0]], requires_grad=True)
b2 = torch.tensor([[0.0, 0.0, 0.0]], requires_grad=True)

inputs = (x1, x2)
baselines = (b1, b2)

model = ReLUDeepLiftModel()

dl = DeepLift(model)
attr = dl.attribute(inputs=inputs, baselines=baselines, target=0)

# summation-to-delta check: these two quantities should match
model(x1, x2)[:, 0] - model(b1, b2)[:, 0], attr[0].sum() + attr[1].sum()

Summation-to-delta seems to hold only if I use the nonlinear rule for softmax.
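
(Using the nonlinear rule for softmax here means mapping nn.Softmax to the generic rescale rule, e.g. the sketch below; again, these are captum-internal names, and editing SUPPORTED_NON_LINEAR in a local copy of deep_lift.py is equivalent.)

    import torch.nn as nn
    from captum.attr._core import deep_lift

    # treat softmax with the generic rescale rule instead of the dedicated softmax rule
    deep_lift.SUPPORTED_NON_LINEAR[nn.Softmax] = deep_lift.nonlinear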

lannelin commented 3 years ago

Ah, interesting! Thanks! I'm a bit short on time right now but I'll try and play with this in the new year. Happy holidays :)