Open lannelin opened 3 years ago
@lannelin, to be clear, is softmax the only non-linear activation used in the model? Does the model also use ReLUs? How did you make the softmax change? Could you please point me to that change?
Hi @NarineK. The change for the softmax is in this commit on my fork https://github.com/lannelin/transformers/commit/1fd1e4a59628a731b24eb3514ef586dc0b075b5f
The model in my example uses a custom GELU function (so not registered with captum) as this let me use a pretrained model. However, I've trained a local imdb model with ReLU and encounter the same softmax problem.
I've also tried replacing the custom GELU function with nn.GELU and adding this to SUPPORTED_NON_LINEAR with a mapping to nonlinear. I'm not sure if this would be the right way to do this? Again, the problem still occurs for softmax.
edit, to clarify: the softmax is only used to normalise the attention scores and the GELU/ReLU is used after a dense layer (inside encoder). This occurs at each encoder layer (12 layers in this case). Each encoder layer initialises its own softmax and GELU/ReLU.
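The layout described above can be sketched as follows. This is a minimal illustration, not the huggingface implementation: TinyEncoderLayer, its single attention head, and its layer sizes are all hypothetical. The point it shows is that the softmax and the GELU are created as modules in the constructor, so that a hook-based method can see them, instead of being called as plain functions inside forward.

```python
import math

import torch
import torch.nn as nn


class TinyEncoderLayer(nn.Module):
    """Hypothetical single-head encoder layer, for illustration only."""

    def __init__(self, hidden: int = 8, intermediate: int = 16) -> None:
        super().__init__()
        self.query = nn.Linear(hidden, hidden)
        self.key = nn.Linear(hidden, hidden)
        self.value = nn.Linear(hidden, hidden)
        # Explicit module instances rather than functional calls in forward().
        self.softmax = nn.Softmax(dim=-1)
        self.dense = nn.Linear(hidden, intermediate)
        self.act = nn.GELU()
        self.out = nn.Linear(intermediate, hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.query(x), self.key(x), self.value(x)
        # The softmax only normalises the attention scores.
        scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
        probs = self.softmax(scores)
        context = probs @ v
        # The GELU follows a dense layer.
        return self.out(self.act(self.dense(context)))


layer = TinyEncoderLayer()
hidden_states = layer(torch.randn(2, 5, 8))  # (batch, tokens, hidden)
```

Stacking several such layers gives each one its own softmax and activation instance, which matches the per-layer setup described in the comment.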
@lannelin, I think that we can add GELU to the list of non-linear functions, similar to ELU and LeakyReLU.
I printed model_wrapper and I don't see ReLU there. Could you try to define the ReLUs in the constructor of the model with self.relu1 = nn.ReLU(), self.relu2 = nn.ReLU(), ... and use them in the forward function?
ah sorry, @NarineK , I don't think I explained that very well.
As I couldn't find any publicly available models using ReLU, the model in the colab notebook was using a custom gelu function.
I've now updated that (23rd Nov) to have an explicitly instantiated nn.GELU and have forked the captum repo to add nn.GELU to SUPPORTED_NON_LINEAR. This should be available on the original link, above.
Hopefully that achieves roughly the same ends as your suggestion? I note that the DeepLift delta is affected (reduced) but still extremely large!
Thank you for your time on this!
Thank you for the clarification, @lannelin! I think the problem lies in the way we do the normalization (see https://github.com/pytorch/captum/blob/c5907b53b162a44df3c8c7386b0d192d45163048/captum/attr/_core/deep_lift.py#L965 ). According to the original paper ( https://arxiv.org/pdf/1704.02685.pdf ), n is the number of classes, which usually, and in your example, is the size of the last dimension. In our implementation we took n to be the number of all elements in the tensor, which doesn't look quite right. We need to change this accordingly and see if that fixes the issue. Thank you for bringing it up.
@lannelin, I tried to modify the piece that computes the normalization. You can try this version, but based on my experiments it looks like we get a much smaller delta when we do not subtract the norm. Let me know if you get to try it.
# normalizing
n = grad_input[0].shape[-1]
norm = grad_input_unnorm.sum(axis = -1) * 1 / n
norm = norm.view(norm.shape + (1,))
# updating only the first half
new_grad_inp[0] = grad_input_unnorm - norm
return new_grad_inp
Thanks, @NarineK! That looks good to me. It now calculates the mean along the same axis as the softmax, which I think is what should be happening?
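The difference between the two normalizations can be sketched without any framework. This is only an illustration on nested lists with made-up numbers; the function names are hypothetical and this is not the captum code. One function subtracts a single mean taken over every element, the other subtracts a separate mean per row, i.e. along the last axis, which is the axis the softmax runs over.

```python
def subtract_global_mean(rows):
    """Subtract one mean computed over all elements (n = total element count)."""
    flat = [v for row in rows for v in row]
    mean = sum(flat) / len(flat)
    return [[v - mean for v in row] for row in rows]


def subtract_last_axis_mean(rows):
    """Subtract a per-row mean (n = size of the last dimension)."""
    return [[v - sum(row) / len(row) for v in row] for row in rows]


# Two rows of made-up multipliers, e.g. two attention rows.
rows = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]

global_norm = subtract_global_mean(rows)
row_norm = subtract_last_axis_mean(rows)

# Per-row sums after each normalization.
global_row_sums = [sum(r) for r in global_norm]
row_row_sums = [sum(r) for r in row_norm]
```

With the per-row version every row sums to zero on its own; with the global version the rows leak into each other, so a row with large values shifts the result for every other row.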
I also get a smaller delta when not subtracting the norm (essentially just using the nonlinear function rather than softmax in deep_lift.py). Still very high though! Puzzling!
Thank you for the quick reply, @lannelin! For debugging purposes, perhaps we can create a smaller model that has a similar architecture and play with it?
Good idea, @NarineK. I've trained a model with just one attention layer ("hidden layer" in huggingface config speak) with 12 attention heads on the imdb dataset. Seems to perform well enough for this purpose on the test set (though only 87% acc).
BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
)
Is that the sort of thing you have in mind?
I'll have a play with DeepLift on this model tomorrow and update :) If it looks promising I'll try and push the model somewhere public
The delta is much lower using the model described above and the normalization method suggested in your comment @NarineK (https://github.com/pytorch/captum/issues/519#issuecomment-738580948); delta is -8.5291 and was previously 1294020. with the textattack/bert-base-uncased-imdb model, which has 12 hidden layers.
The attributions also make much more sense and have a similar ranking to IntegratedGradients.
I also tried training a model in identical fashion but with 3 hidden layers, and the delta increases. I suspect that the issue is magnified at each hidden layer that it passes through, probably due to the softmax?
I've made the 1-hidden-layer model available here if that's useful.
Sorry for the late reply, @lannelin! I just saw your comments. Yes, that's what I meant: just one attention layer is much easier to debug. ...... Ideally summation-to-delta should hold, as also described in the paper: https://arxiv.org/pdf/1704.02685.pdf When we move from layer to layer during backpropagation the contributions are distributed across different neurons, but I'd expect the total contribution score to be preserved. It could be that, because of the normalization, we are losing precision in the contribution scores, and the more layers we have, the higher the loss. If you use the nonlinear rule instead of the softmax rule, do you see a delta close to zero?
I'll download your model and try it myself. This is an interesting case.
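For reference, summation-to-delta on a single ReLU neuron can be sketched in plain Python. This is a simplified illustration of the rescale-style rule from the paper, not the captum implementation: the function name, the weights, and the inputs are made up, and the sign convention chosen for the delta here is an assumption.

```python
def relu(z: float) -> float:
    return max(0.0, z)


def rescale_attributions(weights, bias, x, baseline, eps=1e-10):
    """Attributions for y = relu(w . x + b) using a rescale-style multiplier."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    z0 = sum(w * bi for w, bi in zip(weights, baseline)) + bias
    delta_in = z - z0
    delta_out = relu(z) - relu(z0)
    if abs(delta_in) > eps:
        # Ratio of output change to input change replaces the gradient.
        multiplier = delta_out / delta_in
    else:
        # Fall back to the ordinary ReLU gradient when the input barely moves.
        multiplier = 1.0 if z > 0 else 0.0
    attributions = [w * (xi - bi) * multiplier
                    for w, xi, bi in zip(weights, x, baseline)]
    # Gap between the attribution total and the actual output change.
    delta = sum(attributions) - delta_out
    return attributions, delta_out, delta


attributions, delta_out, delta = rescale_attributions(
    weights=[2.0, -1.0], bias=0.5, x=[1.0, 3.0], baseline=[0.0, 0.0]
)
```

Here the attributions add up to the change in the output, so the gap is zero. A large gap on a real model means that total is not being preserved somewhere along the backward pass.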
@lannelin, I tried to reproduce it on a smaller example:
import torch
import torch.nn as nn
from captum.attr import DeepLift

class ReLUDeepLiftModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.relu1 = nn.ReLU()
        self.relu2 = nn.ReLU()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x1, x2, x3=2):
        return self.softmax(2 * self.relu1(x1) + x3 * self.relu2(x2 - 1.5))

x1 = torch.tensor([[1.0, 1.0, 0.0]], requires_grad=True)
x2 = torch.tensor([[2.0, 2.0, -1.0]], requires_grad=True)
b1 = torch.tensor([[0.0, 0.0, 0.0]], requires_grad=True)
b2 = torch.tensor([[0.0, 0.0, 0.0]], requires_grad=True)
inputs = (x1, x2)
baselines = (b1, b2)

model = ReLUDeepLiftModel()
dl = DeepLift(model)
attr = dl.attribute(inputs=inputs, baselines=baselines, target=0)

# Summation-to-delta check: the two values should match.
model(x1, x2)[:, 0] - model(b1, b2)[:, 0], attr[0].sum() + attr[1].sum()
Summation-to-delta seems to hold only if I use the nonlinear rule for softmax.
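One way to see why subtracting the mean can break summation-to-delta is the plain-Python sketch below. It is a simplified model of the two rules, not the captum code, and it rests on several assumptions: the softmax is treated elementwise, the upstream gradient is one-hot at the target class, every input differs from its baseline, and the logits are made-up values.

```python
import math


def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rescale_multipliers(x, baseline, target):
    """Elementwise delta_out / delta_in, scaled by a one-hot upstream gradient."""
    out_x, out_b = softmax(x), softmax(baseline)
    mults = []
    for i in range(len(x)):
        upstream = 1.0 if i == target else 0.0
        mults.append(upstream * (out_x[i] - out_b[i]) / (x[i] - baseline[i]))
    return mults


def total_contribution(mults, x, baseline):
    return sum(m * (xi - bi) for m, xi, bi in zip(mults, x, baseline))


logits, base, target = [3.0, 3.0, 1.0], [0.0, 0.0, 0.0], 0
delta_out = softmax(logits)[target] - softmax(base)[target]

# "nonlinear"-style rule: use the multipliers as they are.
plain = rescale_multipliers(logits, base, target)
# "softmax"-style rule: subtract the mean multiplier along the class axis.
mean = sum(plain) / len(plain)
normalized = [m - mean for m in plain]

gap_plain = total_contribution(plain, logits, base) - delta_out
gap_normalized = total_contribution(normalized, logits, base) - delta_out
```

In this sketch the plain multipliers reproduce the output change exactly, while the mean-subtracted ones are off by the mean multiplier times the summed input differences. That term vanishes only when the input differences sum to zero.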
Ah, interesting! Thanks! I'm a bit short on time right now but I'll try and play with this in the new year. Happy holidays :)
Hi all,
Thanks for all your amazing work on captum!
Upon modifying a huggingface/transformers BERT model to explicitly initialise softmax in __init__, as per the suggestion in https://github.com/pytorch/captum/issues/347#issuecomment-616864035 , I see a massive increase in the magnitude of the DeepLift delta (delta goes from -1.9306 to -12386754. on the same inputs).
I appreciate that there are other issues with this model (e.g. hidden activations not being initialised). I'm not sure whether these play a part in the issue. I was hoping to isolate just the softmax in the first instance.
I have created a notebook to demonstrate the issue that uses a fork of the transformers repo. I'm not sure if this is the best way to share/demonstrate. Please let me know if there's a more convenient method. https://colab.research.google.com/drive/1OB4kkTP4I6R9t4XtQFB6braL8cP83nX5?usp=sharing
It's also maybe worth noting that the actual attributions, both before and after this softmax change, are quite misleading (especially in contrast to Integrated Gradients). Though not entirely unexpected given other issues mentioned.
Any advice that you could share would be appreciated!