Hi @heytitle, this is very interesting! Thank you for bringing it up. The IMDB classification task is binary classification, not multi-class (assuming multi-class means more than two classes).
We can definitely also try the attribution in sigmoid space, but we will get similar results. At least, I'm seeing similar results.
@heytitle I have the same concern. Why do these methods in Captum use the logits instead of the probabilities after softmax?
In my view, these methods apply to arbitrary functions (before or after softmax). Is there some insight behind this choice?
@wangyongjie-ntu, that is an excellent question. Usually, when we compute attributions using a softmax or sigmoid layer as the output layer, we obtain smaller attribution magnitudes.
Specifically, the authors of the DeepLIFT paper recommend computing the attributions for the logit layer: https://arxiv.org/pdf/1704.02685.pdf (Section 3.6, "Choice of Target Layer", discusses this case). There the authors make the case that if we use the sigmoid, a higher logit doesn't translate into a higher attribution score.
In the case of IG, I've noticed in the tutorials above that the scores are comparable.
If you see significant differences between the attribution scores with and without the logits, it would be interesting to look deeper into the semantics of those features and understand whether those contributions make sense.
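For anyone who wants to compare the two, here is a minimal sketch with Captum (`model`, `inputs`, and `target` are placeholders, not code from the tutorials):

```python
import torch
from captum.attr import IntegratedGradients

# Assumed setup: `model` returns raw logits of shape (batch, num_classes),
# `inputs` is a batch of input tensors, `target` is a class index.

# Attributions w.r.t. the logit of the target class
ig_logits = IntegratedGradients(model)
attr_logits = ig_logits.attribute(inputs, target=target)

# Attributions w.r.t. the softmax probability of the target class
def prob_forward(x):
    return torch.softmax(model(x), dim=1)

ig_probs = IntegratedGradients(prob_forward)
attr_probs = ig_probs.attribute(inputs, target=target)

# Probability-space attributions are usually smaller in magnitude,
# since probabilities are bounded in [0, 1] while logits are not.
print(attr_logits.abs().sum().item(), attr_probs.abs().sum().item())
```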
@NarineK. Smaller attributions are to be expected, because the difference in probabilities is smaller than the difference in logits (completeness still holds).
However, when Captum's DeepLift uses the probability as the function F, completeness seems to be violated. Here is an example analyzing feature importance on a naive multi-layer perceptron; with logits it is OK.
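As a rough sketch of the check I mean (with placeholder `model`, `inputs`, and `baselines`), the convergence delta that Captum reports measures exactly the gap between the attribution sum and F(input) − F(baseline):

```python
from captum.attr import DeepLift

# Assumed: `model` is the MLP (returning either logits or probabilities),
# `inputs` and `baselines` are tensors of the same shape, target class 0.
dl = DeepLift(model)
attributions, delta = dl.attribute(
    inputs, baselines=baselines, target=0, return_convergence_delta=True
)

# Completeness: attributions should sum to F(inputs) - F(baselines).
# A large |delta| means the property is (numerically) violated.
print(attributions.sum(dim=1), delta)
```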
@wangyongjie-ntu, is the last layer a sigmoid or a softmax?
@NarineK Sorry, I should have described it in more detail. The task is from https://www.kaggle.com/iabhishekofficial/mobile-price-classification, with four classes, and I just built a naive MLP on it. The last layer is a softmax for the cross-entropy loss. DeepLIFT works well on the logit function but fails on the probability function after softmax.
Hi @wangyongjie-ntu, we apply rules to both softmax and sigmoid, but we are able to apply them only if the layers are defined through torch.nn in the module's init, e.g. self.softmax = torch.nn.Softmax().
If we define them through functionals, we will not be able to hook them. The same applies to ReLUs and other non-linearities.
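For illustration, the difference looks roughly like this (a hypothetical toy module, not your actual model):

```python
import torch.nn as nn
import torch.nn.functional as F

class Hookable(nn.Module):
    # Softmax defined as an nn module in __init__; DeepLift can hook it.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(20, 4)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        return self.softmax(self.fc(x))

class NotHookable(nn.Module):
    # Functional softmax in forward; DeepLift cannot override its gradient.
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(20, 4)

    def forward(self, x):
        return F.softmax(self.fc(x), dim=1)
```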
@NarineK Do you mean this kind of code?
```python
import torch
import torch.nn as nn

class Mobile(nn.Module):
    def __init__(self):
        super(Mobile, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(20, 16),
            nn.ReLU(),
            nn.Linear(16, 12),
            nn.ReLU(),
            nn.Linear(12, 4)
        )
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        fc = self.model(x)
        # subtract the per-sample max logit before the softmax
        max_factor, _ = torch.max(fc, dim=1)
        max_factor = max_factor.expand(4, len(max_factor)).t()
        normed_fc = fc - max_factor
        prob = self.softmax(normed_fc)
        # small offset to keep probabilities strictly positive
        prob = prob + 10e-10
        return prob
```
@wangyongjie-ntu, yes, that's what I meant by defining self.softmax in the constructor. I see that you also take torch.max. The DeepLift implementation overrides the gradients for some non-linearities, but not for torch.max. I'm a bit concerned about the torch.max because its gradients won't be overridden. We override MaxPools, but not torch.max, because torch.max isn't seen as a layer.
@NarineK I use the probability in the class above, but DeepLIFT still computes the feature importance w.r.t. the logits.
@wangyongjie-ntu, how do you define the DeepLift constructor in that case? The original question was that it works with logits but doesn't with softmax, right?
@NarineK Thanks very much for your patient reply.
Why do I want to use DeepLIFT on probabilities? I proposed a draft method that operates on probabilities and selected IG, DeepLIFT, etc. as baselines. To compare performance fairly, I want to compute the importance scores on the probability function.
I just removed the torch.max operation from the snippet above and fed fc directly into torch.nn.Softmax. DeepLIFT now works.
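For reference, the revised module looks roughly like this (a sketch; PyTorch's nn.Softmax is numerically stable on its own, so the manual max subtraction isn't needed):

```python
import torch.nn as nn

class Mobile(nn.Module):
    def __init__(self):
        super(Mobile, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(20, 16), nn.ReLU(),
            nn.Linear(16, 12), nn.ReLU(),
            nn.Linear(12, 4)
        )
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # feed the logits directly into nn.Softmax; no torch.max needed
        return self.softmax(self.model(x))
```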
Hi @wangyongjie-ntu, thank you for the explanation! Nice, feeding fc directly into torch.nn.Softmax sounds like a good idea. Glad that DeepLift worked.
@wangyongjie-ntu, can we close this issue, or are you still working on it?
@NarineK Thanks. We can close.
According to a recommendation from the authors of Integrated Gradients, we should use the softmax output for Integrated Gradients.
However, in the IMDB TorchText Interpret tutorial we use the logit value, i.e. we pass `model` as the forward function. Should we update the tutorial to be consistent with that recommendation? What do you think?
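If we did switch, one option would be a thin wrapper over the model's forward, something like this (a sketch only, not the tutorial's actual code; it assumes the model returns a single logit per example):

```python
import torch
from captum.attr import IntegratedGradients

# Hypothetical wrapper: attribute w.r.t. the predicted probability
# instead of the raw logit that `model` returns.
def prob_forward(inputs):
    # IMDB is binary, so a sigmoid turns the logit into a probability;
    # for multi-class models use torch.softmax(model(inputs), dim=1).
    return torch.sigmoid(model(inputs))

ig = IntegratedGradients(prob_forward)
```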