pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

Comment on IG tutorial on text #407

Closed p16i closed 4 years ago

p16i commented 4 years ago

The authors of Integrated Gradients recommend computing attributions from the softmax output:

For multi-class classification models, the prediction head is typically a softmax operator on a 'logits' tensor. The attribution must be computed from this softmax output and not the 'logits' tensor. — See [1]'s Identifying the output tensor.

However, in the IMDB TorchText Interpret tutorial, we use the logit values, i.e. we pass `model` itself as the forward function.

Should we update the tutorial to be consistent with this recommendation? What do you think?
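For illustration, the difference boils down to which forward function the attributions are computed against. The sketch below uses a plain Riemann-sum approximation of IG (not Captum's implementation) and a hypothetical stand-in classifier, to show how attributing w.r.t. the raw logits vs. the softmax probabilities is just a matter of wrapping the forward function:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the tutorial's classifier: returns raw logits.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
model.eval()

def integrated_gradients(forward, x, baseline, target, steps=64):
    """Riemann-sum sketch of IG along the straight-line path from baseline to x."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1)
    path = baseline + alphas * (x - baseline)        # (steps, batch, features)
    path.requires_grad_(True)
    out = forward(path.reshape(-1, x.shape[1]))[:, target].sum()
    grads = torch.autograd.grad(out, path)[0]        # gradients along the path
    return (x - baseline) * grads.mean(dim=0)        # average gradient * delta

x = torch.randn(4, 8)
baseline = torch.zeros_like(x)

# Attribution w.r.t. raw logits vs. w.r.t. softmax probabilities:
attr_logits = integrated_gradients(lambda t: model(t), x, baseline, target=0)
attr_probs = integrated_gradients(lambda t: torch.softmax(model(t), dim=1),
                                  x, baseline, target=0)
print(attr_logits.shape, attr_probs.shape)  # both are (4, 8)
```

With Captum, the same switch is achieved by passing a wrapper function (e.g. `lambda t: torch.softmax(model(t), dim=1)`) as the forward function instead of `model`.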

NarineK commented 4 years ago

Hi @heytitle, this is very interesting! Thank you for bringing it up. The IMDB classification task is binary, not multi-class (taking multi-class to mean more than two classes).

We can definitely also try computing the attribution in sigmoid space, but we get similar results; at least that is what I'm seeing.

wangyongjie-ntu commented 4 years ago

@heytitle I have the same concern. Why do these methods in Captum use the logits instead of the probabilities after softmax?
In my view, these methods apply to arbitrary functions (before or after softmax). Is there some insight behind this choice?

NarineK commented 4 years ago

@wangyongjie-ntu, that is an excellent question. Usually when we compute attributions using a softmax or sigmoid layer as the output layer, we obtain smaller attribution magnitudes.

Specifically, the authors of the DeepLIFT paper recommend computing the attributions for the logit layer: https://arxiv.org/pdf/1704.02685.pdf, Section 3.6, "Choice of Target Layer", discusses this case. There the authors argue that if we use the sigmoid output, a higher logit doesn't translate into a higher attribution score.
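The saturation effect behind that recommendation is easy to see numerically: once the sigmoid saturates, much larger logits barely change the output, so gradient-based attribution signal through the sigmoid vanishes. A torch-only illustration (not from the paper):

```python
import torch

# Three increasingly confident logits.
logits = torch.tensor([2.0, 5.0, 10.0], requires_grad=True)
probs = torch.sigmoid(logits)
probs.sum().backward()

# The sigmoid outputs are nearly identical once saturated...
print(probs)        # ~0.881, ~0.993, ~0.99995
# ...and the gradient through the sigmoid (p * (1 - p)) collapses:
print(logits.grad)  # ~0.105, ~0.0066, ~0.000045
```

So a logit of 10 contributes almost no gradient signal through the sigmoid, even though it reflects far more confidence than a logit of 2.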

In the case of IG, I've noticed in the tutorials above that the scores are comparable.

If you see significant differences between the attribution scores with and without the logit output, it would be interesting to look deeper into the semantics of those features and check whether those contributions make sense.

wangyongjie-ntu commented 4 years ago

@NarineK Smaller attribution magnitudes make sense, because the difference in probabilities is smaller than that of the logits (so completeness holds).
However, when Captum's DeepLIFT uses the probability as the function F, completeness seems to be violated. Here is an example analyzing feature importance on a naive multi-layer perceptron. With logits, it is OK.

image

NarineK commented 4 years ago

@wangyongjie-ntu , is the last layer a sigmoid or softmax ?

wangyongjie-ntu commented 4 years ago

@NarineK Sorry, I should have introduced it in more detail. The task is from https://www.kaggle.com/iabhishekofficial/mobile-price-classification: four classes. I just built a naive MLP for it. The last layer is a softmax for the cross-entropy loss. DeepLIFT works well on the logits function but fails on the probability function after the softmax.

NarineK commented 4 years ago

Hi @wangyongjie-ntu, we apply rules to both softmax and sigmoid, but we can apply them only if the layers are defined through torch.nn in the module's init, e.g. self.softmax = torch.nn.Softmax(). If they are defined through Functionals we cannot hook them. The same applies to ReLUs and other non-linearities.

wangyongjie-ntu commented 4 years ago

@NarineK Do you mean this kind of code?

```python
import torch
import torch.nn as nn

class Mobile(nn.Module):

    def __init__(self):
        super(Mobile, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(20, 16),
            nn.ReLU(),
            nn.Linear(16, 12),
            nn.ReLU(),
            nn.Linear(12, 4),
        )
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        fc = self.model(x)
        # subtract the per-row max from the logits for numerical stability
        max_factor, _ = torch.max(fc, dim=1)
        max_factor = max_factor.expand(4, len(max_factor)).t()
        normed_fc = fc - max_factor
        prob = self.softmax(normed_fc)
        prob = prob + 10e-10  # avoid exact zeros
        return prob
```

NarineK commented 4 years ago

@wangyongjie-ntu, yes, that's what I meant by defining self.softmax in the constructor. I see that you also use torch.max. The DeepLift implementation overrides the gradients for some non-linearities, but not for torch.max. I'm a bit concerned about the torch.max because its gradients won't be overridden. We override MaxPool layers, but torch.max isn't seen as a layer.

wangyongjie-ntu commented 4 years ago

@NarineK I use the probability in the above class, but DeepLIFT still computes the feature importance w.r.t. the logits.

NarineK commented 4 years ago

@wangyongjie-ntu, how do you construct DeepLift in that case? The original question was that it works with logits but not with softmax?

wangyongjie-ntu commented 4 years ago

@NarineK Thanks very much for your patient reply.

Why do I want to use DeepLIFT on probabilities? I am drafting a method that works on probabilities and selected IG, DeepLIFT, etc. as baselines. For a fair comparison, I want to compute the importance scores on the probability function.

I just removed the torch.max operation from the above snippet and fed fc directly into nn.Softmax. DeepLIFT now works.

image
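For concreteness, a sketch of the simplified module (same hypothetical layer sizes as the earlier snippet). PyTorch's softmax is implemented in a numerically stable way internally, so the manual max-subtraction is unnecessary, and dropping it removes the torch.max op that DeepLIFT cannot hook:

```python
import torch
import torch.nn as nn

class Mobile(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(20, 16), nn.ReLU(),
            nn.Linear(16, 12), nn.ReLU(),
            nn.Linear(12, 4),
        )
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # Feed the logits straight into nn.Softmax: no manual torch.max
        # normalization, so every op in the graph is a hookable nn module.
        return self.softmax(self.model(x))

net = Mobile().eval()
x = torch.randn(3, 20)
probs = net(x)
print(probs.sum(dim=1))  # each row sums to 1
```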

NarineK commented 4 years ago

Hi @wangyongjie-ntu, thank you for the explanation! Nice, directly feeding the fc into torch.nn.softmax sounds like a good idea. Glad that DeepLift worked.

NarineK commented 4 years ago

@wangyongjie-ntu, can we close this issue or are you still working on it ?

wangyongjie-ntu commented 4 years ago

@NarineK Thanks. We can close.