pytorch / captum

Model interpretability and understanding for PyTorch
https://captum.ai
BSD 3-Clause "New" or "Revised" License

What is the best way to extend GradCAM to GradCAM++ and then Smooth GradCAM++? #353

Closed: yangfantrinity closed this issue 4 years ago

yangfantrinity commented 4 years ago

Does the package already have a function that supports GradCAM++ (https://arxiv.org/pdf/1710.11063.pdf)?

I know that NoiseTunnel corresponds to SmoothGrad.

vivekmig commented 4 years ago

Hi @yangfantrinity, right now we don't have support for GradCAM++, but we will consider adding it in the future.

You should be able to modify the existing GradCAM implementation to include the changes proposed in GradCAM++. For example, when applying an exponential to the output logit S^c, the proposed GradCAM++ weight coefficient for activation map A^k is:

alpha^{kc}_{ij} = (dS^c / dA^k_{ij})^2 / ( 2 * (dS^c / dA^k_{ij})^2 + sum_{a,b} A^k_{ab} * (dS^c / dA^k_{ij})^3 )

This can be incorporated into the GradCAM implementation by replacing line 204 in attr/_core/layer/grad_cam.py with something like:

squared_layer_gradients = tuple(
    layer_grad ** 2 for layer_grad in layer_gradients
)
cubed_layer_gradients = tuple(layer_grad ** 3 for layer_grad in layer_gradients)
summed_acts = tuple(
    torch.sum(
        layer_eval,
        dim=tuple(x for x in range(2, len(layer_eval.shape))),
        keepdim=True,
    )
    for layer_eval in layer_evals
)

alphas = tuple(
    squared_layer_gradient
    / ((2 * squared_layer_gradient) + (cubed_layer_gradient * summed_act))
    for squared_layer_gradient, cubed_layer_gradient, summed_act in zip(
        squared_layer_gradients, cubed_layer_gradients, summed_acts
    )
)

# Replace NaNs (from 0 / 0 where the gradients are zero) with 0
for alpha in alphas:
    alpha[alpha != alpha] = 0

summed_grads = tuple(
    torch.sum(
        alpha * F.relu(layer_grad),
        dim=tuple(x for x in range(2, len(layer_grad.shape))),
        keepdim=True,
    )
    for alpha, layer_grad in zip(alphas, layer_gradients)
)

Note that this applies when taking gradients with respect to an exponential applied to the logit (softmax would be different, as described in the paper). This proposed change hasn't been tested thoroughly and may have some issues, but it should help you get started with adapting the existing GradCAM implementation for GradCAM++.
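
If you prefer not to hard-code the exponential inside the attribution code, one rough way to get gradients of exp(S^c) is to wrap the model before passing it to LayerGradCam. This is just a minimal sketch; ExpLogitWrapper is not part of Captum, and model, target_layer, inputs, and pred_class are placeholders:

import torch
import torch.nn as nn
from captum.attr import LayerGradCam

class ExpLogitWrapper(nn.Module):
    # Hypothetical wrapper (not part of Captum): the forward pass returns
    # exp(logits), so gradients w.r.t. layer activations become exp(S^c) * dS^c/dA.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        # Note: exp can overflow for very large logits.
        return torch.exp(self.model(x))

# `model`, `target_layer`, `inputs`, and `pred_class` are placeholders for your
# own network, its last conv layer, an input batch, and a target class index.
wrapped = ExpLogitWrapper(model)
layer_gc = LayerGradCam(wrapped, target_layer)
attributions = layer_gc.attribute(inputs, target=pred_class)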

For SmoothGrad, you should be able to use NoiseTunnel directly on top of GradCAM with the modification. Hope this helps!
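
Roughly, that would look like the sketch below (argument names may differ slightly across Captum versions; wrapped, target_layer, inputs, and pred_class are the same placeholders as above):

from captum.attr import LayerGradCam, NoiseTunnel

layer_gc = LayerGradCam(wrapped, target_layer)
nt = NoiseTunnel(layer_gc)
attributions = nt.attribute(
    inputs,
    nt_type="smoothgrad",  # average attributions over noisy copies of the input
    nt_samples=10,         # number of noisy samples (n_samples in some older versions)
    stdevs=0.2,            # std of the Gaussian noise added to the inputs
    target=pred_class,
)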

yangfantrinity commented 4 years ago

Thank you for your detailed reply and the code, @vivekmig.

If I may ask two more questions: 1) are we safe to assume exp(logit)? The reason here applying an exponential to the output logit is mainly for the ease of computation? 2) What should be the assumption for efficientnet? After the last convolution layer, there are norm, average, dropout layers: image
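
To make that concrete, a rough sketch of what I am doing (the layer names assume the efficientnet_pytorch package and are my own choice, not something from Captum; the input and target class are placeholders):

import torch
from captum.attr import LayerGradCam
from efficientnet_pytorch import EfficientNet  # assumed package; names may differ elsewhere

model = EfficientNet.from_pretrained("efficientnet-b0").eval()
inputs = torch.randn(1, 3, 224, 224)  # placeholder input batch

# Hook GradCAM onto the last convolution, i.e. before the batch norm,
# average pooling, dropout, and fully connected head.
layer_gc = LayerGradCam(model, model._conv_head)
attributions = layer_gc.attribute(inputs, target=0)  # target class index is a placeholder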

Before I got your reply, which is definitely a more efficient and robust way to implement GradCAM++, I had written the following code after line 202:

gradients = layer_gradients[0]
activations = layer_evals[0]
b, k, u, v = gradients.size()

# GradCAM++ alpha coefficients, assuming Y^c = exp(S^c)
alpha_num = gradients.pow(2)
alpha_denom = alpha_num.mul(2) + activations.mul(gradients.pow(3)).view(b, k, u * v).sum(-1).view(b, k, 1, 1)
alpha_denom = torch.where(alpha_denom != 0.0, alpha_denom, torch.ones_like(alpha_denom))

alpha = alpha_num.div(alpha_denom + 1e-7)
# `logit` is the target class score computed earlier
positive_gradients = F.relu(logit.exp() * gradients)  # ReLU(dY/dA) == ReLU(exp(S) * dS/dA)
weights = (alpha * positive_gradients).view(b, k, u * v).sum(-1).view(b, k, 1, 1)

undo_gradient_requirements(inputs, gradient_mask)
weights = (weights,)  # wrap in a single-element tuple to match layer_evals

and also replaced lines 215 to 218 with:

scaled_acts = tuple(
    torch.sum(weight * layer_eval, dim=1, keepdim=True)
    for weight, layer_eval in zip(weights, layer_evals)
)

This idea is borrowed from https://github.com/1Konny/gradcam_plus_plus-pytorch.

I will implement the code you suggested here.

vivekmig commented 4 years ago

Hi @yangfantrinity , no problem!

  1. Yeah, I agree the exponential is mostly for computational convenience. My guess is that adding this exponential would still provide meaningful attributions, but I'm not sure how they would compare with using softmax or just the logit.
  2. I think the assumption should still hold: average pooling and normalization are linear operations, and dropout shouldn't affect inference anyway.

Your modification looks good too. The main difference is that we maintain gradients and activations as tuples to support layers with multiple inputs / outputs, but in your case (and for GradCAM / image classification generally), that's usually not necessary.

yangfantrinity commented 4 years ago

Thank you @vivekmig, it was very nice to have this discussion with you here. @NarineK Shall I close this question?

NarineK commented 4 years ago

Thank you everyone! Closing this issue!