jacobgil / pytorch-grad-cam

Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.
https://jacobgil.github.io/pytorch-gradcam-book
MIT License

Nan in ScoreCAM and 0 value for GradCAMPlusPlus #214

Closed hanwei0912 closed 2 years ago

hanwei0912 commented 2 years ago

Hi,

I have an issue when using ScoreCAM to generate saliency maps for ResNet50. There are NaN values, and I found they come from the line upsampled = (upsampled - mins) / (maxs - mins) in score_cam.py: for some images, maxs - mins is 0. I propose adding a small number like 1e-7 here to fix this bug.
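A minimal sketch of the proposed fix, assuming the activations have the usual (batch, channels, H, W) shape that score_cam.py normalizes per channel (the helper name is illustrative, not the library's API):

```python
import numpy as np

def normalize_activations(upsampled, eps=1e-7):
    """Min-max normalize each channel of a (batch, channels, H, W) array.

    Adding a small epsilon to the denominator keeps the division finite
    when a channel is constant, i.e. when maxs - mins == 0, which is
    what produces the NaNs in ScoreCAM.
    """
    maxs = upsampled.max(axis=(2, 3), keepdims=True)
    mins = upsampled.min(axis=(2, 3), keepdims=True)
    return (upsampled - mins) / (maxs - mins + eps)
```

With a constant channel, the original expression yields 0 / 0 = NaN, while this version returns all zeros.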

Besides, I also observed that when using GradCAMPlusPlus to generate saliency maps for Moco-v3-ResNet50 or SwinT, the saliency maps are 0 everywhere. When I went into the code, I found it is due to the line cam = np.maximum(cam, 0) in the function compute_cam_per_layer of base_cam.py: GradCAMPlusPlus generates saliency maps with only negative values before normalization, so I end up with 0 values everywhere.
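A toy reproduction of this failure mode: when every CAM value is negative, the ReLU-style clamp zeroes the whole map, and no later rescaling can recover it.

```python
import numpy as np

# Toy saliency map with only negative values, as reported for
# GradCAMPlusPlus with Moco-v3-ResNet50 / SwinT backbones.
cam = np.array([[-0.5, -0.1],
                [-0.9, -0.3]])

# The clamp from compute_cam_per_layer: every entry becomes 0.
clamped = np.maximum(cam, 0)

# Subsequent min-max scaling cannot recover anything from an all-zero map
# (epsilon in the denominator only avoids 0/0).
normalized = (clamped - clamped.min()) / (clamped.max() - clamped.min() + 1e-7)
```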

Also, the official GradCAMPlusPlus code normalizes the saliency map s by s / max(s), but what you do is keep only the positive part and then normalize by (s - min(s)) / (max(s) - min(s)). I do not see why you do it this way. Could you explain the benefit?

Thank you in advance, Best, Hanwei

jacobgil commented 2 years ago

Hi, in the official implementation (I think this is the one): https://github.com/adityac94/Grad_CAM_plus_plus/blob/4a9faf6ac61ef0c56e19b88d8560b81cd62c5017/misc/utils.py#L137 it also takes maximum(cam, 0), and then normalizes by scaling to be between 0 and 1. I think both implementations behave the same in this part.
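A quick check of why the two normalizations coincide here: after maximum(cam, 0), the minimum is 0 whenever any value was non-positive, so s / max(s) and (s - min(s)) / (max(s) - min(s)) give the same result.

```python
import numpy as np

# Apply the ReLU clamp first, as both implementations do.
s = np.maximum(np.array([-0.2, 0.1, 0.5, 0.9]), 0)

a = s / s.max()                            # official-style normalization
b = (s - s.min()) / (s.max() - s.min())    # min-max normalization
```

Both a and b equal the same array, since s.min() is 0 after the clamp.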

Adding an epsilon to score-cam is a great idea, we should definitely add this.

The negative CAM part can be tricky. It is possible that all the gradients are negative, for example. In that case, the maximum operator removes everything. For ImageNet this usually doesn't happen, and removing the negative gradients is a good way to clean the image. But "in the wild" it might happen more often. I would remove the maximum operator in your case and see how it looks.

Maybe it would make sense to add an argument that controls if negative cam values should be removed or not.
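One way such an argument could look, sketched here as a standalone helper; the names aggregate_cam and clamp_negative are illustrative, not part of the pytorch-grad-cam API:

```python
import numpy as np

def aggregate_cam(weighted_activations, clamp_negative=True):
    """Hypothetical sketch of making the ReLU step optional.

    `weighted_activations` has shape (batch, channels, H, W);
    channels are summed into one map per image.
    """
    cam = weighted_activations.sum(axis=1)
    if clamp_negative:
        # ReLU as in the original Grad-CAM: keep only positive evidence.
        # Disabling this preserves all-negative maps instead of zeroing them.
        cam = np.maximum(cam, 0)
    return cam
```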

hanwei0912 commented 2 years ago

Hi,

Thank you for your answer!

Yes, I read the official implementation and paper again more carefully, and I realized the maximum(cam, 0) is actually used as a ReLU. In that sense, keeping it for CNN-based methods is reasonable.

Actually, I also use ImageNet; the negative CAM happens because the target network is SwinT. Since we are combining multi-head outputs rather than feature maps, I am not sure it makes sense to exclude the negative values. I need to think about it more carefully.

Thank you again for those explanations.

Hanwei