ismailuddin / gradcam-tensorflow-2

🧰 Grad-CAM implementation using TensorFlow 2.X code. Including guided Grad-CAM and counterfactuals.

Logits or post-softmax? #2

Open mlerma54 opened 1 year ago

mlerma54 commented 1 year ago

Hi @ismailuddin

I looked at your implementation of Grad-CAM and it seems to me that the heatmaps are calculated using gradients of post-softmax outputs rather than logits (pre-softmax). The last layer of your classifier_model seems to be the "predictions" layer of the ResNet50, which includes a softmax activation, so this would imply that your code uses gradients of post-softmax outputs. Is that correct?

The original Grad-CAM paper tells us to use logits, but I have seen a number of implementations of Grad-CAM that use gradients of post-softmax outputs, so I wonder if using logits or post-softmax outputs is a matter of choice, or if there are good reasons to use one or the other.
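For reference, this is roughly what taking the gradients against the logits looks like in TF 2. This is a minimal sketch for illustration only, not code from this repository; the layer name and the `grad_cam` helper are my own, and the input image is assumed to be already resized and preprocessed for ResNet50.

```python
import tensorflow as tf

# Load ResNet50 without the final softmax, so model.output are the logits.
model = tf.keras.applications.ResNet50(weights="imagenet", classifier_activation=None)
last_conv_layer = model.get_layer("conv5_block3_out")  # last conv block output in Keras ResNet50

# Model that returns both the conv feature maps and the logits.
grad_model = tf.keras.Model(model.inputs, [last_conv_layer.output, model.output])

def grad_cam(image, class_index):
    """image: preprocessed (224, 224, 3) array; class_index: target class."""
    with tf.GradientTape() as tape:
        conv_output, logits = grad_model(image[None, ...])
        class_score = logits[:, class_index]            # pre-softmax score (logit)
    grads = tape.gradient(class_score, conv_output)     # d(logit) / d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))        # global-average-pool the gradients
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_output, axis=-1)
    return tf.nn.relu(cam)[0].numpy()                   # keep only positive influence
```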

Thank you!

ismailuddin commented 1 year ago

Hi @mlerma54, thanks for pointing this out. I think this was an error on my part. The final layer, a Dense layer, has the softmax activation already built in; I was probably assuming there was a separate tf.keras layer for the softmax operation. I'm wondering if all the cases you've seen so far using the softmax outputs were actually by accident. Certainly from my implementation, the output of Grad-CAM still seems to work pretty well.

It's not immediately clear to me what the disadvantage of using the softmax outputs is for this calculation. As all softmax does is normalise the predictions so they sum to 1, there is probably no major harm...? Do you have any thoughts on this?

mlerma54 commented 1 year ago

It is possible to remove the softmax activation in the "predictions" layer by using the "classifier_activation=None" flag when loading the ResNet50; you can then add a separate softmax activation layer if you wish to get predictions as probabilities.
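A rough sketch of that suggestion (names here are just for illustration): the logits feed Grad-CAM, and a separate `Softmax` layer is appended only when probabilities are needed.

```python
import tensorflow as tf

# ResNet50 with the built-in softmax disabled: base.output are the logits.
base = tf.keras.applications.ResNet50(weights="imagenet", classifier_activation=None)

logits = base.output                          # pre-softmax scores, use these for Grad-CAM
probs = tf.keras.layers.Softmax()(logits)     # probabilities, use these for reporting predictions
model_with_probs = tf.keras.Model(base.inputs, probs)
```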

I don't really know if the Grad-CAM implementations I found using post-softmax outputs are a mistake or just a design choice. Certainly most of the time there is not much difference, unless the outputs get saturated (close to 100% probability for a class), in which case the gradients may get very small or even vanish. In fact, most of the time I have used Grad-CAM in the past I used the "post-softmax" version, and I can only remember running into problems once. Apart from that, I find it more natural to use post-softmax outputs, since that is what is matched against the target outputs during training.

The fact that (the post-softmax version of) Grad-CAM can suffer a sort of "vanishing gradients" problem is, in my view, a weakness of the algorithm. I am now leaning towards using some kind of combination of Grad-CAM and Integrated Gradients for robustness.
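A toy example of the saturation effect I mean (illustrative only, not tied to any particular model): when one logit dominates, the gradient of the post-softmax probability collapses towards zero, while the gradient of the logit itself does not.

```python
import tensorflow as tf

logits = tf.Variable([[2.0, 1.0, 0.0]])        # mild confidence for class 0
saturated = tf.Variable([[20.0, 1.0, 0.0]])    # ~100% probability for class 0

for x in (logits, saturated):
    with tf.GradientTape() as tape:
        prob = tf.nn.softmax(x)[:, 0]          # post-softmax output for class 0
    print(tape.gradient(prob, x).numpy())
# The gradient in the saturated case is vanishingly small, which is what
# starves Grad-CAM of signal when it is computed on post-softmax outputs.
```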