Closed cyy280113999 closed 2 years ago
We apply normalization to the logits, and we call the post-softmax results scores, which lie in the range [0, 1].
Could you clarify the difference between them or post your code?
Sure.
b, c, h, w = input.size()
assert b == 1
# prediction on raw input
logit = self.model_arch(input.cuda())
if class_idx is None:
    predicted_class = logit.max(1)[-1]
else:
    predicted_class = torch.LongTensor([class_idx])
# original version: raw logit of the target class
if not sg:
    net_fun = lambda x: self.model_arch(x)[:, predicted_class]
# new version, softmax gradient: post-softmax probability of the target class
else:
    net_fun = lambda x: F.softmax(self.model_arch(x), 1)[:, predicted_class]
# net_fun(input)
activations = self.activations
b, k, u, v = activations.size()
# score_saliency_map = torch.zeros((1, 1, h, w)).cuda()
scores = torch.zeros((k,)).cuda()
# configure for your GPU memory
parallel_batch = 64
for i in range(0, k, parallel_batch):
    j_exclude = i + parallel_batch
    # upsampling; reshape(-1, ...) also handles a last chunk smaller than parallel_batch
    mask = activations[0, i:j_exclude, :, :].reshape(-1, 1, u, v)
    mask = F.interpolate(mask, size=(h, w), mode='bilinear', align_corners=False)
    # normalize
    mask = binarize(mask)
    # save the score
    scores[i:j_exclude] = net_fun(input * mask).reshape(-1)
if norm:
    # softmax over the k channel scores
    scores = F.softmax(scores, 0)
score_saliency_map = (scores.reshape(1, k, 1, 1) * activations).sum(1, keepdim=True)
score_saliency_map = F.interpolate(score_saliency_map, size=(h, w), mode='bilinear', align_corners=False)
if relu:
    score_saliency_map = F.relu(score_saliency_map)
    score_saliency_map = binarize(score_saliency_map)
else:
    score_saliency_map = double_way_normalize(score_saliency_map)
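As you can see, 'sg' changes the scalar-valued network function $\phi(x)$: without it, $\phi(x)$ is the logit of the target class; with it, $\phi(x)$ is the post-softmax probability of that class. 'norm' decides whether softmax normalization is applied to the scores across channels.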
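The four combinations of 'sg' and 'norm' can be sketched on toy numbers. This is only an illustration in plain Python, not the code from the snippet above: `channel_weights` and `toy_logits` are hypothetical names, and the logits are made up rather than produced by a model.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def channel_weights(logits_per_mask, class_idx, sg=False, norm=False):
    """Return one weight per mask (activation channel).

    logits_per_mask holds one list of class logits per masked input.
    """
    if sg:
        # softmax over the classes, separately for each masked input
        scores = [softmax(logits)[class_idx] for logits in logits_per_mask]
    else:
        # raw logit of the target class
        scores = [logits[class_idx] for logits in logits_per_mask]
    if norm:
        # softmax over the masks, applied to the vector of scores
        scores = softmax(scores)
    return scores

# Made-up logits for three masked inputs and two classes.
toy_logits = [[2.0, 0.5], [0.1, 1.2], [1.0, 1.0]]
for sg in (False, True):
    for norm in (False, True):
        weights = channel_weights(toy_logits, class_idx=0, sg=sg, norm=norm)
        print(f"sg={sg!s:5} norm={norm!s:5}", [round(w, 3) for w in weights])
```

With these toy values, the spread between the largest and smallest weight is narrower for sg + norm than for norm alone.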
Here are examples for the class "dog" in an image.
[Images: (a) no sg, no norm (logits used as weights); (b) with norm; (c) with sg; (d) with sg + norm]
(1) What is 'sg'? If I understand correctly, it should be identical to without 'sg' + softmax. It's unclear to me why you add such an operation in the procedure.
(2) It makes no sense to apply both 'sg' and 'norm'. The softmax would be applied twice in that case, which pulls the scores of the different categories closer together. Below is an example.
import torch
import torch.nn as nn
input = torch.randn(1, 3)
softmax = nn.Softmax(dim=1)
# the two results coincide only if all values in input are equal
assert not torch.allclose(softmax(input), softmax(softmax(input)))
Please let me know if you have further questions; I'm happy to discuss.
Yes, it's weird. Softmax is meant for inputs over all of $\mathbb{R}$. Applying it to probabilities between 0 and 1 makes the scores lose even more discrimination. I only did it to understand the process in the paper.
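A small sketch of why this happens, using made-up probabilities: when every input to softmax lies in [0, 1], the ratio between the largest and smallest output can never exceed $e \approx 2.72$, no matter how far apart the inputs were in relative terms. The values in `probs` are assumptions chosen for illustration.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up post-softmax scores for three channels, all inside [0, 1].
probs = [0.99, 0.50, 0.01]
weights = softmax(probs)

ratio_before = max(probs) / min(probs)
ratio_after = max(weights) / min(weights)
print("weights:", [round(w, 3) for w in weights])
print("max/min before:", round(ratio_before, 2))
print("max/min after: ", round(ratio_after, 2))
```

The bound follows from $e^{p_i}/e^{p_j} = e^{p_i-p_j} \le e^{1}$ for any $p_i, p_j \in [0, 1]$.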
I'll keep this issue for the "Normalization" topic. It is closed now.
Hi,
I know this issue was closed, but I'm still a little confused about the two applications of the softmax function. To me, it seems like there are two (separate) places that the softmax could be applied:
My understanding is that these two applications perform different functions. In 1, we are normalizing the model's output given a particular perturbed input, and in 2, we are normalizing the weights for each activation map. However, above, you say that it makes no sense to combine sg and norm. Could you elaborate on this a bit further? Are sg and norm somehow actually performing the same function?
There are two things I need to explain.
Let's assume we have only two masks to run. The model outputs logits for only two classes, and we focus on class one.
1.1 apply softmax on logits
First, look at mask one. We feed the masked input into the model and get logits back, like this:
$$[z_1,z_2]$$
We need the score of class one, but first we apply softmax to the logits and get this:
$$\left[\frac{e^{z_1}}{e^{z_1}+e^{z_2}}, \frac{e^{z_2}}{e^{z_1}+e^{z_2}}\right]$$
Then we select the first class:
$$\frac{e^{z_1}}{e^{z_1}+e^{z_2}}$$
Similarly, we get the score for mask two:
$$[z_3,z_4]$$
$$\left[\frac{e^{z_3}}{e^{z_3}+e^{z_4}}, \frac{e^{z_4}}{e^{z_3}+e^{z_4}}\right]$$
$$\frac{e^{z_3}}{e^{z_3}+e^{z_4}}$$
We put them together as the scores of the two masks:
$$\left[\frac{e^{z_1}}{e^{z_1}+e^{z_2}}, \frac{e^{z_3}}{e^{z_3}+e^{z_4}}\right]$$
1.2 apply softmax on scores
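Here we start from the raw logits of both masks, select class one from each, and then apply softmax across the masks. Finally, we should get this:
$$[z_1,z_2],[z_3,z_4]$$
$$[z_1,z_3]$$
$$\left[\frac{e^{z_1}}{e^{z_1}+e^{z_3}}, \frac{e^{z_3}}{e^{z_1}+e^{z_3}}\right]$$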
they look very similar...
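A quick numeric check of the two variants, as a sketch in plain Python with made-up logits (`z1` to `z4` are assumptions, not outputs of a real model). It shows that 1.1 and 1.2 look similar in form but can even rank the masks differently.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: mask one gives [z1, z2], mask two gives [z3, z4].
z1, z2 = 3.0, 1.0
z3, z4 = 2.0, -2.0

# 1.1: softmax over the classes of each mask, then pick class one.
scores_per_mask = [softmax([z1, z2])[0], softmax([z3, z4])[0]]

# 1.2: pick the class-one logit of each mask, then softmax over the masks.
scores_across_masks = softmax([z1, z3])

print("1.1:", [round(s, 3) for s in scores_per_mask])
print("1.2:", [round(s, 3) for s in scores_across_masks])
```

In 1.1 each score depends on the other class's logit for the same mask, and the scores do not have to sum to 1. In 1.2 each score depends on the class-one logit of the other mask, and the scores always sum to 1. With these numbers, 1.1 prefers mask two while 1.2 prefers mask one.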
Suppose we have already applied softmax and got $[p_1,p_3]$,
and then we apply softmax to them again and get $[pp_1,pp_3]$.
As you know, the input of softmax is meant to range from negative infinity to positive infinity, but here it is confined to [0, 1].
If $p_1\ge p_3$, then $pp_1\ge pp_3$, so the maximum is preserved, but
$$|pp_1-pp_3|\le |p_1-p_3|$$
The probabilities get closer to 0.5 and the difference becomes smaller.
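For the two-mask case, the shrinkage can be made explicit. A short derivation, assuming $[pp_1,pp_3]$ is the softmax of $[p_1,p_3]$ and both $p_1$ and $p_3$ lie in $[0,1]$:
$$pp_1-pp_3=\frac{e^{p_1}-e^{p_3}}{e^{p_1}+e^{p_3}}=\tanh\left(\frac{p_1-p_3}{2}\right)$$
Because $|\tanh(t)|\le|t|$, this gives
$$|pp_1-pp_3|\le\frac{|p_1-p_3|}{2}$$
and because $|p_1-p_3|\le 1$, the gap after the second softmax can never exceed $\tanh(1/2)\approx 0.46$.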
Thank you so much for your quick response.
So would the following be a correct interpretation of what you said above:
Applying the softmax to the model's output (from a single perturbed image) is similar to (but not exactly the same as) applying the softmax to each activation weight (i.e. the logits for the class of interest across all perturbed inputs). However, it is not appropriate to apply the softmax in both places, because this would "minimize the difference" between distinct values.
Hi, haofanwang, the author of score_cam.
I'm a student learning about the interpretability of CNNs. Something about score_cam confuses me. I quote the following paragraph from the paper "Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks".
3.2. Normalization on Score Each forward passing in neural network is independent, the score amplitude of each forward propagation is unpredictable and not fixed. The relative output value (post-softmax) after normalization is more reasonable to measure the relevance than absolute output value (pre-softmax). Thus, in Score-CAM, we represent weight as post-softmax value, so that the score can be rescaled into a fixed range. ... Normalization operation equips Score-CAM with good class discrimination ability.
What exactly is the normalization operation? After reading, I have two ideas.
Are both operations applied? Which of them improves the discrimination power?
BTW, when both are applied, the result is worse.