hbai98 / SCM

MIT License

PatchCAM in CUB and ImageNet. #6

Open Nastu-Ho opened 1 year ago

Nastu-Ho commented 1 year ago

PatchCAM refers to the CAM generated from the patch tokens. The activation of patchCAM on the CUB dataset looks strange to me: it is high across almost the whole image (nearly the entire CAM is red), and in particular the foreground object is activated more weakly than the background, so patchCAM does not seem to provide class-specific semantic information on CUB. On ImageNet, however, patchCAM behaves normally, with strong foreground activation and weak background activation, and does provide category-specific semantic information. I wonder why this interesting phenomenon occurs.

hbai98 commented 1 year ago

> PatchCAM refers to the CAM generated from the patch tokens. The activation of patchCAM on the CUB dataset looks strange to me: it is high across almost the whole image (nearly the entire CAM is red), and in particular the foreground object is activated more weakly than the background, so patchCAM does not seem to provide class-specific semantic information on CUB. On ImageNet, however, patchCAM behaves normally, with strong foreground activation and weak background activation, and does provide category-specific semantic information. I wonder why this interesting phenomenon occurs.

Does the patchCAM refer to the initial attention map from the Transformer, e.g., Eq. (1) in our main paper?

Nastu-Ho commented 1 year ago

> Does the patchCAM refer to the initial attention map from the Transformer, e.g., Eq. (1) in our main paper?

patchCAM refers to the semantic map in your paper (image attached).
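For readers following along: the "initial attention map" asked about above is presumably the standard softmax self-attention weights. A minimal sketch of that computation (the function name and shapes here are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

def attention_map(q, k):
    """Standard self-attention weights, softmax(Q K^T / sqrt(d)).

    q, k: (num_tokens, d) query/key matrices for a single head.
    Returns a (num_tokens, num_tokens) row-stochastic attention matrix.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

# toy example: 5 tokens with 8-dim queries/keys
gen = np.random.default_rng(0)
q = gen.standard_normal((5, 8))
k = gen.standard_normal((5, 8))
a = attention_map(q, k)  # every row sums to 1 by construction
```

The semantic map discussed in this thread is a separate quantity from these raw attention weights, which is the distinction the question is drawing.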

hbai98 commented 1 year ago

Hi! It's an interesting question.

It's very common for semantic maps to cover a single large square region, especially when only a single category is present in the image, e.g., Fig. 6. However, as you discovered, the semantic map $\boldsymbol{S}^0$ does provide class-specific semantic information when there are multiple objects, regardless of whether CUB or ImageNet is used.

For instance, you may refer to Sooty_Albatross_0066_796382.jpg in CUB (image and prediction attached).

This is because the model uses certain regions of interest (ROIs) in the image to compute the CLS loss.

  1. If there is one object, the model does not need to carefully separate it from its background, since the background (not labeled in the image) receives very little attention, i.e., low CLS scores.
  2. If there is more than one object, i.e., multiple categories that each have a relatively large distribution over the training dataset, the loss drives the model to find all of these objects and account for each of them when computing the CLS scores.

Therefore, you may find the class-specific semantic maps more satisfying on ImageNet, where a single image often contains many labeled objects.
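As a concrete illustration of the patch-token CAM being discussed, here is a minimal sketch: project each patch token onto a classifier weight vector, reshape the per-patch scores into the patch grid, and min-max normalize. The function name, shapes, and random data are assumptions for illustration, not the repo's actual implementation.

```python
import numpy as np

def patch_cam(patch_tokens, class_weights, grid_size):
    """CAM from ViT patch tokens for one target class.

    patch_tokens: (N, D) patch embeddings (CLS token excluded).
    class_weights: (D,) classifier weight vector of the target class.
    grid_size: (H, W) with H * W == N.
    Returns an (H, W) map min-max normalized to [0, 1].
    """
    scores = patch_tokens @ class_weights  # per-patch class evidence
    cam = scores.reshape(grid_size)
    cam = cam - cam.min()
    denom = cam.max()
    return cam / denom if denom > 0 else cam

# toy example: a 14x14 patch grid with 64-dim embeddings
gen = np.random.default_rng(0)
tokens = gen.standard_normal((196, 64))
w = gen.standard_normal(64)
cam = patch_cam(tokens, w, (14, 14))
```

Note that the min-max normalization also hints at the "whole CAM is red" effect: when foreground and background scores are close, the normalized map has low contrast everywhere.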

By the way, I have updated the GitHub repo; it can now output semantic maps and attention maps together.

Please take a look at the updated README.md in this repo.

hbai98 commented 1 year ago

(image attached)

The semantic map in this image prefers the bird rather than the woman standing near it.

Nastu-Ho commented 1 year ago

Thank you for your patient answers. I learned a lot.