shvdiwnkozbw / Multi-Source-Sound-Localization

This repo aims to perform sound localization in complex audiovisual scenes, where there are multiple objects making sounds.

Confused about the CAM method? #10

Open jokingww opened 11 months ago

jokingww commented 11 months ago

Nice work. In the code you get the CAM using only one Conv2d layer. My intuition was that a method like GradCAM would be needed to obtain the CAM, and I have also seen this approach used elsewhere. Can you explain why this works? Thank you.

shvdiwnkozbw commented 11 months ago

Thanks for your interest. Yes, this is an interesting phenomenon regarding the usage of GradCAM and CAM that we observed during our experiments. At the beginning of this work, we used GradCAM to calculate the class activation map. It worked well but required extra computation for backpropagation, and the implementation was somewhat complex and redundant. Motivated by this, we referred to CAAM (class-agnostic activation map) and visualized the audio and visual feature maps, and we found that these activation maps already pay more attention to the semantically salient areas. Therefore, one Conv2d layer is sufficient to summarize these activation cues into reliable class activation maps.

Besides the class activations, we also apply GradCAM to the audiovisual correspondence score to reveal the crucial visual areas related to the audio. This gradient emphasizes the visual regions that most influence the correspondence score, localizing the sounding objects in the visual scene.
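To illustrate why a single Conv2d layer suffices here: a 1x1 convolution is just a per-class linear projection applied at every spatial location, so if the backbone features already highlight semantically salient regions, this projection directly yields class activation maps. Below is a minimal numpy sketch (the shapes, weights, and names are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

# Hypothetical shapes: channels, spatial dims, number of classes.
C, H, W, K = 512, 14, 14, 10

rng = np.random.default_rng(0)
feat = rng.standard_normal((C, H, W))   # class-agnostic backbone feature map
w = rng.standard_normal((K, C)) * 0.01  # 1x1 Conv2d weights == per-class linear weights
b = np.zeros(K)                         # 1x1 Conv2d bias

# A 1x1 Conv2d applies the same linear map at every spatial location:
#   cam[k, h, w] = sum_c w[k, c] * feat[c, h, w] + b[k]
cam = np.tensordot(w, feat, axes=([1], [0])) + b[:, None, None]  # (K, H, W)

# Spatially pooling the maps gives per-class logits (the classic CAM setup),
# so no backward pass is needed to obtain the activation maps.
logits = cam.mean(axis=(1, 2))
print(cam.shape, logits.shape)
```

In PyTorch this would be `nn.Conv2d(C, K, kernel_size=1)` on the feature map; the forward pass alone produces the maps, which is the computational saving over GradCAM described above.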
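The GradCAM-on-correspondence-score idea can also be sketched. Here I assume, purely for illustration, a bilinear correspondence score between an audio embedding and the visual feature map, so the gradient with respect to the visual features has a closed form; the actual model's score function may differ:

```python
import numpy as np

# Illustrative shapes: channels and spatial dims of the visual feature map.
C, H, W = 64, 7, 7
rng = np.random.default_rng(1)
v = rng.standard_normal((C, H, W))        # visual feature map
a = rng.standard_normal(C)                # audio embedding
M = rng.standard_normal((C, C)) * 0.1     # assumed bilinear interaction matrix

# Assumed score: s = sum_{h,w} a^T M v[:, h, w]
# Its gradient w.r.t. v[:, h, w] is M^T a, identical at every location,
# so the GradCAM channel weights (spatially averaged gradients) are just M^T a.
alpha = M.T @ a                           # (C,) channel importance weights

# GradCAM heatmap: ReLU over the channel-weighted sum of the feature map.
heat = np.maximum(np.tensordot(alpha, v, axes=([0], [0])), 0.0)  # (H, W)
print(heat.shape)
```

With a real network the weights `alpha` would come from autograd on the correspondence score; the heatmap then highlights the visual regions whose features most increase that score, i.e. the likely sounding objects.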