shvdiwnkozbw / Multi-Source-Sound-Localization

This repo aims to perform sound localization in complex audiovisual scenes, where there are multiple objects making sounds.

Confused about the CAM method? #10

Open jokingww opened 11 months ago

jokingww commented 11 months ago

Nice work. In the code you get the CAM using only one Conv2d layer. My intuition was that a method like GradCAM would be needed to obtain the CAM, and I have also seen this approach used elsewhere. Can you explain why this works? Thank you.

shvdiwnkozbw commented 11 months ago

Thanks for your interest. Yes, this is an interesting phenomenon regarding the usage of GradCAM and CAM that we observed during our experiments. At the beginning of this work, we used GradCAM to calculate the class activation map. It worked well but required extra computation for backpropagation, and the implementation was somewhat complex and redundant. Motivated by this, we referred to CAAM (class-agnostic activation map) and visualized the audio and visual feature maps, and we found that these activation maps already pay more attention to the semantically salient areas. Therefore, one Conv2d layer is sufficient to summarize these activation cues into reliable class activation maps.

Besides the class activations, we also apply GradCAM to the audiovisual correspondence score to reveal the crucial visual areas related to the audio. This gradient emphasizes the visual regions that most influence the correspondence score, localizing the sounding objects in the visual scene.
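To illustrate why a single Conv2d layer suffices here: a 1x1 convolution is just a per-class linear projection applied at every spatial location, so if the backbone features already highlight semantically salient regions, this projection directly yields class activation maps. Below is a minimal numpy sketch (the shapes, weights, and names are illustrative assumptions, not the repo's actual code):

```python
import numpy as np

# Hypothetical shapes: channels, spatial dims, number of classes.
C, H, W, K = 512, 14, 14, 10

rng = np.random.default_rng(0)
feat = rng.standard_normal((C, H, W))   # class-agnostic backbone feature map
w = rng.standard_normal((K, C)) * 0.01  # 1x1 Conv2d weights == per-class linear weights
b = np.zeros(K)                         # 1x1 Conv2d bias

# A 1x1 Conv2d applies the same linear map at every spatial location:
#   cam[k, h, w] = sum_c w[k, c] * feat[c, h, w] + b[k]
cam = np.tensordot(w, feat, axes=([1], [0])) + b[:, None, None]  # (K, H, W)

# Spatially pooling the maps gives per-class logits (the classic CAM setup),
# so no backward pass is needed to obtain the activation maps.
logits = cam.mean(axis=(1, 2))
print(cam.shape, logits.shape)
```

In PyTorch this would be `nn.Conv2d(C, K, kernel_size=1)` on the feature map; the forward pass alone produces the maps, which is the computational saving over GradCAM described above.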
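The GradCAM-on-correspondence-score idea can also be sketched. Here I assume, purely for illustration, a bilinear correspondence score between an audio embedding and the visual feature map, so the gradient with respect to the visual features has a closed form; the actual model's score function may differ:

```python
import numpy as np

# Illustrative shapes: channels and spatial dims of the visual feature map.
C, H, W = 64, 7, 7
rng = np.random.default_rng(1)
v = rng.standard_normal((C, H, W))        # visual feature map
a = rng.standard_normal(C)                # audio embedding
M = rng.standard_normal((C, C)) * 0.1     # assumed bilinear interaction matrix

# Assumed score: s = sum_{h,w} a^T M v[:, h, w]
# Its gradient w.r.t. v[:, h, w] is M^T a, identical at every location,
# so the GradCAM channel weights (spatially averaged gradients) are just M^T a.
alpha = M.T @ a                           # (C,) channel importance weights

# GradCAM heatmap: ReLU over the channel-weighted sum of the feature map.
heat = np.maximum(np.tensordot(alpha, v, axes=([0], [0])), 0.0)  # (H, W)
print(heat.shape)
```

With a real network the weights `alpha` would come from autograd on the correspondence score; the heatmap then highlights the visual regions whose features most increase that score, i.e. the likely sounding objects.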