Thanks for releasing the code!
I have a quick question about the Grad-CAM visualization. Is there a particular reason for using the 3rd layer of the multimodal encoder? I tried other layers with your demo code, but the results generated from the 4th and 5th layers are quite inaccurate.
Thanks in advance :)