Advanced AI Explainability for computer vision. Support for CNNs, Vision Transformers, Classification, Object detection, Segmentation, Image similarity and more.
Hello,
I’m exploring how to visualize heatmaps for LLaVA and other multimodal large language models, to understand what the model focuses on during text generation. I am familiar with using Grad-CAM for single-target classification tasks. However, since LLaVA generates complete sentences, I’m unsure how to obtain a heatmap for each individual word. Could you provide any guidance on how to approach this?