[161] MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks

TL;DR

I read this because.. : related to my own research
task : measure whether VLMs lean too heavily on either vision or language
problem : existing occlusion + accuracy based methods cannot accurately measure which modality a model relies on.
idea : score how much each modality influenced the model's prediction, rather than the model's accuracy
input/output : {image, text} -> a score for each modality (positive, negative, neutral)
architecture : ALBEF, CLIP, LXMERT, 4 VQA models
baseline : task accuracy
data : VQA, GQA, Image-sentence alignment(VQA, GQA), VALSE, FOIL
evaluation : T-SHAP, V-SHAP
result : -
contribution :
etc. :
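
A minimal sketch of how the T-SHAP / V-SHAP scores listed under evaluation can be computed, as I understand the paper's definition: sum the absolute Shapley values of the text tokens and of the image patches, then normalise each sum by their total so the two scores add up to 1. The function name `mm_shap` and the toy values are hypothetical. The per-token Shapley values are assumed to be already computed by some SHAP explainer, which this sketch does not cover.

```python
from typing import Sequence, Tuple


def mm_shap(text_shap: Sequence[float],
            image_shap: Sequence[float]) -> Tuple[float, float]:
    """Return (T-SHAP, V-SHAP) from per-token Shapley values.

    text_shap  : Shapley values of the text tokens (assumed precomputed)
    image_shap : Shapley values of the image patches (assumed precomputed)
    """
    # Absolute values: a token counts whether it pushed the prediction
    # up or down, which is what makes the score performance-agnostic.
    phi_t = sum(abs(v) for v in text_shap)
    phi_v = sum(abs(v) for v in image_shap)
    total = phi_t + phi_v
    if total == 0:
        # No token had any influence, so the ratio is undefined.
        raise ValueError("all Shapley values are zero")
    return phi_t / total, phi_v / total


# Toy values, not taken from the paper.
t_shap, v_shap = mm_shap([0.2, -0.1, 0.1], [0.3, -0.3])
print(f"T-SHAP={t_shap:.2f}, V-SHAP={v_shap:.2f}")
```

With these toy values the text side sums to about 0.4 and the image side to about 0.6, so the sample leans slightly on vision. A balanced sample would give roughly 0.5 for both scores.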