[147] Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

long8v commented 7 months ago

paper, code

TL;DR

I read this because.. : a.k.a. CheferCAM. I'm interested in an explainable CLIP score. The paper's repo publishes a Colab where you can see per-token visualization results.
task : explainability in neural networks
problem : extend the previous work TiBA (https://github.com/long8v/PTIR/issues/158) beyond self-attention only, to co-attention in multi-modal settings and to encoder-decoder architectures
idea : instead of the earlier gradient with respect to the output (== LRP), use the gradient with respect to the attention map
input/output : model // heatmap over text or vision tokens
architecture : ViT, VisualBERT, LXMERT, DETR
baseline : rollout, raw attention, Grad-CAM, Partial LRP, TiBA
evaluation : perturbation (on both image and text tokens for VisualBERT), weakly-supervised semantic segmentation
result : better performance than the previous work
contribution : a work that makes cross-attention and co-attention explainable as well. ICCV oral.
etc. : The deep Taylor decomposition material in the earlier work was tiring, but if you skip that and read only this paper, no theory is needed and it is clean. It also performs well. On the flip side, the lack of theory makes it feel a bit ad hoc. For CLIP the final output is an embedding, so this may not actually be a visualization of the CLIPScore..? Need to look at the Colab more closely.

Details

some notation

i: image tokens
t: text tokens
$A^{tt}$: self-attention among text tokens / $A^{ii}$: self-attention among image tokens
$A^{ti}$: multi-modal attention interaction

Relevancy initialization

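We initialize a relevancy map and then update it layer by layer.

Before any self-attention (SA) there is no interaction between tokens, so $R^{ii}$ and $R^{tt}$ are identity matrices and $R^{it}$ is a zero tensor.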
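As a minimal NumPy sketch of this initialization (the helper name `init_relevancy` and the token counts are made up for illustration; shapes follow the notation above):

```python
import numpy as np

def init_relevancy(num_text_tokens, num_image_tokens):
    """Return (R_tt, R_ii, R_ti, R_it) before any attention layer is applied."""
    R_tt = np.eye(num_text_tokens)    # each text token is relevant only to itself
    R_ii = np.eye(num_image_tokens)   # same for image tokens
    R_ti = np.zeros((num_text_tokens, num_image_tokens))  # no cross-modal interaction yet
    R_it = np.zeros((num_image_tokens, num_text_tokens))
    return R_tt, R_ii, R_ti, R_it

R_tt, R_ii, R_ti, R_it = init_relevancy(num_text_tokens=3, num_image_tokens=5)
print(R_tt.shape, R_ii.shape, R_ti.shape, R_it.shape)
```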

Relevancy update rules

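The relevancy is updated using the attention map $A$. Following the previous work, we average over heads and weight by the gradient.

In this rule, $\nabla A$ is the derivative of $y_t$, the output for the class $t$ we want to visualize, with respect to $A$. Before averaging, only the positive values are kept (clamp). No particular reason is given for this; it simply follows the previous work.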
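A small NumPy sketch of this head-averaging step, $\bar{A} = \mathbb{E}_h\big[(\nabla A \odot A)^+\big]$ with $\nabla A = \partial y_t / \partial A$. The function name `avg_heads` and the toy numbers are mine; in practice the gradient would come from autograd rather than being typed in.

```python
import numpy as np

def avg_heads(attn, grad):
    """attn, grad: arrays of shape (heads, queries, keys). Returns (queries, keys)."""
    cam = grad * attn              # element-wise (Hadamard) product
    cam = np.clip(cam, 0.0, None)  # keep positive contributions only
    return cam.mean(axis=0)        # average over the head dimension

attn = np.array([[[0.5, 0.5], [0.2, 0.8]],
                 [[0.9, 0.1], [0.4, 0.6]]])
grad = np.array([[[1.0, -1.0], [0.0, 2.0]],
                 [[-1.0, 1.0], [1.0, 0.0]]])
A_bar = avg_heads(attn, grad)
print(A_bar)
```

Negative products are zeroed before the mean, so a head that votes against the class cannot cancel out a head that votes for it.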

In the relevance update rule for self-attention, s denotes the query tokens and q the key tokens.
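Written out, the self-attention rules as I transcribe them from the paper (with $\bar{A}$ the head-averaged, gradient-weighted attention map and $s, q \in \{i, t\}$) are:

$$
\begin{aligned}
R^{ss} &\leftarrow R^{ss} + \bar{A} \cdot R^{ss} \\
R^{sq} &\leftarrow R^{sq} + \bar{A} \cdot R^{sq}
\end{aligned}
$$

The first line updates how tokens of modality $s$ relate to each other; the second carries any existing cross-modal relevance through the same self-attention layer.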

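$R^{xx}$ can be split into two parts: the $I$ from initialization and the residual $\hat{R}^{xx}$ that remains after subtracting $I$. Because $\hat{R}^{xx}$ is built from gradients, its values are small in absolute terms. To compensate, it is normalized so that each row sums to 1.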
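The row normalization can be sketched like this: $\hat{R}^{xx} = R^{xx} - I$, then $\bar{R}^{xx} = \hat{R}^{xx} / \mathrm{rowsum}(\hat{R}^{xx}) + I$. The name `normalize_self_relevancy` is hypothetical, and leaving all-zero rows untouched (e.g. right after initialization) is my own guard against division by zero, not something these notes specify.

```python
import numpy as np

def normalize_self_relevancy(R_xx):
    n = R_xx.shape[-1]
    R_hat = R_xx - np.eye(n)                      # residual on top of the identity
    row_sum = R_hat.sum(axis=-1, keepdims=True)
    safe = np.where(row_sum == 0, 1.0, row_sum)   # assumption: all-zero rows stay zero
    return R_hat / safe + np.eye(n)

R_xx = np.eye(2) + np.array([[0.01, 0.03],
                             [0.02, 0.02]])
R_bar = normalize_self_relevancy(R_xx)
print(R_bar)
```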

For co-attention / cross-attention, a separate update rule is defined.
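As I read the paper, the bi-modal rules are $R^{sq} \leftarrow R^{sq} + (\bar{R}^{ss})^{\top} \bar{A} \bar{R}^{qq}$ and $R^{ss} \leftarrow R^{ss} + \bar{A} R^{qs}$, where $\bar{A}$ is the head-averaged cross-attention map (queries from $s$, keys from $q$) and $\bar{R}$ is the row-normalized self-relevancy. A NumPy sketch under that reading; the function name and shapes are illustrative assumptions, and the normalized maps are passed in precomputed:

```python
import numpy as np

def cross_attention_update(R_ss, R_sq, R_qs, R_ss_bar, R_qq_bar, A_bar):
    """A_bar has shape (n_s, n_q). Returns the updated (R_ss, R_sq).

    Both updates are computed from the old values before either is applied.
    """
    R_sq_new = R_sq + R_ss_bar.T @ A_bar @ R_qq_bar
    R_ss_new = R_ss + A_bar @ R_qs
    return R_ss_new, R_sq_new

n_s, n_q = 2, 3
A_bar = np.full((n_s, n_q), 0.1)
R_ss, R_qq = np.eye(n_s), np.eye(n_q)
R_sq, R_qs = np.zeros((n_s, n_q)), np.zeros((n_q, n_s))

# Right after initialization the normalized maps are just the identities.
R_ss_new, R_sq_new = cross_attention_update(R_ss, R_sq, R_qs, R_ss, R_qq, A_bar)
```

In this toy call `R_sq_new` equals `A_bar`, and `R_ss_new` is unchanged because `R_qs` is still all zeros.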

Obtaining classification relevancies

Look at the row of the relevancy map that belongs to the [CLS] token: for text relevancies, take the first row of $R^{tt}$; for image relevancies, take the first row of $R^{ti}$.
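In array terms, assuming [CLS] sits at index 0 of the text sequence, that is just row 0 of each map. A toy NumPy sketch with made-up values:

```python
import numpy as np

R_tt = np.array([[1.0, 0.2, 0.1],
                 [0.0, 1.0, 0.3],
                 [0.0, 0.1, 1.0]])       # text x text
R_ti = np.array([[0.4, 0.0, 0.6, 0.1],
                 [0.1, 0.1, 0.1, 0.1],
                 [0.0, 0.2, 0.0, 0.3]])  # text x image

cls_idx = 0
text_relevance = R_tt[cls_idx]    # one score per text token
image_relevance = R_ti[cls_idx]   # one score per image token
top_image_token = int(image_relevance.argmax())
```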

Adaptation to attention type

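Tokens of both modalities are concatenated and fed into SA: the relevancy map is the row of the full $R^{(i+t, i+t)}$ that corresponds to the [CLS] token ($R^{i+t}$).
Each modality first runs its own SA and the two then exchange information through CA (co-attention): all of the propagation rules described above are needed. The relevancy map is then read off the same way as for a classification model.
Encoder-decoder architecture: cross-attention runs in one direction only, so equation 11 can be skipped.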
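For the concatenated case, a toy sketch of slicing the [CLS] row into its text and image parts. The token order (text first, then image, with [CLS] at index 0) is an assumption here and depends on the model; the random values only stand in for a computed relevancy map.

```python
import numpy as np

num_text, num_image = 4, 6
n = num_text + num_image
rng = np.random.default_rng(0)
R = np.eye(n) + rng.random((n, n))   # placeholder for the (i+t) x (i+t) relevancy map

cls_row = R[0]                        # row of the [CLS] token
text_relevance = cls_row[:num_text]   # relevance of each text token
image_relevance = cls_row[num_text:]  # relevance of each image token
```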

Result

long8v commented 7 months ago

see more CLIP score

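The original paper does not cover CLIP, but apparently someone posted this as a final-term project. The logic is roughly as follows.

Feed an image and the texts into the CLIP model.
This yields logits per image (i.e. logits for this image's similarity to each text) and logits per text (cosine similarity multiplied by logit_scale).
Build a "one-hot" label for the positive pairs and multiply it with logits_per_image (the steps below are run for both the text side and the image side).
- initialize a num_tokens x num_tokens relevance matrix ($R$) as the identity, where num_tokens is the sequence length of the image or text tokens
- differentiate the resulting scalar (the one-hot-weighted logits_per_image) with respect to each transformer block's attention map (num heads x seq len x seq len) using autograd.grad
- multiply the gradient element-wise (Hadamard product) with the attention map, average over the head dimension, then bmm the result with R and add it to R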
```
import numpy as np
import torch

# start_layer / start_layer_text come from notebook globals in the original;
# -1 ("use only the last block") is assumed as the default here.
def interpret(image, texts, model, device, start_layer=-1, start_layer_text=-1):
    batch_size = texts.shape[0]
    images = image.repeat(batch_size, 1, 1, 1)
    logits_per_image, logits_per_text = model(images, texts)
    probs = logits_per_image.softmax(dim=-1).detach().cpu().numpy()
    index = [i for i in range(batch_size)]
    # one-hot mask that selects the logit of image copy j against text j
    one_hot = np.zeros((logits_per_image.shape[0], logits_per_image.shape[1]), dtype=np.float32)
    one_hot[torch.arange(logits_per_image.shape[0]), index] = 1.0
    one_hot = torch.from_numpy(one_hot).requires_grad_(True)
    one_hot = torch.sum(one_hot.to(device) * logits_per_image)  # scalar to differentiate
    model.zero_grad()

    # image side; `attn_probs` is stored on each block by the modified CLIP used in the notebook
    image_attn_blocks = list(dict(model.visual.transformer.resblocks.named_children()).values())

    if start_layer == -1:
        # calculate index of last layer
        start_layer = len(image_attn_blocks) - 1

    num_tokens = image_attn_blocks[0].attn_probs.shape[-1]
    R = torch.eye(num_tokens, num_tokens, dtype=image_attn_blocks[0].attn_probs.dtype).to(device)
    R = R.unsqueeze(0).expand(batch_size, num_tokens, num_tokens)
    for i, blk in enumerate(image_attn_blocks):
        if i < start_layer:
            continue
        grad = torch.autograd.grad(one_hot, [blk.attn_probs], retain_graph=True)[0].detach()
        cam = blk.attn_probs.detach()
        cam = cam.reshape(-1, cam.shape[-1], cam.shape[-1])
        grad = grad.reshape(-1, grad.shape[-1], grad.shape[-1])
        cam = grad * cam  # Hadamard product of gradient and attention
        cam = cam.reshape(batch_size, -1, cam.shape[-1], cam.shape[-1])
        cam = cam.clamp(min=0).mean(dim=1)  # keep positives, average over heads
        R = R + torch.bmm(cam, R)
    image_relevance = R[:, 0, 1:]  # [CLS] row, without the [CLS] column

    # text side
    text_attn_blocks = list(dict(model.transformer.resblocks.named_children()).values())

    if start_layer_text == -1:
        # calculate index of last layer
        start_layer_text = len(text_attn_blocks) - 1

    num_tokens = text_attn_blocks[0].attn_probs.shape[-1]
    R_text = torch.eye(num_tokens, num_tokens, dtype=text_attn_blocks[0].attn_probs.dtype).to(device)
    R_text = R_text.unsqueeze(0).expand(batch_size, num_tokens, num_tokens)
    for i, blk in enumerate(text_attn_blocks):
        if i < start_layer_text:
            continue
        grad = torch.autograd.grad(one_hot, [blk.attn_probs], retain_graph=True)[0].detach()
        cam = blk.attn_probs.detach()
        cam = cam.reshape(-1, cam.shape[-1], cam.shape[-1])
        grad = grad.reshape(-1, grad.shape[-1], grad.shape[-1])
        cam = grad * cam
        cam = cam.reshape(batch_size, -1, cam.shape[-1], cam.shape[-1])
        cam = cam.mean(dim=1)  # as transcribed here, this branch has no clamp(min=0)
        R_text = R_text + torch.bmm(cam, R_text)
    text_relevance = R_text

    return text_relevance, image_relevance
```
In summary: take the logit that comes out of the CLIP model, set the output label to 1, compute its derivative with respect to each "attention map", take the Hadamard product of that derivative with the attention map, and accumulate the result across layers!
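The scalar being differentiated in the code above is just the sum of the positive-pair logits. A NumPy sketch with made-up logits, where row j is assumed to be paired with text j (as in `index = [0, 1, ..., batch_size - 1]`):

```python
import numpy as np

logits_per_image = np.array([[20.0, 3.0, 1.0],
                             [2.0, 18.0, 4.0],
                             [0.5, 2.5, 22.0]])
batch_size = logits_per_image.shape[0]

# one-hot mask with a 1 at (j, j) for every row j
one_hot = np.zeros_like(logits_per_image)
one_hot[np.arange(batch_size), np.arange(batch_size)] = 1.0

# masked sum = sum of the diagonal logits
target = float((one_hot * logits_per_image).sum())
print(target)
```

Since the mask only keeps the diagonal, the gradient of `target` reaches each attention map only through the matched image-text logits.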

long8v commented 5 months ago

Easier-to-understand pseudo-code

    def interpret(self, image, texts, model, CLS_idx, device):
        # Assumes `import torch`, a single image-text pair (batch_size = 1), and a
        # Hugging Face style CLIP interface for self.preprocess / model.
        batch_size = 1
        inputs = self.preprocess(text=texts, images=image, padding="max_length", return_tensors="pt")
        inputs = inputs.to(device)
        outputs = model(**inputs, output_attentions=True)
        clip_score = outputs.logits_per_image
        image_attn_blocks = outputs.vision_model_output.attentions  # unused below; the image side is analogous
        text_attn_blocks = outputs.text_model_output.attentions
        model.zero_grad()

        num_tokens = text_attn_blocks[0].shape[-1]
        R_text = torch.eye(num_tokens, num_tokens, dtype=text_attn_blocks[0].dtype).to(device)
        R_text = R_text.unsqueeze(0).expand(batch_size, num_tokens, num_tokens)
        for attn_map in text_attn_blocks:
            # gradient of the scalar CLIP score w.r.t. this layer's attention map
            attn_map_grad = torch.autograd.grad(clip_score.sum(), [attn_map], retain_graph=True)[0].detach()
            attn_map = attn_map.detach()
            attn_map = attn_map.reshape(-1, attn_map.shape[-1], attn_map.shape[-1])
            attn_map_grad = attn_map_grad.reshape(-1, attn_map_grad.shape[-1], attn_map_grad.shape[-1])
            cam = attn_map * attn_map_grad  # Hadamard product
            cam = cam.reshape(batch_size, -1, cam.shape[-1], cam.shape[-1])
            cam = cam.clamp(min=0).mean(dim=1)  # keep positives, average over heads
            R_text = R_text + torch.bmm(cam, R_text)
        # row CLS_idx of the single batch item, columns 1 .. CLS_idx - 1
        return R_text[0, CLS_idx, 1:CLS_idx]
