[102] Attention Augmented Convolutional Networks

TL;DR

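I read this because.. : Cited by DETR. DETR notes that the transformer FFN behaves like a 1x1 convolution, so the encoder, through its FFN, can be viewed as an "attention augmented convolutional network". I read the paper because I was curious about that claim.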
task : image classification / object detection
problem : CNNs only see local information, whereas self-attention can capture long-range dependencies.
idea : Combine the two!
architecture : Given an input image, apply MSA over the (h, w) dimensions (hidden vector = channel dimension). Each pixel gets a relative positional embedding. Concatenating the MSA output with the conv output gives the attention-augmented convolution.
baseline : ResNet50, RetinaNet50, channel-wise reweighting (Squeeze-and-Excitation, Gather-Excite), independent channel / spatial reweighting (BAM, CBAM)
data : ImageNet, COCO
evaluation : accuracy / mAP
result : Applied to ResNet50 on ImageNet, accuracy improved by 1.3%; applied to RetinaNet on COCO, mAP improved by 1.4.
contribution :
limitation / things I cannot understand :
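
To make the architecture item above concrete, here is a minimal sketch in plain Python (nested lists, no framework) of the concatenation it describes: a 3x3 conv branch and a multi-head self-attention branch computed over all (h, w) positions, joined along the channel axis. This is not the paper's implementation. The helper names (`matmul`, `conv3x3`, `self_attention`, `attention_augmented_conv`), the random weights, and the sizes are all hypothetical choices for illustration. As loudly stated simplifications, it omits the relative positional embeddings and any output projection after concatenating the heads.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rng, rows, cols):
    """Hypothetical random weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def self_attention(pixels, wq, wk, wv):
    """One attention head over all pixel positions; pixels is (H*W, C_in).
    Simplification: no relative positional embeddings are added to the logits."""
    q, k, v = matmul(pixels, wq), matmul(pixels, wk), matmul(pixels, wv)
    scale = 1.0 / math.sqrt(len(wq[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)  # (H*W, d_v)

def conv3x3(image, kernel):
    """Zero-padded 3x3 convolution; image is H x W x C_in, kernel is (9*C_in, C_out)."""
    h, w, c_in = len(image), len(image[0]), len(image[0][0])
    out = []
    for i in range(h):
        patches = []
        for j in range(w):
            patch = []
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    inside = 0 <= y < h and 0 <= x < w
                    patch.extend(image[y][x] if inside else [0.0] * c_in)
            patches.append(patch)
        out.append(matmul(patches, kernel))  # W x C_out for this image row
    return out

def attention_augmented_conv(image, kernel, heads):
    """Concatenate conv features and multi-head attention features per pixel."""
    h, w = len(image), len(image[0])
    pixels = [px for row in image for px in row]  # flatten to (H*W, C_in)
    conv_out = conv3x3(image, kernel)
    head_outs = [self_attention(pixels, wq, wk, wv) for wq, wk, wv in heads]
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            attn = [val for head in head_outs for val in head[i * w + j]]
            row.append(conv_out[i][j] + attn)  # list concatenation along channels
        out.append(row)
    return out

rng = random.Random(0)
H, W, C_IN, C_CONV, D_K, D_V, N_HEADS = 4, 4, 3, 5, 2, 2, 2
image = [[[rng.uniform(-1, 1) for _ in range(C_IN)] for _ in range(W)] for _ in range(H)]
kernel = rand_matrix(rng, 9 * C_IN, C_CONV)
heads = [
    (rand_matrix(rng, C_IN, D_K), rand_matrix(rng, C_IN, D_K), rand_matrix(rng, C_IN, D_V))
    for _ in range(N_HEADS)
]
out = attention_augmented_conv(image, kernel, heads)
print(len(out), len(out[0]), len(out[0][0]))  # 4 4 9
```

The output has C_CONV + N_HEADS * D_V = 5 + 4 = 9 channels per pixel: the first 5 come from the local conv branch, the last 4 from attention over every position in the image.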