[85] Dynamic Head: Unifying Object Detection Heads with Attentions

TL;DR

task : object detection
problem : Prior work improved each of the following separately, 1) scale-aware : image pyramid, feature pyramid, ... 2) spatial-aware : convolution, deformable conv, ... 3) task-aware : OD with segmentation, two-stage, FCOS (center instead of bbox), ... but no paper tried to do all three well at once!
idea : Apply a separate attention along each dimension of the L (= number of feature levels) x S (= spatial, W x H) x C (= number of channels, task) feature tensor!
architecture : scale uses a 1x1 conv followed by a hard sigmoid -> spatial uses deformable attention -> task slices out the c-th channel and takes a max, so that the channel is switched on or off for the given task. This dynamic head attention block can be dropped into the middle of either a two-stage or a one-stage detector.
objective : object detection loss
baseline : Mask-RCNN, Cascade-RCNN, FCOS, ATSS, BorderDet, DETR, ...
data : MS-COCO
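result : Applying DyHead to an object detection model consistently improves performance. Close to SOTA.
contribution : Diversifies attention by applying it along each dimension separately.
limitation or unclear parts : I can't quite picture the exact shape of the output of each of the three attentions.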
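
On the shape question, a minimal NumPy sketch of the scale-aware step may help. It assumes the pyramid levels have already been resized to a common resolution and flattened to F with shape (L, S, C). The names `scale_aware_attention`, `w`, and `b` are made up for illustration, and the 1x1 conv is modeled as a plain linear map from C channels to 1, so this is a reading of the description above rather than the authors' implementation. The point is that the attention is a gate that broadcasts against F, so the output keeps the L x S x C shape.

```python
import numpy as np

def hard_sigmoid(x):
    """Hard sigmoid as used in the paper: max(0, min(1, (x + 1) / 2))."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def scale_aware_attention(F, w, b):
    """F: (L, S, C) features; w: (C,) and b: scalar stand in for a 1x1 conv (assumption)."""
    pooled = F.mean(axis=1)               # average over space -> (L, C)
    logits = pooled @ w + b               # linear map C -> 1, one logit per level -> (L,)
    gate = hard_sigmoid(logits)           # one weight in [0, 1] per feature level
    return gate[:, None, None] * F, gate  # gate broadcasts over S and C

rng = np.random.default_rng(0)
L, S, C = 3, 4 * 4, 8                     # 3 levels, 4x4 spatial grid, 8 channels
F = rng.normal(size=(L, S, C))
w = rng.normal(size=C)
out, gate = scale_aware_attention(F, w, b=0.0)
print(out.shape, gate.shape)              # (3, 16, 8) (3,)
```

The spatial-aware and task-aware steps follow the same pattern: each one reweights or re-samples F and returns a tensor of the same L x S x C shape, which is what lets several blocks be stacked.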

Where ordinary self-attention is a single attention over the whole feature tensor, dynamic head applies attention separately along L, S, and C!
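
For reference, the two forms being compared, as I read them in the paper, with $F \in \mathbb{R}^{L \times S \times C}$ and $\pi(\cdot)$ an attention function. The first is the general self-attention form, the second is the decomposition into scale-aware ($\pi_L$), spatial-aware ($\pi_S$), and task-aware ($\pi_C$) attentions applied in sequence:

$$
W(F) = \pi(F) \cdot F
$$

$$
W(F) = \pi_C\Big(\pi_S\big(\pi_L(F) \cdot F\big) \cdot F\Big) \cdot F
$$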

The spatial-aware attention uses deformable attention.
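
As far as I can tell from the paper, the spatial-aware attention is written as below. $K$ is the number of sparsely sampled locations, $p_k + \Delta p_k$ is a sampling location shifted by a learned offset $\Delta p_k$, and $\Delta m_k$ is a learned importance scalar at location $p_k$. Features are first sampled sparsely at each level and then averaged across the $L$ levels at the same spatial location:

$$
\pi_S(F) \cdot F = \frac{1}{L} \sum_{l=1}^{L} \sum_{k=1}^{K} w_{l,k} \cdot F(l;\, p_k + \Delta p_k;\, c) \cdot \Delta m_k
$$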

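$F_c$ : the slice of the feature map at the c-th channel
$\theta$ : global average pooling over the L x S dimensions, then 2 FC layers -> normalization -> sigmoid, which is how the thresholding is implemented (seems to be omitted from the equation?)
$\alpha$, $\beta$ : the outputs of the activation-thresholding function $\theta$ above.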
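
Putting these symbols together, the task-aware attention is $\max(\alpha^1 \cdot F_c + \beta^1,\ \alpha^2 \cdot F_c + \beta^2)$. Below is a rough NumPy sketch of that step under several assumptions: the weight matrices `W1` and `W2`, the hidden size `H`, the ReLU between the two FC layers, and the exact form of the normalization are all placeholders chosen for illustration, and the shifted sigmoid mapping to (-1, 1) follows my reading of the paper. It is not the official implementation.

```python
import numpy as np

def theta(F, W1, W2):
    """Hyper function: (L, S, C) -> array of shape (4, C) holding alpha1, alpha2, beta1, beta2."""
    C = F.shape[-1]
    pooled = F.mean(axis=(0, 1))                   # global average pool over L x S -> (C,)
    hidden = np.maximum(pooled @ W1, 0.0)          # first FC layer + ReLU (ReLU is an assumption)
    raw = hidden @ W2                              # second FC layer -> (4C,)
    raw = (raw - raw.mean()) / (raw.std() + 1e-6)  # simple normalization (assumed form)
    coeffs = 2.0 / (1.0 + np.exp(-raw)) - 1.0      # shifted sigmoid -> values in (-1, 1)
    return coeffs.reshape(4, C)

def task_aware_attention(F, W1, W2):
    """Per-channel max of two learned affine maps; output has the same shape as F."""
    a1, a2, b1, b2 = theta(F, W1, W2)              # each (C,), broadcast over L and S
    return np.maximum(a1 * F + b1, a2 * F + b2)

rng = np.random.default_rng(0)
L, S, C, H = 3, 16, 8, 4                           # H is a hypothetical hidden size
F = rng.normal(size=(L, S, C))
W1 = rng.normal(size=(C, H))
W2 = rng.normal(size=(H, 4 * C))
out = task_aware_attention(F, W1, W2)
print(out.shape)                                   # (3, 16, 8)
```

Because $\alpha$ and $\beta$ depend on the input, each channel gets its own input-dependent activation, which is what lets a channel be switched on or off per task.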

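one stage detector : Prior work found that the cls subnetwork and the bbox regressor behave very differently. Unlike the conventional approach of separate branches, a single unified branch attached to the backbone predicts both cls and bbox. This is thanks to DyHead!
two stage detector : DyHead is applied before RoI pooling.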