[72] Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

TL;DR

task : object detection, efficient DETR
problem : Deformable DETR uses deformable attention to reduce the number of keys each query attends to, but because it uses multi-scale features, the number of encoder input tokens grows about 20x relative to single-scale DETR, so inference actually becomes slower.
idea : Images are mostly background, and attention only needs to cover the salient objects. So make the tokens that enter the encoder sparse!
architecture : Deformable DETR plus a score network that measures the objectness of each encoder input token. The score network can be trained either 1) by attaching a detection head to the backbone feature map and training it like an auxiliary loss, or 2) with the Decoder Attention Map (DAM): tokens with the top p% of values in the decoder cross-attention map are labeled 1 and the rest 0, and these pseudo-labels supervise the score network.
objective : DETR loss + an auxiliary loss from detection heads also attached to the encoder
baseline : Faster R-CNN, DETR, DETR-DC5, Deformable DETR
data : COCO 2017
result : performance comparable to Deformable DETR even when only 10% of the encoder tokens are used
contribution : a more efficient DETR than Deformable DETR
limitation or unclear points :

With 12 encoder layers, training fails without the auxiliary loss.
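
The DAM pseudo-labeling described under architecture can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names `dam_pseudo_labels` and `bce_loss` are hypothetical, the per-token attention mass is assumed to be already aggregated from the decoder cross-attention maps, and binary cross-entropy is assumed as the loss for the score network.

```python
import math


def dam_pseudo_labels(attention_mass, top_percent):
    """Label the top `top_percent`% of tokens (by attention mass) 1, the rest 0.

    `attention_mass` is assumed to hold one non-negative value per encoder
    token, already accumulated over decoder queries (hypothetical input).
    """
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    n = len(attention_mass)
    # Integer arithmetic avoids float rounding surprises; keep at least one token.
    num_keep = max(1, n * top_percent // 100)
    # Rank token indices by descending mass; ties go to the lower index.
    ranked = sorted(range(n), key=lambda i: (-attention_mass[i], i))
    keep = set(ranked[:num_keep])
    return [1 if i in keep else 0 for i in range(n)]


def bce_loss(scores, labels):
    """Mean binary cross-entropy between scores in (0, 1) and 0/1 labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, y in zip(scores, labels):
        total -= y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps)
    return total / len(scores)


# Toy example: 10 encoder tokens, keep the top 30% as positives.
attention_mass = [0.01, 0.30, 0.02, 0.05, 0.25, 0.01, 0.03, 0.20, 0.08, 0.05]
labels = dam_pseudo_labels(attention_mass, top_percent=30)
print(labels)  # [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

# A score network whose outputs agree with the labels gets a lower loss.
good_scores = [0.9 if y else 0.1 for y in labels]
bad_scores = [0.1 if y else 0.9 for y in labels]
print(bce_loss(good_scores, labels) < bce_loss(bad_scores, labels))  # True
```

At inference time the same ranking idea would apply to the score network's outputs: only the highest-scoring tokens are passed on for refinement in the encoder, and the rest are left as they are.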