problem : existing detectors need many hand-designed components, like a non-maximum suppression procedure or anchor generation,
that explicitly encode our prior knowledge about the task
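idea : directly predict the set of objects, trained with a bipartite matching loss
architecture : CNN backbone + transformer encoder + transformer decoder with object queries (= learned positional embeddings) + bbox / cls prediction heads
objective : Hungarian loss = CE loss (class) + L1 loss + generalized IoU loss (box)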
baseline : Faster R-CNN
data : COCO
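result : on par with a well-tuned Faster R-CNN baseline on COCO (better on large objects, worse on small objects)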
contribution : transformer-based object detection model without NMS!
limitation or unclear parts : longer training time, low performance on small objects
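
sketch : a minimal, pure-Python illustration of the bipartite matching step, not the paper's implementation. Assumptions: boxes are normalized (x0, y0, x1, y1) tuples, the last class index stands for "no object", the cost weights `w_l1` and `w_giou` are illustrative, and the optimal assignment is found by brute force over permutations as a stand-in for the Hungarian algorithm (only feasible for tiny inputs). All function names here are hypothetical.

```python
from itertools import permutations


def box_area(b):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes count as 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def generalized_iou(a, b):
    """Generalized IoU of two boxes: IoU minus the empty share of the hull."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    hull = box_area((min(a[0], b[0]), min(a[1], b[1]),
                     max(a[2], b[2]), max(a[3], b[3])))
    return inter / union - (hull - union) / hull


def match_cost(pred, target, w_l1=5.0, w_giou=2.0):
    """Pairwise cost: low when the class is likely and the boxes agree."""
    class_term = -pred["probs"][target["label"]]
    l1_term = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    giou_term = 1.0 - generalized_iou(pred["box"], target["box"])
    return class_term + w_l1 * l1_term + w_giou * giou_term


def bipartite_match(preds, targets):
    """Return sorted (pred_idx, target_idx) pairs with minimal total cost.

    Assumes len(preds) >= len(targets); unmatched predictions are the ones
    that would be trained towards the "no object" class.
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return sorted((i, j) for j, i in enumerate(best_perm))


# Three predictions (one per object query), two ground-truth objects.
preds = [
    {"probs": [0.10, 0.80, 0.10], "box": (0.50, 0.50, 0.90, 0.90)},
    {"probs": [0.70, 0.20, 0.10], "box": (0.10, 0.10, 0.40, 0.40)},
    {"probs": [0.05, 0.05, 0.90], "box": (0.00, 0.00, 1.00, 1.00)},
]
targets = [
    {"label": 0, "box": (0.10, 0.10, 0.40, 0.45)},
    {"label": 1, "box": (0.55, 0.50, 0.90, 0.90)},
]

matches = bipartite_match(preds, targets)
print(matches)  # [(0, 1), (1, 0)]
```

Prediction 2 is left unmatched, so under this scheme it would only receive a "no object" classification loss. Because every ground-truth object is assigned to exactly one prediction, duplicates are penalized during training, which is why no NMS step is needed at inference.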
paper
TL;DR
Details
notion