(This paper was written at POSTECH.)

Abstract

Weakly Supervised Object Detection (WSOD) is the task of detecting objects in an image with a model trained only on image-level annotations. Recent SOTA models rely on self-supervised instance-level supervision, but weak supervision contains no count or location information, and most methods use an argmax labeling method, so many object instances are often ignored. To alleviate this, the paper proposes a multiple instance labeling method called 'object discovery'. It also proposes a weakly supervised contrastive loss (WSCL), a contrastive loss for the weakly supervised setting in which no instance-level information is available for sampling. By making embedding vectors of the same class consistent, this loss aims to construct a credible similarity threshold for object discovery.

Introduction

In object detection, collecting fine-grained bounding box annotations for an entire large dataset is too time-consuming. WSOD was introduced as an attempt to use more cost-efficient annotations. (Such settings use image-level, point, or scribble labels.)
However, the limitation of WSOD is that its performance still falls far short of its fully-supervised counterparts. The paper attributes this gap to roughly three causes.

Part domination: WSOD models focus only on the discriminative part of an object. This stems from formulating WSOD as a Multiple Instance Learning (MIL) problem, which is prone to local minima.
Grouped instances: Multiple instances of the same category are grouped into one large proposal. Because image-level annotations carry only class information, with no object locations or counts, existing methods simply treat the highest-scoring proposal as the "pseudo groundtruth". This avoids false positives, but objects are missed and less-obvious instances are often ignored.
The argmax-based algorithm misses 7,306 of 12,608 instances on PASCAL VOC (about 58% by these counts) and 533,396 of 894,204 instances on MS-COCO (about 60%).

Solutions proposed in the paper

Object discovery explores all proposal candidates using a similarity measure to the highest-scoring representation.
We further suggest a weakly supervised contrastive loss (WSCL) to set a reliable similarity threshold.
WSCL encourages a model to learn similar features for objects in the same class, and to learn discriminative features for objects in different classes.
We provide a large number of positive and negative instances for WSCL through three feature augmentation methods suitable for WSOD. Securing a well-behaved embedding space in this way makes it possible to produce more realistic pseudo groundtruths.
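The two ideas in the list above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: `discover_objects`, `contrastive_loss`, the cosine similarity, the threshold `tau`, and the `temperature` are hypothetical choices, and the loss is a generic supervised-contrastive-style formulation in which `labels` stand in for whatever positive/negative assignment the method produces.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def discover_objects(embeddings, scores, tau=0.8):
    """Indices of proposals whose embedding is close to the top-scoring one.

    embeddings: per-proposal feature vectors (hypothetical input)
    scores:     per-proposal scores for a single class
    tau:        similarity threshold (illustrative value, not from the paper)
    """
    top = max(range(len(scores)), key=lambda i: scores[i])  # argmax proposal
    return [i for i, e in enumerate(embeddings)
            if cosine(e, embeddings[top]) >= tau]

def contrastive_loss(embeddings, labels, temperature=0.2):
    """Generic contrastive loss: pull same-label embeddings together,
    push different-label embeddings apart (assumed form, not the exact WSCL)."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature
                  for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Proposals 0 and 1 look alike; proposal 2 is unrelated.
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
found = discover_objects(emb, scores=[0.9, 0.3, 0.2])
```

With plain argmax labeling only proposal 0 would be kept, whereas the similarity test also recovers proposal 1. The quality of that test depends on the embedding space, which is why a loss that tightens same-class embeddings helps when choosing the threshold.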

Method

Multiple Instance Learning Head

As shown in Fig. 2, the MIL head takes RoI feature vectors as input and returns classification scores and detection scores. (The classification score is obtained by applying a softmax over classes, and the detection score by applying a softmax over regions.) The proposal scores are the element-wise product of the classification and detection scores. The image score for the c-th class is the sum of the proposal scores over all regions. The image-level classification loss L_mil is computed from these image scores and the image-level labels.
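The computation described above can be sketched in plain Python. This is a sketch under assumptions: the function names are hypothetical, the inputs are raw logits of shape regions x classes, and the loss is written as a binary cross-entropy over image scores, which is the form commonly used in WSDDN-style MIL heads and is assumed here rather than copied from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def mil_head(cls_logits, det_logits):
    """cls_logits, det_logits: nested lists of shape R x C (regions x classes)."""
    R, C = len(cls_logits), len(cls_logits[0])
    # Classification score: softmax over classes, one row per region.
    cls = [softmax(row) for row in cls_logits]
    # Detection score: softmax over regions, one column per class.
    det_cols = [softmax([det_logits[r][c] for r in range(R)]) for c in range(C)]
    # Proposal score: element-wise product of the two scores.
    proposal = [[cls[r][c] * det_cols[c][r] for c in range(C)] for r in range(R)]
    # Image score for class c: sum of proposal scores over all regions.
    image = [sum(proposal[r][c] for r in range(R)) for c in range(C)]
    return proposal, image

def mil_loss(image_scores, labels, eps=1e-7):
    """Binary cross-entropy over image scores (assumed form of L_mil)."""
    loss = 0.0
    for s, y in zip(image_scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return loss

# Three regions, two classes; regions 0 and 1 strongly suggest class 0.
cls_logits = [[5.0, 0.0], [4.0, 0.0], [0.0, 0.0]]
det_logits = [[3.0, 0.0], [2.0, 0.0], [0.0, 0.0]]
proposal_scores, image_scores = mil_head(cls_logits, det_logits)
```

Because the detection scores of each class sum to 1 over regions and every classification score is at most 1, each image score stays in [0, 1], so it can be fed directly into a cross-entropy against the image-level label.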

Refinement Head

The role of the refinement head is to integrate a self-supervised training strategy through instance-level supervision. Given M proposals and C classes, a background class is added, for C+1 classes in total. The instance-level supervision at the k-th stage is determined by the output of the previous stage. For example, the first instance classifier uses the MIL head output as its supervision. An instance-level pseudo label is set to 1 if the proposal overlaps the high-scoring proposal sufficiently, and 0 otherwise. The overlap threshold is 0.5.

The classification loss of the refinement head is defined over these instance-level pseudo labels.
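The pseudo-labeling rule of the refinement head can be sketched as below. The sketch assumes that "overlap" means IoU with the top-scoring proposal of each class present in the image, which is the usual OICR-style rule; the function names, the box format (x1, y1, x2, y2), and the use of `>=` at the threshold are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def pseudo_labels(boxes, prev_scores, image_labels, thresh=0.5):
    """Return an M x (C+1) one-hot label matrix; the last column is background.

    boxes:        M proposal boxes
    prev_scores:  M x C scores from the previous stage (MIL head for stage 1)
    image_labels: C binary image-level labels
    """
    M, C = len(boxes), len(image_labels)
    labels = [[0] * C + [1] for _ in range(M)]  # every proposal starts as background
    best_iou = [0.0] * M
    for c in range(C):
        if not image_labels[c]:
            continue  # class absent from the image: no positives for it
        top = max(range(M), key=lambda r: prev_scores[r][c])
        for r in range(M):
            overlap = iou(boxes[r], boxes[top])
            if overlap >= thresh and overlap > best_iou[r]:
                labels[r] = [0] * (C + 1)
                labels[r][c] = 1
                best_iou[r] = overlap
    return labels

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
prev_scores = [[0.9, 0.0], [0.4, 0.0], [0.1, 0.0]]
labels = pseudo_labels(boxes, prev_scores, image_labels=[1, 0])
```

Only proposals near the single top-scoring box become positives here, so a second instance of the same class elsewhere in the image would be labeled background. This is the missing-instance behavior that object discovery is meant to address.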
