[21] Integrative Few-Shot Learning for Classification and Segmentation

Introduction

Few-shot learning addresses problems in which only a limited number of samples are available as supervision at training time. Few-shot classification (FS-C) assigns a query image to one of the target classes, given a small support set for each target class. Few-shot segmentation (FS-S) uses a similar setup but segments the target-class regions in the query image.

Limitations of prior work
- Few-shot classification (FS-C) and few-shot segmentation (FS-S) have been studied separately even though they are closely related.
- Conventional FS-C assumes the query contains exactly one of the target classes (queries with multiple classes are not considered), whereas FS-S allows multiple classes but does not consider the case where no target class is present. For example, as in Fig. 1, when the query image contains none of the target classes, FS-S learners typically segment an arbitrary object in the query; they blindly highlight salient objects that do not match the support semantics. To address these problems, the paper combines the two few-shot learning problems into a single multi-label, background-aware prediction problem.
Given a query image and a few-shot support set for the target classes, the model identifies the presence of each target class and predicts its foreground mask in the query.
Unlike conventional FS-C and FS-S, classification does not assume class exclusiveness, and segmentation does not assume that every target class is present.
Integrative few-shot learning (iFSL) shares class-wise foreground maps between classification and segmentation, which is what allows multi-label classification and pixel-wise segmentation to be combined.
The paper designs the Attentive Squeeze Network (ASNet), which computes semantic correlations between query and support image features and transforms the correlation tensor into a foreground map via strided self-attention.
By leveraging multi-layer neural features and global self-attention, it produces reliable foreground maps.

Multi-label background-aware prediction

Conventional FS-C assigns the query to exactly one target class, so it cannot handle a query that belongs to none of the target classes or to several of them.
To address this limitation, the task is generalized to multi-label classification with a background class.
The learner compares semantic similarities between the query and the support images and estimates class-wise occurrences. If no target class is detected, the query is classified as background. Because of this relaxed constraint the query no longer has to be assigned to exactly one class, so more general situations can be handled.

Integration of classification and segmentation

FS-CS (few-shot classification and segmentation) integrates multi-label few-shot classification with semantic segmentation by adding pixel-level spatial reasoning.
Conventional FS-S assumes the query classes exactly match the support class set, whereas FS-CS relaxes this assumption by allowing the query classes to be a subset of the support classes.
The integrated few-shot learner predicts multi-label, background-aware class occurrences and, under the relaxed constraint, simultaneously predicts the segmentation map.

Method

Integrative few-shot learning (iFSL) is trained with either class tags or segmentation supervision. The integrative few-shot learner is denoted f: it takes a query image x and a support set S as input and outputs class-wise foreground maps Y.

Inference

At inference, both class-wise occurrences and segmentation masks are inferred on top of the set of foreground maps Y.
Class-wise occurrences are predicted by max pooling each foreground map and then thresholding the result. Max pooling is used because average pooling was found to misclassify small objects in multi-label classification.
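
The occurrence rule can be illustrated with a minimal NumPy sketch. The function name, the array layout (N, H, W), and the threshold value 0.5 are assumptions made for illustration, not details taken from these notes. The toy input also shows why average pooling would miss a small object that max pooling catches.

```python
import numpy as np

def predict_occurrences(fg_maps, threshold=0.5):
    """fg_maps: (N, H, W) foreground probabilities, one map per target class.
    The threshold value is an assumption for this sketch."""
    # Spatial max pooling: the single most confident pixel of each class map.
    peak = fg_maps.reshape(fg_maps.shape[0], -1).max(axis=1)
    return peak >= threshold

fg_maps = np.zeros((3, 8, 8))
fg_maps[0, 2, 3] = 0.9   # tiny object of class 0: one confident pixel
fg_maps[1] = 0.2         # class 1 is weakly activated everywhere

occurrences = predict_occurrences(fg_maps)
is_background = not occurrences.any()      # no target class detected
mean_scores = fg_maps.mean(axis=(1, 2))    # what average pooling would see
```

Here class 0 is detected through its single confident pixel, while its average-pooled score stays far below the threshold.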

For segmentation, a segmentation probability tensor is derived from the class-wise foreground maps. The N class-wise background maps are merged on the fly into a single episodic background map: the probability maps of not being foreground are averaged to obtain the episodic background map, which is then concatenated with the class-wise foreground maps.

The final segmentation mask is then obtained as in Eq. 5, by assigning each position to its most probable class.
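
A small NumPy sketch of the segmentation inference described above: average the "not foreground" maps into an episodic background map, concatenate it with the foreground maps, and take the per-pixel argmax. The function name and the label convention (0 for background, n for the n-th target class) are assumptions for illustration.

```python
import numpy as np

def segment(fg_maps):
    """fg_maps: (N, H, W) class-wise foreground probabilities.
    Returns (H, W) integer labels: 0 = background, n = n-th target class."""
    # Episodic background map: mean probability of NOT being foreground.
    bg_map = (1.0 - fg_maps).mean(axis=0, keepdims=True)
    # Segmentation probability tensor of shape (N + 1, H, W).
    probs = np.concatenate([bg_map, fg_maps], axis=0)
    # Each position is assigned to its most probable class.
    return probs.argmax(axis=0)

fg_maps = np.full((2, 4, 4), 0.1)
fg_maps[0, :2, :] = 0.8   # class 1 covers the top two rows
fg_maps[1, 3, :] = 0.7    # class 2 covers the bottom row
mask = segment(fg_maps)
```

In this toy example the top two rows are labeled 1, the third row falls to background, and the bottom row is labeled 2.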

Learning objective

The classification loss is the BCE loss between the spatially average-pooled scores and the ground-truth class labels. The segmentation loss is the average cross-entropy between the class distribution at each position and the ground-truth segmentation annotation. Both losses serve the same classification goal and differ only in whether classification happens at the image level or the pixel level. Choosing one of the two determines the supervision level used during training.
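
The two objectives can be sketched in NumPy as follows. This is an illustrative reading of the description above, not the authors' code: the function names, the clipping constant EPS, and the tensor layouts are assumptions.

```python
import numpy as np

EPS = 1e-7  # numerical guard for log; an assumption of this sketch

def classification_loss(fg_maps, class_labels):
    """BCE between spatially average-pooled scores and binary class labels.
    fg_maps: (N, H, W) probabilities; class_labels: (N,) floats in {0, 1}."""
    scores = np.clip(fg_maps.mean(axis=(1, 2)), EPS, 1 - EPS)
    bce = -(class_labels * np.log(scores)
            + (1 - class_labels) * np.log(1 - scores))
    return bce.mean()

def segmentation_loss(seg_probs, seg_labels):
    """Mean pixel-wise cross-entropy.
    seg_probs: (N + 1, H, W) class distribution per pixel (background first);
    seg_labels: (H, W) integer class index per pixel."""
    h, w = seg_labels.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    picked = seg_probs[seg_labels, rows, cols]  # probability of the true class
    return -np.log(np.clip(picked, EPS, 1.0)).mean()
```

Training with image-level class tags would use only `classification_loss`, and training with masks would use only `segmentation_loss`, which mirrors the choice of supervision level described above.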

Overall flow

ASNet takes as input pyramidal cross-correlation tensors between the query and support images, built from feature pyramids and referred to as hypercorrelations. The pyramidal correlations are fed to pyramidal AS layers that gradually squeeze the spatial dimensions of the support image, and the pyramidal outputs are merged into the final foreground map along a bottom-up pathway. The N-way output maps are computed in parallel, yielding the class-wise foreground maps.

Attentive Squeeze Network(ASNet)

Hypercorrelation construction
- NK hypercorrelations are computed between the query image and the NK support images. Feature pyramids are collected from the unit layers corresponding to the bottleneck blocks of ResNet-50, and the cosine similarity between corresponding query and support features yields a 4D correlation tensor of size Hq × Wq × Hs × Ws.
- The per-layer correlation tensors (layer index l) are grouped into P groups of identical spatial size. The tensors in each group are then concatenated along a new channel axis to form the hypercorrelation pyramid; the channel size of the p-th group equals the number of tensors concatenated in that group.
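
The construction of one pyramid level can be sketched with NumPy. The function names and the (C, H, W) feature layout are assumptions, and random arrays stand in for backbone features; any post-processing of the similarities is left out.

```python
import numpy as np

def cosine_correlation(query_feat, support_feat, eps=1e-8):
    """query_feat: (C, Hq, Wq), support_feat: (C, Hs, Ws).
    Returns a 4D correlation tensor of shape (Hq, Wq, Hs, Ws)."""
    # L2-normalize every spatial position along the channel axis.
    q = query_feat / (np.linalg.norm(query_feat, axis=0, keepdims=True) + eps)
    s = support_feat / (np.linalg.norm(support_feat, axis=0, keepdims=True) + eps)
    # Dot product between every query position and every support position.
    return np.einsum("chw,cij->hwij", q, s)

def hypercorrelation_group(query_feats, support_feats):
    """Correlate layer by layer, then stack layers that share one spatial
    size along a new channel axis: (Hq, Wq, Hs, Ws, number_of_layers)."""
    corrs = [cosine_correlation(q, s) for q, s in zip(query_feats, support_feats)]
    return np.stack(corrs, axis=-1)

rng = np.random.default_rng(0)
# Two layers with different channel widths but the same spatial size.
query_feats = [rng.normal(size=(16, 6, 6)), rng.normal(size=(32, 6, 6))]
support_feats = [rng.normal(size=(16, 5, 5)), rng.normal(size=(32, 5, 5))]
group = hypercorrelation_group(query_feats, support_feats)
```

Repeating this for each of the P spatial sizes would give the full pyramid, with the last axis playing the role of the channel dimension.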

Attentive squeeze layer (AS layer)

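An AS layer transforms a correlation tensor into one with smaller support dimensions via strided self-attention. A correlation tensor C from the hypercorrelation pyramid is treated as an Hq × Wq grid in which every query position holds its own support correlation tensor. The goal of the AS layer is to analyze the global context of each support correlation tensor while reducing the support dimensions and preserving the query dimensions.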

To learn the holistic pattern of each support correlation, a global self-attention mechanism is applied as the correlational feature transform. The self-attention weights are shared across all query positions and processed in parallel.
In detail, as in ordinary self-attention, the layer first builds target (T), key (K), and value (V) embeddings from the support correlation tensor C. The attention context is computed from T and K and normalized with a softmax. The attended representation is then fed to an MLP layer and combined with the input; because the input and output dimensions do not match, the input is first passed through a convolutional layer. The result is fed to a second MLP layer.
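
The core of the attentive squeeze can be sketched in NumPy under simplifying assumptions: plain linear projections stand in for the T, K, V embeddings, the target is taken on a strided subset of support positions, standard scaled dot-product attention is used, and the MLP, convolution, and normalization steps are omitted. All names and shapes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_squeeze(corr, w_t, w_k, w_v, stride=2):
    """corr: (Hq, Wq, Hs, Ws, C_in); w_t, w_k, w_v: (C_in, C_out).
    The projection weights are shared by all query positions.
    Returns (Hq, Wq, Hs', Ws', C_out) with Hs' = ceil(Hs / stride)."""
    hq, wq, hs, ws, _ = corr.shape
    # Targets live on a strided subset of support positions, so the
    # support dimensions of the output shrink by the stride.
    target = corr[:, :, ::stride, ::stride, :] @ w_t
    hs2, ws2 = target.shape[2:4]
    t = target.reshape(hq, wq, hs2 * ws2, -1)
    k = (corr @ w_k).reshape(hq, wq, hs * ws, -1)
    v = (corr @ w_v).reshape(hq, wq, hs * ws, -1)
    # Every target attends to ALL support positions (global context).
    scores = t @ k.transpose(0, 1, 3, 2) / np.sqrt(k.shape[-1])
    out = softmax(scores, axis=-1) @ v
    return out.reshape(hq, wq, hs2, ws2, -1)

rng = np.random.default_rng(0)
corr = rng.normal(size=(4, 4, 8, 8, 3))
w1 = [rng.normal(size=(3, 16)) for _ in range(3)]
w2 = [rng.normal(size=(16, 16)) for _ in range(3)]
squeezed = attentive_squeeze(corr, *w1, stride=2)        # support 8x8 -> 4x4
point = attentive_squeeze(squeezed, *w2, stride=4)       # support 4x4 -> 1x1
```

Stacking such layers shrinks the support dimensions step by step while the query dimensions Hq × Wq stay unchanged.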

Multi-layer fusion

The pyramidal correlational representations are merged by upsampling, addition, and non-linear transformation.

First, the bottommost correlational representation is bilinearly upsampled to the spatial dimensions of the adjacent earlier representation, and the two are added to obtain a mixed representation.
The mixed representation is fed to two sequential AS layers until the support dimensions reach H' = W' = 1.
The output of the earliest fusion layer is fed to a convolutional decoder made of interleaved 2D convolutions and bilinear upsampling, which maps the C-dimensional channels to 2 (foreground, background) and the output spatial size to the input query image size. The final output therefore has the same spatial size as the query image.

Class-wise foreground map computation

The K-shot output foreground activation maps are averaged to obtain the mask prediction for each class. A softmax over the two channels then normalizes the averaged output map into a foreground probability.
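
As a minimal sketch of this last step, assuming each shot yields a 2-channel activation map with channel 0 as background and channel 1 as foreground (the channel order and the function name are assumptions):

```python
import numpy as np

def class_foreground_map(shot_outputs):
    """shot_outputs: (K, 2, H, W) decoder activations for one class,
    channel 0 = background, channel 1 = foreground (assumed order).
    Returns the (H, W) foreground probability map for that class."""
    mean_out = shot_outputs.mean(axis=0)                     # average over K shots
    shifted = mean_out - mean_out.max(axis=0, keepdims=True) # stable softmax
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=0, keepdims=True)             # softmax over 2 channels
    return probs[1]

shots = np.zeros((5, 2, 4, 4))
shots[:, 1, 0, 0] = 2.0      # every shot votes "foreground" at pixel (0, 0)
fg = class_foreground_map(shots)
```

Running this once per target class and stacking the N results gives the class-wise foreground maps Y used at inference.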

sy00n / DL_paper_review