[5] Per-Pixel Classification is Not All You Need for Semantic Segmentation

paper link : https://arxiv.org/abs/2107.06278

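The biggest misunderstanding I had on first reading this paper was thinking that MaskFormer is a backbone model that can do instance, semantic, and panoptic segmentation alike. After reading the follow-up work, Mask2Former, I realized that MaskFormer is a model for semantic and panoptic segmentation. Because panoptic segmentation makes its predictions at the instance level, the claim was that the model can handle both semantic-level and instance-level prediction.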

Mask Classification

Mask classification is commonly used for instance-level segmentation tasks [20, 24].
These tasks require a dynamic number of predictions, making application of per-pixel classification challenging as it assumes a static number of outputs.
Omnipresent Mask R-CNN [21] uses a global classifier to classify mask proposals for instance segmentation.
DETR [4] further incorporates a Transformer [41] design to handle thing and stuff segmentation simultaneously for panoptic segmentation [24].
However, these mask classification methods require predictions of bounding boxes, which may limit their usage in semantic segmentation. The recently proposed Max-DeepLab [42] removes the dependence on box predictions for panoptic segmentation with conditional convolutions [39, 44].
However, in addition to the main mask classification losses it requires multiple auxiliary losses.

Method

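The method performs mask classification: the image is first partitioned into N regions, so the final output is N binary masks, and each of the N segments is then classified into one of K classes. The output is therefore a set of N probability-mask pairs:

$z = \{(p_i, m_i)\}_{i=1}^{N}$

Here p_i is the class probability distribution and m_i is the binary mask. To compute the classification loss and the mask loss, the predictions have to be matched against the ground truth. Bipartite matching is used for this, and the loss is computed on the assignment with the lowest total cost against the ground truth. Since some outputs may end up unmatched, a “no object” class is also needed.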
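To make the matching step concrete, here is a minimal sketch of assigning ground-truth segments to predictions at the lowest total cost. It is only an illustration: the function name `match_predictions`, the `NO_OBJECT` marker, and the cost values are made up, and the search is brute force over permutations rather than the Hungarian algorithm that real implementations typically rely on (for example `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations

NO_OBJECT = -1  # hypothetical marker for predictions left unmatched


def match_predictions(cost):
    """Assign each ground-truth segment to a distinct prediction at minimal total cost.

    cost[i][j] is the matching cost between prediction i and ground-truth
    segment j, with N predictions and M <= N ground-truth segments.
    Brute force over permutations, so only suitable for tiny N.
    """
    n_pred = len(cost)
    n_gt = len(cost[0])
    best_total, best_assign = float("inf"), None
    # Each candidate picks n_gt distinct predictions, one per ground-truth segment.
    for pred_ids in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for j, i in enumerate(pred_ids))
        if total < best_total:
            best_total, best_assign = total, pred_ids
    # Predictions that were not chosen get the "no object" target.
    target = [NO_OBJECT] * n_pred
    for j, i in enumerate(best_assign):
        target[i] = j
    return target, best_total


# Made-up costs for N = 4 predictions and M = 2 ground-truth segments.
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
    [0.5, 0.4],
]
target, total = match_predictions(cost)
print(target, round(total, 2))
```

In this toy case predictions 0 and 1 are matched to ground-truth segments 1 and 0, and the remaining two predictions fall back to "no object".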

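For data with OOD objects pasted in, the labels y just need to be built as well: combine the COCO binary mask with the Cityscapes mask and predict K+2 classes, where one extra class is the background and the other is OOD.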

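So how should training work? The image is split into N regions. N does not have to equal K, and it helps to make it larger. (We do not know how many classes a given image contains, so we generate plenty of regions up front; a region left unmatched is one that corresponds to nothing.) With K categories, N is set larger than needed, and the surplus predictions are assigned the no-object class ∅.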

The main loss, the mask-cls loss, consists of a cross-entropy classification loss and a binary mask loss.
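A rough sketch of how the two terms could be combined once the matching is fixed. Several things here are my own simplifications: the names `mask_cls_loss`, `dice_loss`, and `NO_OBJECT` are hypothetical, the mask term is a plain soft dice loss (the paper's mask loss is, as far as I understand it, a weighted mix of focal and dice losses), and no down-weighting of the no-object term is applied.

```python
import math

NO_OBJECT = -1  # hypothetical marker: prediction matched to no ground truth


def dice_loss(pred_mask, gt_mask, eps=1.0):
    """Soft dice loss between a predicted mask in [0, 1] and a binary mask (flat lists)."""
    inter = sum(p * g for p, g in zip(pred_mask, gt_mask))
    denom = sum(pred_mask) + sum(gt_mask)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def mask_cls_loss(probs, masks, gt_classes, gt_masks, target, no_object_index):
    """Cross-entropy over all N predictions plus a mask loss over matched ones.

    probs[i]  : class probabilities of prediction i, length K + 1
    masks[i]  : predicted mask of prediction i, flattened
    target[i] : index of the matched ground-truth segment, or NO_OBJECT
    """
    total = 0.0
    for i, j in enumerate(target):
        if j == NO_OBJECT:
            # Unmatched prediction: classification term only, target is "no object".
            total += -math.log(probs[i][no_object_index])
        else:
            total += -math.log(probs[i][gt_classes[j]])
            total += dice_loss(masks[i], gt_masks[j])
    return total


# K = 2 categories, so index 2 is "no object"; N = 2 predictions, 1 ground-truth segment.
probs = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
masks = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = mask_cls_loss(probs, masks, gt_classes=[0], gt_masks=[[1, 1, 0, 0]],
                     target=[0, NO_OBJECT], no_object_index=2)
print(loss)
```

The point to notice is that every prediction contributes a classification term, while the mask term only exists for predictions matched to a real segment.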

Pixel-level module

A standard segmentation model (encoder-decoder) is used to extract the image features and the per-pixel embeddings.

Transformer module (very similar to DETR)

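For the image features F, N learnable positional embeddings are trained. The image features are flattened into vectors and fed into the decoder partway through, while the N queries, which play the role of positional embeddings, are the decoder's input. Passing through the decoder then yields N per-segment embeddings. The N queries here are learnable parameters.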
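To show the shape of what happens in the decoder, below is a heavily simplified sketch of a single cross-attention step: each of the N queries attends over the flattened image features and comes out as one embedding. This is not the actual decoder. There are no learned projections, no self-attention, no feed-forward layers, and only one head and one layer; the function names and the toy numbers are my own.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, features):
    """One single-head cross-attention step without learned projections.

    queries  : N vectors of size C (stand-ins for the learnable queries)
    features : HW vectors of size C (image features flattened over space)
    Returns N vectors of size C, one per query.
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        # Scaled dot-product score of this query against every feature vector.
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        # Attention output: weighted sum of the feature vectors.
        out.append([
            sum(w * f[c] for w, f in zip(weights, features))
            for c in range(len(q))
        ])
    return out


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # HW = 3, C = 2
queries = [[0.0, 0.0], [5.0, -5.0]]              # N = 2
segment_embeddings = cross_attention(queries, features)
print(segment_embeddings)
```

The number of outputs is fixed by the number of queries N, not by the image size, which is why N can be chosen independently of both the resolution and K.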

Segmentation module

Probability predictions

→ linear classifier + softmax function

From the per-segment embeddings, a class probability prediction is computed for each segment.
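A small sketch of the linear classifier plus softmax, assuming K categories plus one "no object" slot, so each segment gets K + 1 probabilities. The function name `classify_segments` and the weights are hypothetical and chosen by hand; in the real model they are learned.

```python
import math


def classify_segments(segment_embeddings, weight, bias):
    """Linear layer followed by softmax, applied to each per-segment embedding.

    weight has K + 1 rows (K categories plus "no object"), each of size C.
    Returns one probability vector of length K + 1 per segment.
    """
    all_probs = []
    for e in segment_embeddings:
        logits = [sum(w * x for w, x in zip(row, e)) + b
                  for row, b in zip(weight, bias)]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in logits]
        s = sum(exps)
        all_probs.append([v / s for v in exps])
    return all_probs


# K = 2 categories, C = 2 embedding dimensions; the last row is "no object".
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]
class_probs = classify_segments([[2.0, 0.0], [0.0, 0.0]], weight, bias)
print(class_probs)
```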

Because there is also the no-object category ∅, a segment assigned to it does not correspond to any region.

Mask Prediction

→ Multi-layer Perceptron with 2 hidden layers converts the per-segment embeddings to N mask embeddings

The per-segment embeddings are passed through the MLP, and the resulting mask embeddings are combined with the per-pixel embeddings by dot product to obtain N mask predictions.
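The dot-product step can be sketched as follows, assuming the mask embeddings have already come out of the MLP and that a sigmoid turns each dot product into a per-pixel mask probability. The function name `predict_masks` and the toy embeddings are made up for illustration.

```python
import math


def predict_masks(mask_embeddings, pixel_embeddings):
    """Dot product of each mask embedding with every per-pixel embedding, then sigmoid.

    mask_embeddings  : N vectors of size C
    pixel_embeddings : H x W grid of vectors of size C
    Returns N masks of shape H x W with values in (0, 1).
    """
    masks = []
    for e in mask_embeddings:
        masks.append([
            [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(e, px))))
             for px in row]
            for row in pixel_embeddings
        ])
    return masks


# A 1 x 2 image with C = 2; the two pixels point in different embedding directions.
pixel_embeddings = [[[4.0, 0.0], [0.0, 4.0]]]
mask_embeddings = [[1.0, -1.0], [-1.0, 1.0]]  # N = 2
pred_masks = predict_masks(mask_embeddings, pixel_embeddings)
print(pred_masks)
```

Each mask embedding lights up the pixels whose embeddings point in a similar direction, so the first mask covers the left pixel and the second covers the right one.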

Finally, the two are combined to produce the semantic segmentation result. The final mask is obtained with an argmax, so a segment can win a pixel only when both its class probability and its mask probability are high.
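Here is a minimal sketch of that combination for the semantic case. My understanding of the paper is that, for semantic segmentation, the product of class probability and mask probability is summed over segments for each class and the argmax is then taken over classes, ignoring "no object"; treat that as an assumption of this sketch. The function name `semantic_inference` and all numbers are made up.

```python
def semantic_inference(probs, masks, num_classes):
    """Per-pixel label = argmax over classes c of sum_i probs[i][c] * masks[i][h][w].

    probs[i] has num_classes + 1 entries; the last one ("no object") is ignored.
    masks[i] is an H x W grid of mask probabilities for segment i.
    """
    height, width = len(masks[0]), len(masks[0][0])
    labels = []
    for h in range(height):
        row = []
        for w in range(width):
            scores = [
                sum(p[c] * m[h][w] for p, m in zip(probs, masks))
                for c in range(num_classes)
            ]
            row.append(max(range(num_classes), key=scores.__getitem__))
        labels.append(row)
    return labels


# N = 3 segments, K = 2 categories, a 1 x 2 image.
probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
masks = [
    [[0.9, 0.1]],  # segment 0 mostly covers the left pixel
    [[0.2, 0.8]],  # segment 1 mostly covers the right pixel
    [[0.5, 0.5]],  # segment 2 is mostly "no object", so it barely contributes
]
labels = semantic_inference(probs, masks, num_classes=2)
print(labels)
```

Segment 2 has a mask value of 0.5 everywhere but almost all of its class probability sits on "no object", so it has little say in the result, which matches the note that both probabilities need to be high.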

sy00n / DL_paper_review