[37] SANFlow: Semantic-Aware Normalizing Flow for Anomaly Detection and Localization

Abstract

Existing NF-based methods force all features into a single distribution (the unit normal distribution), but features can carry locally distinct semantic information, so their distributions may differ.
As a result, the existing approach makes training harder and limits the network's ability to discriminate normal from abnormal.
This paper proposes transforming the feature distribution at each location of the input image into a different distribution.
- The NF learns to map the feature distribution at each location of a given image.
- To further strengthen discriminability, the abnormal data distribution is mapped to a distribution clearly different from that of normal data.

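Each location of normal data is mapped to a Gaussian with zero mean and a location-specific variance. In other words, features are embedded into locally different distributions.
The estimated variance was observed to be small for simple regions such as the background and to increase in more complex regions.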

This paper also uses data augmentation (CutPaste) to synthesize local anomalies -> the NF also sees anomaly features during training and learns a distribution distinct from that of normal features.
To make CutPaste more realistic, the borders of the extracted patches are blurred and diverse color jittering values are applied to the patches.
Patch sizes are random, which yields a variety of abnormal regions; during training, all of these patch types and normal data are used in equal proportion.
Because the input image is assumed to have locally different base distributions, a binary mask M is generated for each synthetic anomaly image so that each pixel location is known to be normal or abnormal.
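The CutPaste step and its mask M can be sketched roughly as below. This is a simplified illustration, not the paper's implementation: the function name `cutpaste_with_mask`, the grayscale list-of-lists image, and the brightness gain range (a stand-in for color jittering) are all assumptions, and border blurring is omitted.

```python
import random

def cutpaste_with_mask(image, patch_h, patch_w, rng):
    """Copy a random patch to another random location.

    image: 2D list of grayscale values (H x W).
    Returns (augmented, mask), where mask[y][x] is 1 on the pasted patch
    (synthetic anomaly) and 0 elsewhere (normal).
    """
    h, w = len(image), len(image[0])
    # Source and destination top-left corners, sampled independently.
    sy, sx = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    dy, dx = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    # Brightness gain as a crude stand-in for color jittering (assumed range).
    gain = rng.uniform(0.8, 1.2)
    patch = [[image[sy + i][sx + j] * gain for j in range(patch_w)]
             for i in range(patch_h)]
    out = [row[:] for row in image]          # leave the input untouched
    mask = [[0] * w for _ in range(h)]
    for i in range(patch_h):
        for j in range(patch_w):
            out[dy + i][dx + j] = patch[i][j]
            mask[dy + i][dx + j] = 1
    return out, mask

rng = random.Random(0)
image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
augmented, mask = cutpaste_with_mask(image, patch_h=3, patch_w=2, rng=rng)
```

Sampling `patch_h` and `patch_w` randomly per image would give the variety of abnormal region sizes mentioned above.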

A multi-scale feature pyramid is well suited to handling anomalies of various sizes, because each scale captures information about regions of the corresponding size.
A pre-trained CNN is therefore used to extract a K-level feature pyramid (K=3).

K independent NF models are used, one per level of the K-level feature pyramid, to handle the different scales.
To handle spatial information, each feature vector is concatenated with its corresponding position embedding vector.
However, this alone is not enough to produce a locally varying base distribution for a feature vector v.
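The concatenation of a feature vector with a position embedding can be illustrated as follows. The notes do not say which embedding is used, so the 2D sinusoidal form, the helper names, and `pos_dim=8` are assumptions made only for this sketch.

```python
import math

def position_embedding(y, x, dim):
    """2D sinusoidal embedding (assumed form): first half encodes y, second half x."""
    assert dim % 4 == 0, "dim must be divisible by 4"
    quarter = dim // 4
    freqs = [1.0 / (10000 ** (i / quarter)) for i in range(quarter)]
    emb = []
    for pos in (y, x):
        emb += [math.sin(pos * f) for f in freqs]
        emb += [math.cos(pos * f) for f in freqs]
    return emb

def with_position(feature, y, x, pos_dim=8):
    """Concatenate a feature vector with the embedding of its (y, x) location."""
    return list(feature) + position_embedding(y, x, pos_dim)

v = with_position([0.5, -1.2, 3.0], y=0, x=0)
```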

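The semantic-dependent base distribution is instantiated as a Gaussian distribution parameterized by its statistics (mean and variance).
A lightweight statistics predictor estimates the statistics for a given feature v. Because estimating both the mean and the variance is difficult, the authors argue that it is better to estimate only the variance.
The mean is therefore fixed to 0 for normal regions and 1 for abnormal regions. This keeps the overlap between the normal and abnormal distributions minimal -> helps the NF learn distinct distributions for normal and abnormal.
When samples are non-i.i.d., the variance is estimated from an inverse Gamma distribution. Image pixels and semantic features are non-i.i.d., so it is estimated as follows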

m is a binary mask (indicator): 0 for normal, 1 for abnormal.
For a normal location, Zn represents the base distribution; for an abnormal location, Za does.
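How the indicator m selects the base distribution can be sketched with a plain Gaussian log-density. This assumes an isotropic Gaussian with a single scalar variance per location; the function names are hypothetical. A full NF log-likelihood would also add the log-determinant of the flow's Jacobian, which is left out here.

```python
import math

def gaussian_log_density(z, mean, var):
    """log N(z; mean * 1, var * I) for a D-dimensional vector z."""
    d = len(z)
    sq_dist = sum((zi - mean) ** 2 for zi in z)
    return -0.5 * (d * math.log(2 * math.pi * var) + sq_dist / var)

def base_log_density(z, m, var):
    """Pick the base distribution with the indicator m.

    m == 0 (normal)   -> mean 0
    m == 1 (abnormal) -> mean 1
    var is the variance predicted for this location (assumed scalar).
    """
    mean = 1.0 if m == 1 else 0.0
    return gaussian_log_density(z, mean, var)

z = [0.0, 0.0]
normal_ll = base_log_density(z, m=0, var=1.0)
abnormal_ll = base_log_density(z, m=1, var=1.0)
```

A latent vector at the origin scores higher under the normal base distribution than under the abnormal one, which is the separation that fixing the means to 0 and 1 is meant to encourage.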
To compute the likelihood at the k-th scale, the binary mask M is resized to match the feature map size (using nearest-neighbor interpolation).
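Nearest-neighbor resizing keeps the mask strictly binary, which is presumably why it is used here rather than a smoothing interpolation. A minimal sketch, assuming a list-of-lists mask and a floor-based index mapping (one of several common conventions):

```python
def resize_nearest(mask, out_h, out_w):
    """Resize a 2D mask by picking the nearest source pixel (floor mapping)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

full_mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
small_mask = resize_nearest(full_mask, 2, 2)
```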
Final loss
At inference, the log-likelihood of the test image is computed for the feature at each k-th scale and then exponentiated -> likelihood map
The likelihood map is upsampled to the image resolution (bilinear interpolation)
Flipping its sign gives the anomaly score map.
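The inference steps above can be sketched as follows. The notes do not state how the K per-scale maps are combined, so summing them is an assumption here, as are the function names and the corner-aligned bilinear convention.

```python
import math

def bilinear_upsample(grid, out_h, out_w):
    """Bilinear resize with corner-aligned sampling (one common convention)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        ty = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            tx = fx - x0
            top = grid[y0][x0] * (1 - tx) + grid[y0][x1] * tx
            bottom = grid[y1][x0] * (1 - tx) + grid[y1][x1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out

def anomaly_score_map(log_likelihood_maps, out_h, out_w):
    """exp -> upsample to image size -> sum over scales (assumed) -> negate."""
    total = [[0.0] * out_w for _ in range(out_h)]
    for ll_map in log_likelihood_maps:
        likelihood = [[math.exp(v) for v in row] for row in ll_map]
        upsampled = bilinear_upsample(likelihood, out_h, out_w)
        for y in range(out_h):
            for x in range(out_w):
                total[y][x] += upsampled[y][x]
    # Low likelihood means high anomaly score, hence the sign flip.
    return [[-v for v in row] for row in total]

# Two toy scales: a 2x2 map with one low-likelihood cell, and a 1x1 map.
ll_maps = [
    [[0.0, math.log(0.5)], [0.0, 0.0]],
    [[math.log(0.25)]],
]
scores = anomaly_score_map(ll_maps, 4, 4)
```

In the toy input, the top-right corner has the lowest likelihood and therefore the highest (least negative) anomaly score.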