[39] ANOMALYCLIP: OBJECT-AGNOSTIC PROMPT LEARNING FOR ZERO-SHOT ANOMALY DETECTION

sy00n commented 7 months ago

Abstract

Zero-shot anomaly detection (ZSAD) requires a model trained on auxiliary data to detect anomalies without any training samples from the target dataset.
ZSAD matters when training data is scarce, but getting a model to generalize to anomalies across diverse domains remains challenging.
Pre-trained vision-language models (VLMs) such as CLIP show strong zero-shot performance on many tasks, including anomaly detection, but VLMs focus on the class semantics of foreground objects rather than on the abnormality/normality of the image.
The paper therefore proposes AnomalyCLIP, which adapts CLIP for accurate ZSAD across domains.
Because the model can attend to abnormal image regions rather than object semantics, it achieves generalized normality and abnormality recognition.

Introduction

ZSAD is needed when training data cannot be accessed because of data privacy policies, or when no training data related to the target domain exists.
Across application scenarios, anomalies vary substantially in visual appearance, foreground objects, and background features, so a model has to account for this variation to have strong generalization ability.
However, CLIP was trained to align the class semantics of foreground objects rather than the normality/abnormality of an image, so its ability to generalize to visual abnormality/normality is limited.
In addition, current prompting approaches, whether manually defined or learnable prompts, tend to focus on global features for effective object semantic alignment, so they miss anomalies that often show up in fine-grained, local features.

Method

Object-Agnostic Text Prompt Design

The text prompt template commonly used with CLIP (A photo of a [cls]) focuses on object semantics, so it cannot produce text embeddings that capture normal and abnormal semantics.
To learn anomaly-discriminative textual embeddings, prior anomaly semantics are therefore incorporated into the text prompt templates.
A trivial solution is to design templates with specific anomaly types (A photo of a [cls] with scratches), but anomaly patterns are generally diverse and unknown, so listing every possible anomaly type is impractical.
The paper instead uses the form damaged [cls], which covers comprehensive anomaly semantics and enables detection of diverse defects such as scratches and holes.
Even so, these text prompt templates may struggle to produce generic anomaly-discriminating textual embeddings.
- This is because CLIP's original pre-training focused on aligning object semantics rather than the normality/abnormality within an image.
- To address this limitation, learnable text prompt templates are introduced and the prompts are tuned on auxiliary AD-relevant data.
- During fine-tuning, these learnable templates pick up both broad and detailed normality/abnormality, which makes the textual embeddings more discriminative.
- This removes the need for manually defined text prompt templates, which require extensive engineering.

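The learnable templates described above, which still contain [cls], are the object-aware text prompt templates, of the form [V_1]...[V_E][cls] for normality and [W_1]...[W_E][damaged][cls] for abnormality; [V] and [W] are learnable word embeddings for normality and abnormality, respectively.
ZSAD must detect anomalies on unseen target datasets, yet these datasets differ substantially in the objects they contain (different products, or the discrepancy between industrial defects and medical images).
Despite the substantial differences in object semantics, the underlying anomaly patterns can be similar: scratches on metal nuts and plates, or misplacement in transistors and PCBs.
The paper therefore argues that the key to accurate ZSAD is learning generic anomaly patterns that do not depend on the varying semantics of different objects.
Accordingly, it argues that object-aware text prompt templates are unnecessary for ZSAD and can even hinder the detection of unseen anomalies.
Excluding object semantics from the text prompt templates instead lets the learnable templates concentrate on the characteristics of the anomalies themselves rather than on the objects.
Hence the proposed object-agnostic prompt learning: the class name is replaced with the word object, which blocks out the class semantics of the object.
This design lets the object-agnostic text prompt templates learn patterns shared across different anomalies, so the resulting textual embeddings are more generic and better capture anomalies of diverse objects in different domains.
It also transfers to other target domains without any modification (no object names or anomaly types of the target dataset are needed).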
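A minimal sketch of how the two template families differ, written as plain strings. The helper names (`learnable_slots`, `build_prompts`), the slot count, and the bracket notation are illustrative assumptions, not code from the paper; in a real model each `[V_i]` / `[W_i]` would be a trainable embedding vector rather than a string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    normal: str
    abnormal: str


def learnable_slots(prefix: str, length: int) -> str:
    """Placeholders standing in for learnable word embeddings, e.g. "[V_1][V_2]"."""
    return "".join(f"[{prefix}_{i}]" for i in range(1, length + 1))


def build_prompts(length: int, cls_name: str = "object") -> PromptPair:
    """Object-agnostic by default; pass a class name for the object-aware variant."""
    return PromptPair(
        normal=f"{learnable_slots('V', length)}[{cls_name}]",
        abnormal=f"{learnable_slots('W', length)}[damaged][{cls_name}]",
    )


agnostic = build_prompts(length=3)                  # class name replaced by "object"
aware = build_prompts(length=3, cls_name="bottle")  # needs the target class name

print(agnostic.normal)    # [V_1][V_2][V_3][object]
print(agnostic.abnormal)  # [W_1][W_2][W_3][damaged][object]
print(aware.abnormal)     # [W_1][W_2][W_3][damaged][bottle]
```

The object-agnostic pair is the same for every dataset, which is why no object names or anomaly types are needed at test time.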

Global and local context optimization

To learn the object-agnostic text prompts effectively, the paper introduces a joint optimization approach that learns normality and abnormality from both the global and the local perspective.
Global context optimization
- Helps the object-agnostic textual embeddings match the global visual embeddings, so normal/abnormal semantics are captured more effectively at the global level.
- A cross-entropy loss matches the cosine similarity between the textual embeddings and the visual embeddings of the auxiliary data.
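The global term can be sketched in NumPy as a cross-entropy over cosine similarities. This is only an illustration: the function names, the two-row `[normal, abnormal]` layout of `text_emb`, and the temperature value of 0.07 are assumptions, not details taken from the review or verified against the paper's code.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def global_loss(image_emb, text_emb, labels, temperature=0.07):
    """Cross-entropy over cosine similarities.

    image_emb: (B, D) global visual embeddings
    text_emb:  (2, D) textual embeddings, rows = [normal, abnormal]
    labels:    (B,)   0 = normal image, 1 = anomalous image
    temperature is an assumed value.
    """
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T  # (B, 2) cosine similarities
    logits = sims / temperature
    logits = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([[1.0, 0.1], [0.1, 1.0]])
good = global_loss(image_emb, text_emb, np.array([0, 1]))  # labels agree with similarities
bad = global_loss(image_emb, text_emb, np.array([1, 0]))   # labels flipped
print(good < bad)  # True
```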
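Local context optimization
- The local term, in contrast, makes the prompts attend to fine-grained, local abnormal regions; the local information is taken from the M-th intermediate layer of the vision encoder.
- S is the ground-truth segmentation mask: 1 marks an anomalous pixel, 0 a normal pixel.
- Focal and Dice losses are applied.
- Focal loss addresses the imbalance that arises because anomalous regions are usually smaller than normal ones.
- Dice loss measures the overlap between the predicted segmentation S_n/S_a and the ground-truth mask, so that the model learns an accurate decision boundary.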
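The local term can be illustrated with small NumPy versions of the two losses. This is a sketch under assumptions: the way the three terms are summed, the `gamma` value, and pairing S_n with the inverted mask are my reading of the description above, and the maps are assumed to be already upsampled to mask resolution and normalized per pixel.

```python
import numpy as np


def focal_loss(probs, target, gamma=2.0, eps=1e-7):
    """probs: (2, H, W) per-pixel [normal, anomalous] probabilities; target: (H, W) in {0, 1}."""
    p_true = np.where(target == 1, probs[1], probs[0])  # probability of the correct class
    p_true = np.clip(p_true, eps, 1.0)
    # Easy pixels (p_true near 1) are down-weighted by (1 - p_true) ** gamma.
    return float((-((1.0 - p_true) ** gamma) * np.log(p_true)).mean())


def dice_loss(pred, target, eps=1e-7):
    """1 - Dice overlap between a soft map pred (H, W) and a binary mask target (H, W)."""
    inter = (pred * target).sum()
    return float(1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps))


def local_loss(s_n, s_a, mask):
    """s_n, s_a: normal / anomalous maps; mask: ground-truth S (1 = anomalous)."""
    probs = np.stack([s_n, s_a])
    return focal_loss(probs, mask) + dice_loss(s_a, mask) + dice_loss(s_n, 1.0 - mask)


mask = np.array([[0.0, 0.0], [0.0, 1.0]])
perfect = local_loss(1.0 - mask, mask, mask)   # prediction equals the mask
inverted = local_loss(mask, 1.0 - mask, mask)  # prediction is exactly wrong
print(perfect < inverted)  # True
```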

Refinement of textual space

To learn a more discriminative textual space, additional learnable tokens are added to the text encoder, refining CLIP's original textual space.
First, randomly initialized learnable token embeddings are attached to the CLIP text encoder.
They are then concatenated with the original token embeddings along the channel dimension and fed into the CLIP text encoder.
Through the self-attention mechanism, t_{m+1} ends up containing the information of t'_m.
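A toy NumPy sketch of the token-insertion idea. Several things here are assumptions made for illustration: `toy_layer` is a stand-in for a frozen text-encoder layer, the learnable tokens are random arrays rather than trained parameters, the outputs at the learnable positions are dropped after each layer, and the concatenation is done along the token axis (axis 0), since that is the reading under which self-attention can pass information from t'_m into t_{m+1}. The note above describes a channel-dimension concat, so treat the axis choice as unverified.

```python
import numpy as np


def toy_layer(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for one frozen encoder layer: single-head self-attention, no projections."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens


def refine_text_space(t, learnable, n_layers):
    """t: (N, D) original token embeddings; learnable: list of (L, D) arrays, one per refined layer."""
    for m in range(n_layers):
        if m < len(learnable):
            n_learn = learnable[m].shape[0]
            out = toy_layer(np.concatenate([learnable[m], t], axis=0))  # insert t'_m
            t = out[n_learn:]  # keep t_{m+1}; outputs at the learnable positions are dropped
        else:
            t = toy_layer(t)
    return t


rng = np.random.default_rng(0)
t0 = rng.normal(size=(4, 8))
learnable = [rng.normal(size=(2, 8)) for _ in range(2)]  # tokens for the first two layers
refined = refine_text_space(t0, learnable, n_layers=3)
plain = refine_text_space(t0, [], n_layers=3)
print(refined.shape)
```

The output keeps the original sequence length, but its values differ from the run without learnable tokens, which is the sense in which t_{m+1} carries information from t'_m.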

Refinement of the local visual space

CLIP's visual encoder was pre-trained to align global object semantics, so when the self-attention mechanism propagates global information into local features, it can interfere with learning fine-grained abnormality.
To mitigate this, the local visual space is refined with a diagonally prominent attention map (DPAM). (The visual encoder stays frozen during training.)
The original Q-K attention is replaced with a diagonally prominent attention such as Q-Q, K-K, or V-V; of these, V-V self-attention is the one used.
As Fig 3 shows, the refined DPAM attention maps are more diagonally prominent and yield considerably better segmentation maps.
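Since the review notes that no equation is given for V-V attention, here is one plausible single-head reading in NumPy: the attention map is computed from V alone, softmax(V V^T / sqrt(d)) V, instead of softmax(Q K^T / sqrt(d)) V. The scaling factor and the single-head, projection-free setup are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def qk_attention(q, k, v):
    """Standard self-attention: softmax(Q K^T / sqrt(d)) V."""
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn


def vv_attention(v):
    """V-V variant: the attention map is built from V alone."""
    attn = softmax(v @ v.T / np.sqrt(v.shape[-1]))
    return attn @ v, attn


rng = np.random.default_rng(0)
n_tokens, dim = 6, 16
q, k, v = (rng.normal(size=(n_tokens, dim)) for _ in range(3))
v = v / np.linalg.norm(v, axis=-1, keepdims=True)  # unit-norm rows for the check below

_, attn_qk = qk_attention(q, k, v)
out_vv, attn_vv = vv_attention(v)
print(attn_vv.argmax(axis=-1))  # each token attends most to itself
```

With unit-norm value vectors, each diagonal entry of V V^T equals 1 and no off-diagonal entry can exceed it (Cauchy-Schwarz), so the V-V map is diagonally prominent by construction. A Q-K map has no such guarantee, which is one way to read why it can mix in global context at each local token.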

Training and Inference

Training uses an auxiliary AD-related dataset with the loss in eq 2.
At inference, the image-level anomaly score is the similarity score, and the pixel-level prediction merges the anomaly seg map and the normal seg map via interpolation and smoothing.
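A rough NumPy sketch of the two inference outputs. The details are assumptions chosen to keep the example self-contained: nearest-neighbour upsampling stands in for the interpolation step, a small separable Gaussian stands in for the smoothing, the merge is taken as the average of (1 - S_n) and S_a, and the temperature is an arbitrary value.

```python
import numpy as np


def image_score(image_emb, text_emb, temperature=0.07):
    """Probability given to the abnormal prompt; text_emb rows = [normal, abnormal]."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = txt @ img / temperature
    e = np.exp(logits - logits.max())
    return float(e[1] / e.sum())


def upsample_nearest(m, factor):
    return np.repeat(np.repeat(m, factor, axis=0), factor, axis=1)


def gaussian_smooth(m, sigma=1.0, radius=2):
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-(xs ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(m, radius, mode="edge")
    # Separable filter: smooth along rows, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def pixel_map(s_n, s_a, factor=4, sigma=1.0):
    """Merge the normal and anomaly maps into one anomaly map at image resolution."""
    merged = 0.5 * ((1.0 - upsample_nearest(s_n, factor)) + upsample_nearest(s_a, factor))
    return gaussian_smooth(merged, sigma)


s_a = np.zeros((2, 2))
s_a[0, 0] = 1.0                 # one anomalous patch in the top-left corner
amap = pixel_map(1.0 - s_a, s_a)
text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
print(amap.shape, image_score(np.array([0.1, 1.0]), text_emb) > 0.5)  # (8, 8) True
```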

Experiments

Datasets and Evaluation Metrics

Dataset: experiments on 17 public datasets.
- Industrial inspection: MVTecAD, ViSA, MPDD, BTAD, SDD, DAGM, DTD-Synthetic
- Medical imaging: cancer detection dataset ISBI; colon polyp detection datasets CVC-ClinicDB, CVC-ColonDB, Kvasir, Endo; thyroid nodule detection dataset TN3k; brain tumor detection datasets HeadCT, BrainMRI, Br35H; COVID-19 detection dataset COVID-19
Compared methods: CLIP, CLIP-AC, WinCLIP, VAND, CoOp
Metric: AUROC, AP, AUPRO
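For reference, AUROC can be computed directly from its pairwise definition: the probability that a randomly chosen anomalous sample gets a higher score than a randomly chosen normal one. The snippet below is a small illustrative implementation, not the evaluation code used in the paper; the same function applies to image-level scores or to flattened pixel-level maps.

```python
import numpy as np


def auroc(scores, labels):
    """Fraction of (anomalous, normal) pairs ranked correctly; ties count as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]  # all pairwise score differences
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```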

Main Results

Methods that use manually defined text prompts, namely WinCLIP and VAND, outperform plain CLIP.
CoOp is argued to perform poorly on anomaly segmentation because it focuses only on global features and ignores fine-grained local anomaly semantics.

In the medical domain, AnomalyCLIP and VAND show promising ZSAD results even though they were tuned on a defect detection dataset.
Per fig 4, the paper claims AnomalyCLIP also localizes anomalies better than WinCLIP and VAND.

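Performance on medical data is lower than on the industrial datasets, which is attributed to the impact of the auxiliary data used for training.
The authors therefore also train with medical images as the auxiliary data and compare ZSAD performance.
Because no images with pixel-level annotation were available for this, they built such a dataset themselves based on ColonDB.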

object-agnostic vs object-aware prompt learning

Object-agnostic prompts gave better results at both the image level and the pixel level.

Ablation

DPAM (T1), object-agnostic text prompts (T2), adding learnable tokens in text encoders (T3), multi-layer visual encoder features (T4)

Conclusion

ZSAD remains a challenging problem when no training data is available for the target dataset.
The paper proposes AnomalyCLIP to improve CLIP's generalization for ZSAD.
Object-agnostic prompt learning makes it possible to learn normality/abnormality for generalized ZSAD on image datasets with diverse foreground objects.
Joint global and local context optimization is performed to incorporate both global and local anomaly semantics.
Strong performance is demonstrated on 17 public datasets.

sy00n commented 7 months ago

Question 1: Is there an ablation on the channel-wise concat used when designing the text prompts? There is none in the main text, at least.

sy00n commented 7 months ago

Question 2: Why does V-V attention work well, and how exactly is it applied? There is not a single equation for it -> this can be checked in the revision.

sy00n commented 7 months ago

Question 3: This paper did not propose V-V attention, so is it okay to have no reference for it...?
