[40] AnoVL: Adapting Vision-Language Models for Unified Zero-shot Anomaly Localization

Key ideas: Training-free Adaptation (TFA), unified domain-aware contrastive state prompting template, Test-Time Adaptation (TTA)

Method

Training-free Adaptation(TFA)

Introduces local-aware token computation via value-to-value attention
- Because CLIP is trained by comparing images and texts at the global level, its Q-K-V attention mechanism makes the patch tokens represent global features.
- To extract informative local features from the visual transformer, the paper modifies the attention so that local patch tokens are obtained without any additional training
- The original Q-K-V attention block consists of LN, the Q-K-V projection layers, an output projection layer, and an MLP
- The query and key play an important role in relating context, and prior work observed that Q-K retrieval in the last layer acts like global average pooling (GAP) for a global visual description.
- To further improve the specific locality of the patch tokens, a novel Value-to-Value (V-V) self-attention with a residual link at every layer is proposed.
- With l denoting the layer index of the visual transformer, each layer is adapted as in Eq. 7 of the paper
- Unlike the original Q-K-V attention, V-V attention replaces the queries and keys with the values and removes the top MLP layer.
- To compute the local anomaly score, the patch tokens of the last visual transformer layer are fed into Eq. 2 of the paper.
- This V-V attention is training-free yet effective for ZSAD.
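
The V-V attention described above can be sketched in NumPy as follows. This is a minimal single-head illustration, not the authors' implementation: the function name `vv_attention_block`, the weight names `w_v` and `w_proj`, the random weights, and the exact placement of LN and the residual link are assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def vv_attention_block(z, w_v, w_proj):
    """One single-head V-V attention block without the top MLP.

    z:      (N, d) patch tokens from the previous layer
    w_v:    (d, d) frozen value projection (hypothetical name)
    w_proj: (d, d) frozen output projection (hypothetical name)
    """
    v = layer_norm(z) @ w_v
    # The values stand in for both the queries and the keys.
    attn = softmax(v @ v.T / np.sqrt(v.shape[-1]))
    # Residual link to the previous layer's tokens; no MLP on top.
    return z + (attn @ v) @ w_proj, attn

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))        # 16 patch tokens of width 8 (toy sizes)
w_v = rng.normal(size=(8, 8)) * 0.1      # placeholder for pretrained weights
w_proj = rng.normal(size=(8, 8)) * 0.1   # placeholder for pretrained weights
out, attn = vv_attention_block(tokens, w_v, w_proj)
```

No parameter is updated here, which is what makes the adaptation training-free: the block only reuses the existing value and output projections.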

Domain-aware State Prompting

Unlike traditional visual models, the predictions of a vision-language model are strongly influenced by the semantic text.
Anomaly detection in particular is a fine-grained visual recognition task, so precise, specific prompts matter even more.
Designing a perfect task-specific prompt is nearly impossible and requires extensive effort, but a well-designed template can generate enough prompts to cover the task-related concepts.
Prompt engineering is therefore divided into three parts: 1) base prompts 2) contrastive-state prompts 3) domain-aware prompts

Base Prompts
- The default prompts.
- Experiments show that ensembling these base prompts improves zero-shot performance -> prompt engineering plays an important role in CLIP zero-shot transfer
- In particular, ensembling multiple prompts is argued to strengthen the robustness of the text-guided classifier
- "a photo of a [class]"
- "a cropped photo of a [class]"
- "a bright photo of a [class]"
Contrastive-state Prompts
- Emphasize the antagonistic concepts of the normal and abnormal states.
- For S_AD and S_AL, the normalized anomaly score is computed from a pair of antagonistic state tokens
- Designed with opposing state words ("perfect" vs "imperfect" / "with flaw" vs "without flaw")
- Since anomalies are assumed to be unknown, common state words such as "broken" and "imperfect" are used.
- When the specific anomaly is known, e.g. the defect type is known to be "hole", the contrastive-state prompt is designed as "with a hole" vs "without a hole"
Domain-aware (DA) Prompts
- Proposed to bridge the domain gap between CLIP and downstream tasks.
- Images in fine-grained vision tasks, such as the industrial images used in visual inspection, follow a specific distribution
- To align these visual tokens with the text token distribution, domain-aware prompt engineering is proposed to adapt to the specific domain.
- Designed with phrases such as "industrial photo" and "textual photo"
- In contrast to domain-agnostic prompts, domain-aware prompt engineering can remove the distribution shift in a non-parametric manner.
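
As a rough illustration of how the three prompt types might be combined, the sketch below crosses base templates, state words, and a domain word to produce normal and abnormal prompt lists. The helper names (`with_article`, `build_prompts`), the exact template strings, and the example class "bottle" are assumptions for this example; only the individual words and phrases come from the notes above.

```python
from itertools import product

# Base templates with slots for the domain word and the state-qualified object.
BASE_TEMPLATES = ["{} photo of {}", "cropped {} photo of {}", "bright {} photo of {}"]
# Contrastive state words for unknown anomalies.
NORMAL_STATES = ["perfect {}", "{} without flaw"]
ABNORMAL_STATES = ["imperfect {}", "broken {}", "{} with flaw"]

def with_article(phrase):
    """Prefix 'a' or 'an' depending on the first letter (simple heuristic)."""
    return ("an " if phrase[0].lower() in "aeiou" else "a ") + phrase

def build_prompts(class_name, states, domain="industrial"):
    """Cross every base template with every state phrase for one class."""
    prompts = []
    for template, state in product(BASE_TEMPLATES, states):
        obj = with_article(state.format(class_name))
        prompts.append(with_article(template.format(domain, obj)))
    return prompts

normal_prompts = build_prompts("bottle", NORMAL_STATES)
abnormal_prompts = build_prompts("bottle", ABNORMAL_STATES)
```

For a known defect type, the state lists could be swapped for pairs such as "{} with a hole" and "{} without a hole" without touching the rest of the code.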

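Finally, the unified template combines the three prompt types above.
Once the text tokens for the normal state and for the abnormal state are each produced from their prompt list, the averaged tokens are computed and used to perform AD and AL.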
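
The averaging step and the normalized anomaly score can be sketched as below, assuming a two-way softmax over cosine similarities. The random arrays stand in for real text-encoder and patch-token outputs, and the function name `anomaly_scores` and the temperature value are hypothetical choices for this example, not values taken from the paper.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def anomaly_scores(patch_tokens, normal_emb, abnormal_emb, temperature=0.07):
    """Per-patch probability of the abnormal state.

    patch_tokens: (N, d) visual patch tokens
    normal_emb:   (Pn, d) text embeddings of the normal prompts
    abnormal_emb: (Pa, d) text embeddings of the abnormal prompts
    """
    # Average each prompt list into a single text token, then renormalize.
    t_normal = l2_normalize(l2_normalize(normal_emb).mean(axis=0))
    t_abnormal = l2_normalize(l2_normalize(abnormal_emb).mean(axis=0))
    text = np.stack([t_normal, t_abnormal])                      # (2, d)
    logits = l2_normalize(patch_tokens) @ text.T / temperature   # (N, 2)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs[:, 1]  # softmax weight of the abnormal token

rng = np.random.default_rng(0)
normal_emb = rng.normal(size=(6, 8))     # placeholder text embeddings
abnormal_emb = rng.normal(size=(9, 8))   # placeholder text embeddings
patches = rng.normal(size=(16, 8))       # placeholder patch tokens
scores = anomaly_scores(patches, normal_emb, abnormal_emb)
```

Reshaping `scores` to the patch grid and upsampling it would give a localization map.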

Test-time Adaptation(TTA)

As in Fig. 3, a non-linear residual-like adapter is added.
It tailors the patch tokens to the specific query image for vision-language alignment.
Its parameter w is updated in a self-supervised manner.
Existing TTA methods mostly rely on data augmentation or multiple forward passes, which is too time-consuming for real-time anomaly localization.
Therefore, the visual tokens are perturbed directly.
Noise-corrupted tokens are synthesized from the adapted patch tokens of the query (Gaussian noise).
w is optimized through two self-supervised discriminative tasks:
1) discriminating between the original and the noise-corrupted tokens (CE loss)

2) anomaly localization: P is used as the pseudo label to encourage the adapter

The final loss is the sum of the two losses.
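
The forward computation of such a two-task loss might look like the sketch below. Several pieces are assumptions made only for illustration and are not confirmed by these notes: the adapter form `z + relu(z @ w)`, the linear discrimination head `u`, the noise scale `sigma`, the temperature, and the use of binary cross-entropy for both tasks.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(prob, target, eps=1e-7):
    """Binary cross-entropy averaged over all tokens."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))

def adapter(z, w):
    """Hypothetical non-linear residual-like adapter."""
    return z + np.maximum(z @ w, 0.0)

def tta_loss(z, w, u, t_normal, t_abnormal, pseudo_label, rng,
             sigma=0.1, temperature=0.07):
    adapted = adapter(z, w)
    # Perturb the tokens directly with Gaussian noise (no image augmentation).
    corrupted = adapted + sigma * rng.normal(size=adapted.shape)
    # Task 1: tell original tokens (label 0) from corrupted ones (label 1).
    tokens = np.concatenate([adapted, corrupted])
    is_corrupted = np.concatenate([np.zeros(len(z)), np.ones(len(z))])
    loss_disc = bce(sigmoid(tokens @ u), is_corrupted)
    # Task 2: localization against the pseudo label P.
    a = l2_normalize(adapted)
    logit = (a @ t_abnormal - a @ t_normal) / temperature
    loss_loc = bce(sigmoid(logit), pseudo_label)
    return loss_disc + loss_loc  # final loss: sum of the two losses

rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))                      # placeholder patch tokens
w = np.zeros((8, 8))                              # adapter starts as identity
u = rng.normal(size=8) * 0.1                      # hypothetical discrimination head
t_normal = l2_normalize(rng.normal(size=8))       # placeholder averaged text tokens
t_abnormal = l2_normalize(rng.normal(size=8))
pseudo_label = (rng.random(16) > 0.5).astype(float)  # stand-in for P
loss = tta_loss(z, w, u, t_normal, t_abnormal, pseudo_label, rng)
```

In practice w would be updated by gradient descent on this loss with an autodiff framework; the NumPy version only shows how the two terms are assembled from a single forward pass.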
