[arXiv 2021] Cross Modal Retrieval with Querybank Normalisation

uhhyunjoo commented 2 years ago

link
paper	Cross Modal Retrieval with Querybank Normalisation
code	papers with code
etc	official web site

uhhyunjoo commented 2 years ago

Abstract

Thanks to (1) large-scale training datasets, (2) advances in neural architecture design, and (3) efficient inference,
joint embeddings have become the dominant approach to cross-modal retrieval.
Nevertheless, the hubness problem remains: a small number of gallery embeddings form the nearest neighbours of many queries.
Inspired by the NLP literature, the paper proposes Querybank Normalisation (QB-Norm), which is simple and effective.
QB-Norm is a framework that re-normalises query similarities to account for hubs in the embedding space.
It improves retrieval performance without retraining.
Unlike prior work, QB-Norm works effectively without concurrent access to any test set queries.
Within the QB-Norm framework, the paper also proposes a new similarity normalisation method, the Dynamic Inverted Softmax.
It is significantly more robust than existing methods.
Across a range of cross modal retrieval models and benchmarks, it improves on the state of the art.

uhhyunjoo commented 2 years ago

Introduction

Cross Modal Retrieval

Using a query in one modality to search a gallery of samples in another modality.
That is, using cross modal embeddings to find images, audio, or videos with natural language queries.
The dominant cross modal embedding paradigm: employing deep neural networks that project modality-specific samples into a high-dimensional, real-valued vector space in which they can be directly compared via an appropriate distance metric.
A key challenge, intrinsic to such high-dimensional spaces, is the emergence of "hubs" (paper).
Hubs: embedding vectors that appear amongst the nearest neighbour sets of disproportionately many other embedding vectors.
Section 3.2 and Fig. 2 show that hubness is prevalent across retrieval methods.
Left unaddressed, hubs cause a significant degradation in search ranking at retrieval time.
Prior work has tried several methods to address this; the paper shows how each of these methods can be interpreted under a single unifying conceptual framework, QB-Norm.
At inference time, QB-Norm uses a querybank of samples to reduce the influence of gallery items that act as hubs.

Two challenges

The paper finds that existing methods face two challenges.
1. They only work well when multiple test queries are available, an assumption that is impractical for real-world retrieval systems (who types several search queries at once?).
2. They are sensitive to the choice of querybank, and certain querybank choices can actually harm performance.
To address challenge 1 -> experiments show that QB-Norm does not need concurrent access to test queries to be effective.
To address challenge 2 -> the paper proposes a new normalisation method, the Dynamic Inverted Softmax (DIS), which operates as one module inside the QB-Norm framework.

contributions

Shows that hubness in cross modal embeddings is a significant concern for retrieval.
Proposes QB-Norm: a simple non-parametric framework that yields retrieval performance gains for a model without fine-tuning.
Shows that Querybank Normalisation is effective even with no access to test queries other than the current query.
Proposes the Dynamic Inverted Softmax: a novel normalisation method for QB-Norm that is more robust than prior work.
Shows that QB-Norm is highly effective across a broad range of tasks, models, and benchmarks.

uhhyunjoo commented 2 years ago

Task Definition

Given: a query q in modality m_q and a gallery of samples in modality m_g.
Goal of cross modal retrieval: rank the gallery samples according to how well they match the query.
Cross modal embeddings: a learned pair of encoders maps q and g from their respective modalities into a shared real-valued embedding space (R^C), where the encodings lie close together when q and g are similar.
Training data T: the embeddings are learned from corresponding pairs of query and gallery samples {(q_i, g_i)}_{i=1}^{T}.
The paper focuses on cross modal retrieval tasks that use natural language queries.
Reason 1: these tasks have received limited attention with respect to hubness mitigation.
Reason 2: hubness has been shown to be prevalent in embeddings with high intrinsic dimensionality.
Since natural language queries express more complex concepts than individual words, they are expected to exhibit greater intrinsic dimensionality, and therefore have the potential to benefit more from hubness mitigation.
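
As a rough illustration of the setup above, the sketch below ranks a toy gallery against one query. It assumes the encoders have already produced the embeddings and that cosine similarity is the comparison metric (the notes only say "an appropriate distance metric"); `rank_gallery` and all numeric values are hypothetical.

```python
import numpy as np

def rank_gallery(query_emb, gallery_embs):
    """Return (gallery indices sorted best-first, similarity per gallery item)."""
    # L2-normalise so that the dot product equals cosine similarity
    # (cosine is an assumed choice of metric).
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q  # shape: (num_gallery,)
    order = np.argsort(-sims, kind="stable")
    return order, sims

# Toy embeddings in R^C with C = 3 (made-up values).
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])

order, sims = rank_gallery(query, gallery)
print(order)  # gallery indices, best match first
```

The similarities `sims` play the role of s_q(j) in the later sections: they are the quantities that QB-Norm re-normalises before ranking.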

uhhyunjoo commented 2 years ago

Motivation

It has long been observed that high-dimensional embedding spaces tend to exhibit hubness.
- Reference: Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data
Hubness: a small proportion of samples appear disproportionately often in the k-nearest neighbour sets of all embeddings.
This harms the results of a retrieval system.
To illustrate the problem, the paper considers video retrieval with natural language queries.
For MSR-VTT text-video retrieval, it plots the distribution of how often each gallery video is retrieved.
- Methods: CE, TT-CE+, MMT, CLIP2Video (state of the art at the time)
This gives striking evidence of hubness: a small number of videos were retrieved extremely often, while other videos were never retrieved at all.
The phenomenon was not limited to one retrieval model, which suggests that it is not resolved by the multiple video modalities, attention mechanisms, and large-scale pretraining that these approaches implement in various combinations.
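
The retrieval-count distribution described above can be sketched as follows. This is a minimal illustration and not the paper's plotting code: `retrieval_counts` is a hypothetical helper, and the similarity matrix is made up so that one gallery item behaves like a hub.

```python
import numpy as np

def retrieval_counts(sims, k=1):
    """Count how often each gallery item appears in the top-k of any query.

    sims: array of shape (num_queries, num_gallery); higher means more similar.
    """
    num_gallery = sims.shape[1]
    topk = np.argsort(-sims, axis=1, kind="stable")[:, :k]
    return np.bincount(topk.ravel(), minlength=num_gallery)

# Four queries, three gallery items (made-up similarities).
sims = np.array([[0.9, 0.2, 0.1],
                 [0.8, 0.3, 0.7],
                 [0.6, 0.5, 0.4],
                 [0.7, 0.1, 0.65]])

counts = retrieval_counts(sims, k=1)
print(counts)  # gallery item 0 is the top-1 result for every query
```

A heavily skewed count vector, with a few large entries and many zeros, is the pattern the notes describe for MSR-VTT.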

uhhyunjoo commented 2 years ago

Querybank Normalisation

Hubness effects have been studied in several domains: zero-shot learning, NLP, biomedical statistics, music retrieval, and others.
To clarify the relationships between existing approaches, the paper casts them into the Querybank Normalisation framework.
The Querybank Normalisation framework (QB-Norm) has two components:
- querybank construction
- similarity normalisation

1. Querybank construction

To mitigate hubness in the cross modal embedding space, the similarities between embeddings are altered in a way that minimises the influence of hubs.
Construct a querybank of N samples from the query modality m_q: B = {b_1, ... , b_N}
This querybank acts as a probe that measures the hubness of the gallery samples.
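
One way to picture the probe is as a precomputed matrix of similarities between every querybank sample and every gallery item. The sketch below assumes cosine similarity and already-encoded embeddings; `build_probe_matrix` and the toy vectors are hypothetical.

```python
import numpy as np

def build_probe_matrix(querybank_embs, gallery_embs):
    """Similarity of every querybank sample b_i to every gallery item g_j.

    Returns an array of shape (N, num_gallery); column j collects the
    similarities of all N querybank samples to gallery item j.
    """
    b = querybank_embs / np.linalg.norm(querybank_embs, axis=1, keepdims=True)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return b @ g.T

# N = 3 querybank samples and 2 gallery items in a 2-D space (made-up values).
querybank = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])
gallery = np.array([[1.0, 0.0],
                    [1.0, 1.0]])

probe = build_probe_matrix(querybank, gallery)
print(probe.shape)  # (3, 2)
```

Because the matrix depends only on the querybank and the gallery, it can be computed once ahead of time and reused for every incoming query.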

2. Similarity normalisation

Design choices

The Querybank Normalisation framework allows different choices for both querybank construction and similarity normalisation.
First, three techniques proposed in the NLP field for hubness mitigation are cast into the framework.
Then the Dynamic Inverted Softmax proposed in the paper is introduced.

Globally-Corrected (GC) retrieval
- An approach introduced for bilingual translation and zero-shot learning tasks.
- It can be expressed as a querybank construction choice: the querybank is constructed from Q, the full set of test queries.
- For the bilingual translation task, the authors supplement their querybank with an additional randomly sampled collection of instances from m_q, which improves performance.
- The normalised similarity between query [latex]q[/latex] and gallery vector [latex]g_j[/latex] is [latex] n_q(j) = -(Rank(s_q(j), p_j) - s_q(j)) \in \mathbb{R} [/latex]. Here [latex] Rank : \mathbb{R} \times \mathbb{R}^N \to \{0, ... , N\} [/latex] returns the rank of its first argument relative to the array of elements in its second argument.
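
The GC formula above can be sketched in a few lines. Two things here are assumptions that the notes do not spell out: p_j is taken to be the vector of similarities between gallery item g_j and the N querybank samples, and Rank is taken to count how many entries of p_j are strictly greater than s_q(j). `gc_normalise` and the numbers are hypothetical.

```python
import numpy as np

def gc_normalise(sims_q, probe):
    """Globally-Corrected similarity n_q(j) = -(Rank(s_q(j), p_j) - s_q(j)).

    sims_q: shape (num_gallery,), unnormalised similarities s_q(j).
    probe:  shape (N, num_gallery); column j is assumed to be p_j.
    """
    # Assumed rank convention: number of querybank samples that are
    # strictly more similar to g_j than the current query is.
    ranks = (probe > sims_q[None, :]).sum(axis=0)
    return -(ranks - sims_q)

# N = 3 querybank samples, 2 gallery items (made-up similarities).
probe = np.array([[0.9, 0.2],
                  [0.8, 0.3],
                  [0.7, 0.1]])
sims_q = np.array([0.75, 0.5])

normed = gc_normalise(sims_q, probe)
print(np.argmax(sims_q), np.argmax(normed))
```

In this toy case gallery item 0 is similar to everything in the querybank, so its rank term is large and the query's top result moves to item 1. The raw similarity only breaks ties between items with equal rank.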
Cross-Domain Similarity Local Scaling (CSLS)
Inverted Softmax (IS)
Dynamic Inverted Softmax (DIS)
- Through experiments (described in Sec. 4), the authors identified an important practical issue.
- If the querybank does not adequately cover the space containing the gallery, performance degrades.
- It can even fall below the performance of the unnormalised similarities.
- To address this issue, a gallery activation set is precomputed in addition to the querybank probe matrix.
- [latex] A = \{ j : j \in \overset{k}{\arg\max_l} s(b_i, g_l) , i \in \{1, ... , N\} \} [/latex]
- [latex] \overset{k}{\arg\max_l} f(l) [/latex] : the k-max select operator, which returns the k values of l that maximise f(l) (k is a hyperparameter and l ranges over gallery indices).
- Intuitively, the set A contains the indices of the gallery vectors that the querybank probe identifies as potential hubs.
- That is, the Dynamic Inverted Softmax activates the inverted softmax only for queries whose nearest neighbour retrieval falls in A.

Since everything in step 1 is already computed, the only additional computation needed in step 2 beyond IS is the argmax operation.
Fortunately, this computation can be performed with no loss of precision, even at the scale of billions of gallery samples.
Sec. 4 shows that DIS is more robust than GC, CSLS, and IS. In particular, performance is not harmed even under suboptimal query selection.
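
The DIS procedure described above might look like the following sketch. The notes do not write out the inverted softmax itself, so the form exp(beta * s_q(j)) / sum_i exp(beta * s(b_i, g_j)) is an assumption here, as are the function names, the temperature `beta=20.0`, and all numeric values.

```python
import numpy as np

def gallery_activation_set(probe, k=1):
    """Gallery indices that are in the top-k of at least one querybank sample.

    probe: shape (N, num_gallery), similarities s(b_i, g_l).
    """
    topk = np.argsort(-probe, axis=1, kind="stable")[:, :k]
    return set(topk.ravel().tolist())

def dynamic_inverted_softmax(sims_q, probe, activation_set, beta=20.0):
    """Apply inverted softmax only if the query's top-1 gallery item is in A."""
    top1 = int(np.argmax(sims_q))
    if top1 not in activation_set:
        return sims_q  # not a suspected hub: keep unnormalised similarities
    # Assumed inverted-softmax form: normalise each gallery item's score
    # by its total (exponentiated) similarity to the querybank.
    denom = np.exp(beta * probe).sum(axis=0)
    return np.exp(beta * sims_q) / denom

# N = 3 querybank samples, 3 gallery items; item 0 acts as a hub (made-up values).
probe = np.array([[0.9, 0.2, 0.1],
                  [0.8, 0.3, 0.2],
                  [0.7, 0.1, 0.3]])
A = gallery_activation_set(probe, k=1)

query_a = np.array([0.75, 0.70, 0.10])  # top-1 is the suspected hub
query_b = np.array([0.10, 0.20, 0.90])  # top-1 is not in A

normed_a = dynamic_inverted_softmax(query_a, probe, A)
normed_b = dynamic_inverted_softmax(query_b, probe, A)
print(A, np.argmax(normed_a), np.argmax(normed_b))
```

For `query_a` the hub is down-weighted and the top result changes, while `query_b` is returned untouched. Leaving such queries alone is what protects performance when the querybank covers the gallery poorly.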

uhhyunjoo commented 2 years ago

Experiments

Datasets and metrics
An experiment supporting the main claim: QB-Norm is effective without concurrent access to more than one test query.
An experiment investigating the influence of querybank size.
An experiment comparing DIS against prior methods.
An ablation study of the QB-Norm components.
Experiments showing generality by applying QB-Norm to several models, tasks, and datasets.

uhhyunjoo commented 2 years ago

Datasets

text-video retrieval : MSR-VTT, MSVD, DiDeMo, LSMDC, VaTeX, QuerYD
text-image retrieval : MSCoCo
text-audio retrieval : AudioCaps
image-to-image retrieval : CUB-200-2011, Stanford Online Products
See the supplementary material for a description of each dataset.

Evaluation Metrics
R@K : recall at rank K
MdR : median rank
For each study, the mean and standard deviation over three randomly seeded runs are reported.
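
The two metrics can be computed from a query-by-gallery similarity matrix as in the sketch below. It assumes, for illustration only, that the ground-truth match for query i is gallery item i and that ties are resolved in favour of the ground truth; `retrieval_metrics` and the numbers are hypothetical.

```python
import numpy as np

def retrieval_metrics(sims, ks=(1, 5, 10)):
    """Compute R@K (in percent) and MdR from a similarity matrix.

    sims[i, j] is the similarity of query i to gallery item j; the correct
    match for query i is assumed to be gallery item i.
    """
    num_queries = sims.shape[0]
    idx = np.arange(num_queries)
    gt = sims[idx, idx]
    # 1-based rank of the ground truth: one plus the number of gallery
    # items scored strictly higher than it.
    ranks = 1 + (sims > gt[:, None]).sum(axis=1)
    metrics = {f"R@{k}": float(100.0 * np.mean(ranks <= k)) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    return metrics

# Three queries, three gallery items (made-up similarities).
sims = np.array([[0.9, 0.1, 0.2],
                 [0.8, 0.3, 0.5],
                 [0.1, 0.2, 0.7]])

m = retrieval_metrics(sims, ks=(1, 3))
print(m)
```

Higher R@K and lower MdR are better. To mirror the reporting described above, one would run this once per seed and take the mean and standard deviation of each entry.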

uhhyunjoo / paper-notes

[arXiv 2021] Cross Modal Retrieval with Querybank Normalisation #11
