Open m-bain opened 2 years ago
Train on MSR-VTT 7k; Text -> Video R@1 / Video -> Text R@1 on the MSR-VTT 1kA test set are reported: https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka
bs is the validation batch size; it is also the dimension of sim_matrix in the code.
https://github.com/starmemda/CAMoE/blob/390e5ab47db13cba9e9b37934d42665f47e155cf/DSL.py#L9
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)
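For concreteness, the linked re-weighting line can be run standalone; here is a minimal sketch (the toy 4x4 similarity matrix is made up for illustration):

```python
import torch
import torch.nn.functional as F

# Toy text-video similarity matrix: rows = texts, cols = videos.
# Values are made up; the i-th text is assumed to match the i-th video.
sim_matrix_ = torch.tensor([
    [0.9, 0.2, 0.1, 0.3],
    [0.1, 0.8, 0.2, 0.2],
    [0.3, 0.1, 0.7, 0.2],
    [0.2, 0.3, 0.2, 0.6],
])

# DSL as in the linked code: re-weight each t2v score by a prior
# computed as a softmax over dim=0 (i.e. over all texts for each video).
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)

print(sim_matrix.shape)  # torch.Size([4, 4])
```

Note the prior is a function of the whole column, which is exactly why all queries must be available at once.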
baseline (no DSL, bs=16): 42.4 / 43.5
baseline + DSL (bs=16): 42.3 / 41.8
baseline + DSL (bs=64): 40.8 / 39.4
baseline + DSL (bs=128): 42.0 / 39.9
baseline + DSL (bs=384): 43.0 / 39.8
baseline + DSL (bs=1000): 48.1 / 43.5
It is clear that DSL leverages the mapping between text-video pairs, and it needs all (or most) of the pairs at once; otherwise the re-weighting does not work.
The above code led to a low Video -> Text R@1, so I modified the code according to the paper, i.e., I also re-weight the v2t side:
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)
sim_matrix = sim_matrix * F.softmax(sim_matrix_, dim=1) * len(sim_matrix_)
baseline + DSL (bs=1000): 45.7 / 46.3
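The two-sided variant above can be packaged as a helper; this is a sketch of my reading of it (the function name is mine, and both priors are computed from the original, un-reweighted matrix, as in the two lines above):

```python
import torch
import torch.nn.functional as F

def dual_softmax_both_sides(sim: torch.Tensor) -> torch.Tensor:
    """Re-weight a square [B, B] text-video similarity matrix in both
    directions: dim=0 is a softmax over texts (v2t prior), dim=1 is a
    softmax over videos (t2v prior). Both priors come from the raw sim."""
    out = sim * F.softmax(sim, dim=0) * len(sim)
    out = out * F.softmax(sim, dim=1) * len(sim)
    return out
```

Both multiplications use `F.softmax(sim, ...)` on the raw matrix, not the intermediate result, matching the snippet above.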
Next, only the v2t side (the paper's Figure 1 does seem to show two separate re-weighted matrices, though):
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=1) * len(sim_matrix_)
baseline + DSL (bs=1000): 42.4 / 45.5
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)
During training, DSL is applied within each mini-batch:
baseline (no DSL, bs=16): 0.3 / 37.6
baseline + DSL (bs=16): 36.5 / 35.1
baseline + DSL (bs=1000): 35.9 / 37.6
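For reference, here is a sketch of how DSL can be applied inside the training loss over a mini-batch; the symmetric cross-entropy formulation is my assumption (CLIP-style training), not something confirmed by the released code:

```python
import torch
import torch.nn.functional as F

def dsl_contrastive_loss(sim: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss with the DSL prior applied in-batch.

    sim: [B, B] text-video similarities for one mini-batch, where the
    i-th text is assumed to match the i-th video (the diagonal).
    """
    b = sim.size(0)
    labels = torch.arange(b, device=sim.device)
    # In-batch DSL prior (softmax over texts for each video).
    sim_dsl = sim * F.softmax(sim, dim=0) * b
    loss_t2v = F.cross_entropy(sim_dsl, labels)
    loss_v2t = F.cross_entropy(sim_dsl.t(), labels)
    return (loss_t2v + loss_v2t) / 2
```

Inside a training batch the one-to-one assumption actually holds by construction, which is why training-time DSL is uncontroversial compared to inference-time DSL.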
Well, I think my reproduction is problematic.
When only using the normal t2v score, we are selecting the best videos that match the input text. If we use the v2t score to re-weight the t2v score, we are selecting the text-video pairs that produce the highest mutual similarity. The latter is arguably stricter. It can also be seen as a re-ranking of the retrieved videos according to the v2t score, or as an ensemble of the t2v and v2t results.
The limitation is clear in this issue, as @m-bain said: it requires a one-to-one mapping between text-video pairs, which does not hold in most real cases.
Strongly recommend this paper: Cross Modal Retrieval with Querybank Normalisation https://arxiv.org/abs/2112.12777. DSL is similar to the "Inverted Softmax" in Section 3.4 of that paper. In my view, DSL is a solution to the hubness problem, but it may not be suitable for real-world cases.
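For comparison, a rough sketch of the inverted-softmax idea using a fixed query bank instead of the test queries (the function name, the temperature `beta`, and the precomputed-similarity interface are my assumptions; see the QB-Norm paper for the exact formulation and its activation-set variant):

```python
import torch

def inverted_softmax(sim_qg: torch.Tensor,
                     sim_bank_g: torch.Tensor,
                     beta: float = 20.0) -> torch.Tensor:
    """Hubness correction with a query bank.

    sim_qg: [Q, G] test-query vs gallery similarities.
    sim_bank_g: [B, G] query-bank vs gallery similarities.
    Each gallery item's score is discounted by how strongly the bank
    as a whole attends to it, so 'hub' items are down-weighted.
    """
    denom = torch.exp(beta * sim_bank_g).sum(dim=0, keepdim=True)  # [1, G]
    return torch.exp(beta * sim_qg) / denom
```

The key practical difference from DSL: the normalisation statistics come from a pre-collected bank, so each test query can be scored independently.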
@EmiliaKKK I have read it, and I think it is much more practical than Dual Softmax, since it does not need access to all queries.
I have a question about this issue.
I have skimmed through the paper and thought that DSL is pretty similar to QB normalisation, as @EmiliaKKK pointed out.
However, I couldn't find any mention that the prior is also required at inference time, which is what would make DSL impractical in a real-world setting.
Have I just missed the fact that the prior is required during inference, or does the paper really not mention it?
@mzhaoshuai @m-bain Hi, I'm a beginner at this task (text-video retrieval). Could you help me with how to reproduce DSL on CLIP4Clip? Thanks in advance.
I think this trick should not be used at inference. From the code he released, the prior is only used in the loss function.
Of course, I haven't read the complete code, so this is just speculation; please feel free to correct me.
Hi, thanks for your work. I read the paper, and the boost from DSL is substantial, so it is worth investigating. However, my main criticisms of using this in practice are: a) at inference it requires all texts to be queried together in order to compute the prior; b) the prior that there is a one-to-one mapping between test-set queries and videos is not always true in the real world. You could do this with classification tasks if you know all classes have equal frequency, but in practice this is not the case. So I think this is an unrealistic setup for text-to-video retrieval: a user could spam the text query "boy running" 100 times, and this would cause catastrophic results for DSL.
Do you have results when DSL is used during training but not at test time? If it helps in that case, it would be good to know.