Open m-bain opened 2 years ago
Train on MSR-VTT 7k; Text -> Video R@1 / Video -> Text R@1 on the MSR-VTT 1kA test set are reported: https://paperswithcode.com/sota/video-retrieval-on-msr-vtt-1ka
bs is the validation batch size; it is also the dimension of sim_matrix in the code.
https://github.com/starmemda/CAMoE/blob/390e5ab47db13cba9e9b37934d42665f47e155cf/DSL.py#L9
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)
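For concreteness, the linked re-weighting line can be run standalone; here is a minimal sketch (the toy 4x4 similarity matrix is made up for illustration):

```python
import torch
import torch.nn.functional as F

# Toy text-video similarity matrix: rows = texts, cols = videos.
# Values are made up; the i-th text is assumed to match the i-th video.
sim_matrix_ = torch.tensor([
    [0.9, 0.2, 0.1, 0.3],
    [0.1, 0.8, 0.2, 0.2],
    [0.3, 0.1, 0.7, 0.2],
    [0.2, 0.3, 0.2, 0.6],
])

# DSL as in the linked code: re-weight each t2v score by a prior
# computed as a softmax over dim=0 (i.e. over all texts for each video).
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)

print(sim_matrix.shape)  # torch.Size([4, 4])
```

Note the prior is a function of the whole column, which is exactly why all queries must be available at once.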
baseline (no DSL, bs=16): 42.4 / 43.5
baseline + DSL (bs=16): 42.3 / 41.8
baseline + DSL (bs=64): 40.8 / 39.4
baseline + DSL (bs=128): 42.0 / 39.9
baseline + DSL (bs=384): 43.0 / 39.8
baseline + DSL (bs=1000): 48.1 / 43.5
It is clear that DSL leverages the mapping between text-video pairs, and it needs all (or most) of the pairs at once; otherwise the re-weighting does not work.
The above code led to a low Video -> Text R@1, so I modified the code according to the paper, i.e., I also re-weight the v2t side:
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)
sim_matrix = sim_matrix * F.softmax(sim_matrix_, dim=1) * len(sim_matrix_)
baseline + DSL (bs=1000): 45.7 / 46.3
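The two-sided variant above can be packaged as a helper; this is a sketch of my reading of it (the function name is mine, and both priors are computed from the original, un-reweighted matrix, as in the two lines above):

```python
import torch
import torch.nn.functional as F

def dual_softmax_both_sides(sim: torch.Tensor) -> torch.Tensor:
    """Re-weight a square [B, B] text-video similarity matrix in both
    directions: dim=0 is a softmax over texts (v2t prior), dim=1 is a
    softmax over videos (t2v prior). Both priors come from the raw sim."""
    out = sim * F.softmax(sim, dim=0) * len(sim)
    out = out * F.softmax(sim, dim=1) * len(sim)
    return out
```

Both multiplications use `F.softmax(sim, ...)` on the raw matrix, not the intermediate result, matching the snippet above.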
Next, only the v2t side (the paper's Figure 1 does seem to show two separate re-weighted matrices, though):
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=1) * len(sim_matrix_)
baseline + DSL (bs=1000): 42.4 / 45.5
sim_matrix = sim_matrix_ * F.softmax(sim_matrix_, dim=0) * len(sim_matrix_)
During training, DSL is applied within each mini-batch:
baseline (no DSL, bs=16): 0.3 / 37.6
baseline + DSL (bs=16): 36.5 / 35.1
baseline + DSL (bs=1000): 35.9 / 37.6
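For reference, here is a sketch of how DSL can be applied inside the training loss over a mini-batch; the symmetric cross-entropy formulation is my assumption (CLIP-style training), not something confirmed by the released code:

```python
import torch
import torch.nn.functional as F

def dsl_contrastive_loss(sim: torch.Tensor) -> torch.Tensor:
    """Symmetric contrastive loss with the DSL prior applied in-batch.

    sim: [B, B] text-video similarities for one mini-batch, where the
    i-th text is assumed to match the i-th video (the diagonal).
    """
    b = sim.size(0)
    labels = torch.arange(b, device=sim.device)
    # In-batch DSL prior (softmax over texts for each video).
    sim_dsl = sim * F.softmax(sim, dim=0) * b
    loss_t2v = F.cross_entropy(sim_dsl, labels)
    loss_v2t = F.cross_entropy(sim_dsl.t(), labels)
    return (loss_t2v + loss_v2t) / 2
```

Inside a training batch the one-to-one assumption actually holds by construction, which is why training-time DSL is uncontroversial compared to inference-time DSL.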
Well, I think my reproduction is problematic.
When only using the normal t2v score, we are selecting the best videos that match the input text. If we use the v2t score to re-weight the t2v score, we are selecting the text-video pairs that produce the highest mutual similarity. The latter is arguably stricter. It can also be seen as a re-ranking of the retrieved videos according to the v2t score, or as an ensemble of the t2v and v2t results.
The limitation is clear in this issue, as @m-bain said: it requires a one-to-one mapping between text-video pairs, which does not hold in most real cases.
Strongly recommend this paper: Cross Modal Retrieval with Querybank Normalisation https://arxiv.org/abs/2112.12777. DSL is similar to the "Inverted Softmax" in Section 3.4 of that paper. In my view, DSL is a solution to the hubness problem, but it may not be suitable for real-world cases.
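For comparison, a rough sketch of the inverted-softmax idea using a fixed query bank instead of the test queries (the function name, the temperature `beta`, and the precomputed-similarity interface are my assumptions; see the QB-Norm paper for the exact formulation and its activation-set variant):

```python
import torch

def inverted_softmax(sim_qg: torch.Tensor,
                     sim_bank_g: torch.Tensor,
                     beta: float = 20.0) -> torch.Tensor:
    """Hubness correction with a query bank.

    sim_qg: [Q, G] test-query vs gallery similarities.
    sim_bank_g: [B, G] query-bank vs gallery similarities.
    Each gallery item's score is discounted by how strongly the bank
    as a whole attends to it, so 'hub' items are down-weighted.
    """
    denom = torch.exp(beta * sim_bank_g).sum(dim=0, keepdim=True)  # [1, G]
    return torch.exp(beta * sim_qg) / denom
```

The key practical difference from DSL: the normalisation statistics come from a pre-collected bank, so each test query can be scored independently.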
@EmiliaKKK I have read it, and I think it is much more practical than Dual Softmax, since it does not need access to all queries.
I have a question about this issue.
I have skimmed through the paper and thought that DSL is pretty similar to QB normalisation, as @EmiliaKKK pointed out.
However, I couldn't find any mention that the prior is also required at inference time, which is what would make DSL impractical in a real-world setting.
Have I just missed the fact that the prior is required during inference, or does the paper really not mention it?
@mzhaoshuai @m-bain Hi, I'm a beginner at this task (text-video retrieval). Could you help me with how to reproduce DSL on CLIP4Clip? Thanks in advance.
I think this trick should not be used at inference. From the code he released, the prior is only used in the loss function.
Of course, I haven't read the complete code, so this is just speculation; please feel free to correct me.
Hi, thanks for your work. I read the paper, and the boost from DSL is substantial, so it is worth investigating. However, my main criticisms of using this in practice are: a) at inference it requires all texts to be queried together in order to compute the prior; b) the prior that there is a one-to-one mapping between test-set queries and videos is not always true in the real world. You could do this with classification tasks if you know all classes have equal frequency, but in practice this is not the case. So I think this is an unrealistic setup for text-to-video retrieval: a user could spam the text query "boy running" 100 times, and this would cause catastrophic results for DSL.
Do you have results when DSL is used during training but not at test time? If it helps in that case, it would be good to know.