Closed RuixiangZhao closed 1 year ago
Hi, Ruixiang,
Thanks for your interest in our work and your nice questions! Hope the following answers can help you. 🤗
- Are the text and video embeddings using the same codebooks?
Yes. We build a quantization module shared across text and video embeddings at each of the L+1 heads (i.e., 1 global head plus L local heads).
- In section 3.6, when calculating the Asymmetric Quantized Contrastive Learning (AQ-CL) loss, why did you consider using raw embeddings on one side and quantized embeddings on the other side? Why aren't both the text and video using quantized embeddings? Have you compared this setting?
Because the asymmetric loss helps to optimize ADC-based retrieval directly. Symmetric Quantized Contrastive Learning (SQ-CL) is better suited to SDC-based retrieval. The concepts of "ADC" and "SDC" are explained in the answer to the next question.
We did not run SDC-based retrieval experiments in our paper, but the modification is not complicated. Specifically, you can change the following code
to
```python
cls_term = code_cls_term * (1 - smoothing_weight) + feat_cls_term * smoothing_weight
vlad_term = code_vlad_term * (1 - smoothing_weight) + feat_vlad_term * smoothing_weight
```
Notes:
- `code_loss_weight` is the weight of the SQ-CL loss, which is set to 0 (disabled) in all configuration files.
- `smoothing_weight` is a dynamic weight that depends on the training iteration. It enables a warm-up strategy for both embedding and quantization learning.

In my expectation, SDC will yield results somewhat inferior to ADC. 🤔
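For intuition, here is a minimal sketch of how such a blending could look. This is not the repo's actual implementation; `warmup_iters` and the linear decay schedule are assumptions made purely for illustration:

```python
# Hedged sketch only: `warmup_iters` and the linear decay are illustrative assumptions,
# not the schedule actually used in the released code.
def get_smoothing_weight(iteration, warmup_iters=1000):
    # Start by relying on raw-embedding ("feat") terms, then gradually shift
    # weight onto the quantized ("code") terms as the codebooks warm up.
    return max(0.0, 1.0 - iteration / warmup_iters)

def blend(code_term, feat_term, smoothing_weight):
    # Same interpolation pattern as the suggested SDC-style modification above.
    return code_term * (1 - smoothing_weight) + feat_term * smoothing_weight
```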
- In section 3.7, you mentioned
To accelerate the computation, we can set up a query-specific lookup table.
Why is it possible to build such a lookup table in advance, when the actual retrieval scenario should have text queries appearing randomly and in real-time? How is the Ad-hoc text-to-video retrieval done in terms of inference and retrieval?
You raise a good in-depth question! Let me explain in detail.
Suppose there are N D-dim item embeddings in the database; then, for Q queries in total, brute-force search requires QxNxD operations.
When compressing embeddings with M-subcodebook, K-codeword product quantization (PQ), we divide each D-dim embedding into M segments, each of which is a (D/M)-dim sub-embedding quantized with a K-slot sub-codebook (i.e., dictionary). The quantization process approximates each segment with one codeword (i.e., slot) in its sub-codebook. After quantization, we store the whole database with M Kx(D/M)-dim sub-codebooks and NxM codeword (i.e., slot) indices.
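For concreteness, here is a minimal NumPy sketch of this PQ encoding step. It is illustrative only, with made-up shapes, and is not the code used in HCQ:

```python
import numpy as np

# Hypothetical PQ encoding sketch, not the repo's actual code.
# codebooks: M sub-codebooks, each with K codewords of dimension D/M -> shape (M, K, D/M)
def pq_encode(x, codebooks):
    """Quantize a D-dim embedding x into M codeword indices."""
    M, K, sub_dim = codebooks.shape
    segments = x.reshape(M, sub_dim)                           # split x into M segments
    # squared distance from each segment to every codeword of its sub-codebook
    dists = ((segments[:, None, :] - codebooks) ** 2).sum(-1)  # (M, K)
    return dists.argmin(axis=1)                                # (M,) codeword indices

# Example: D=512, M=32, K=256 -> each embedding is stored as 32 one-byte indices.
D, M, K = 512, 32, 256
codebooks = np.random.randn(M, K, D // M).astype(np.float32)
codes = pq_encode(np.random.randn(D).astype(np.float32), codebooks)
print(codes.shape)  # (32,)
```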
In inference, PQ-based search supports two distance/similarity computation strategies, namely symmetric distance computation (SDC) and asymmetric distance computation (ADC).
SDC estimates distances between the quantized query embedding and quantized item embeddings. We can pre-compute the inter-codeword distances as a table w.r.t. each of the M sub-codebooks, which takes MxKxKx(D/M) = KxKxD computations. The pre-computation is highly efficient since we have KxK << N for large-scale retrieval. Subsequently, SDC transforms the distance measurement into a lookup-and-sum task. For Q queries, SDC requires KxKxD computations in the pre-computation stage and QxNxM lookup-and-sum operations, which is significantly cheaper than brute-force search.
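A minimal sketch of the SDC pipeline, under the same illustrative assumptions as above (not the repo's code):

```python
import numpy as np

# Hypothetical SDC sketch: pre-compute a K x K distance table per sub-codebook,
# then estimate a distance by looking up (query_code, item_code) pairs and summing over M.
def build_sdc_tables(codebooks):
    M, K, sub_dim = codebooks.shape
    tables = np.empty((M, K, K), dtype=np.float32)
    for m in range(M):
        diff = codebooks[m][:, None, :] - codebooks[m][None, :, :]  # (K, K, D/M)
        tables[m] = (diff ** 2).sum(-1)                              # squared distances
    return tables                                                    # (M, K, K)

def sdc_distances(query_codes, item_codes, tables):
    # query_codes: (M,), item_codes: (N, M) -> estimated distances: (N,)
    M = tables.shape[0]
    return sum(tables[m, query_codes[m], item_codes[:, m]] for m in range(M))
```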
ADC estimates distances between the raw query embedding and quantized item embeddings. We can also pre-compute a table to accelerate distance estimation, but the table is query-specific. Concretely, we divide the raw query embedding into M (D/M)-dim segments and compute the distances between each segment and the K codewords of its sub-codebook. The pre-computation takes MxKx(D/M) = KxD computations. Subsequently, for Q queries, ADC requires QxKxD computations in the pre-computation stage and QxNxM lookup-and-sum operations. ADC consumes more time than SDC since Q >> K in large-scale retrieval, but it is still much more efficient than brute-force search.
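And the corresponding ADC sketch: the only difference is that the lookup table is built on the fly for each incoming query from its raw embedding, then reused across all N database items (again a hedged illustration, not the actual implementation):

```python
import numpy as np

# Hypothetical ADC sketch, not the repo's actual code.
def adc_distances(query, item_codes, codebooks):
    # query: (D,), item_codes: (N, M), codebooks: (M, K, D/M)
    M, K, sub_dim = codebooks.shape
    segments = query.reshape(M, sub_dim)
    # query-specific lookup table: distance of each query segment to every codeword, (M, K)
    table = ((segments[:, None, :] - codebooks) ** 2).sum(-1)
    # lookup-and-sum: N x M table lookups instead of N x D float operations
    return sum(table[m, item_codes[:, m]] for m in range(M))  # (N,)
```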
Empirically, ADC is more accurate than SDC for ad-hoc retrieval, so we take ADC by default to balance retrieval relevance and efficiency.
- Regarding the storage capacity of HCQ, you mentioned "The proposed HCQ represents an instance by 8 256-bit codes." Could you explain in more detail how you arrived at the 8 256-bit codes for each instance?
We chose 8x256-bit codes because there are 1 global head and 7 local heads by default. For each head, we use a 32-subcodebook, 256-codeword (i.e., M=32 and K=256) PQ module by default.
It is preferable to fix K=256 so that each codeword index can be encoded with one byte (i.e., log_2(256)=8 bits); this is a common setting in the vector quantization literature. That is to say, the number of bits per head is typically M*8, so there are 32*8=256 bits per head.
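Putting the numbers together (simple arithmetic, using the default configuration stated above):

```python
# Storage per instance under the default HCQ configuration described above.
num_heads = 1 + 7              # 1 global head + 7 local heads
M, K = 32, 256                 # sub-codebooks per head, codewords per sub-codebook
bits_per_index = 8             # log2(256) = 8, i.e., one byte per codeword index
bits_per_head = M * bits_per_index        # 32 * 8 = 256 bits
total_bits = num_heads * bits_per_head    # 8 * 256 = 2048 bits = 256 bytes per instance
print(bits_per_head, total_bits)          # 256 2048
```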
We have explored the effect of the number of local heads, L, and of the number of sub-codebooks, M, in Figure 6. The experimental logs have been released in this repository; you can refer to our paper and the logs for more details. ☺
Thank you very much for your detailed reply, which answered my questions.
Hi, many thanks for your nice paper and the well-organized open-sourced codebase.
I have some questions about HCQ after reading your paper:
Are the text and video embeddings using the same codebooks?
In section 3.6, when calculating the Asymmetric Quantized Contrastive Learning (AQ-CL) loss, why did you consider using raw embeddings on one side and quantized embeddings on the other side? Why aren't both the text and video using quantized embeddings? Have you compared this setting?
In section 3.7, you mentioned "To accelerate the computation, we can set up a query-specific lookup table." Why is it possible to build such a lookup table in advance, when the actual retrieval scenario should have text queries appearing randomly and in real time? How is the ad-hoc text-to-video retrieval done in terms of inference and retrieval?
Regarding the storage capacity of HCQ, you mentioned "The proposed HCQ represents an instance by 8 256-bit codes." Could you explain in more detail how you arrived at the 8 256-bit codes for each instance?
Looking forward to your reply.