gimpong / WWW22-HCQ

The code for the paper "Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval" (WWW'22, Oral).

Some questions about HCQ #2

Closed · RuixiangZhao closed this issue 1 year ago

RuixiangZhao commented 1 year ago

Hi, many thanks for your nice paper and the well-organized open-source codebase.

I have some questions about HCQ after reading your paper:

Looking forward to your reply.

gimpong commented 1 year ago

Hi, Ruixiang,

Thanks for your interest in our work and your nice questions! Hope the following answers can help you. 🤗

  • Are the text and video embeddings using the same codebooks?

Yes. We build a quantization module shared across text and video embeddings at each of the L+1 heads (i.e., 1 global head plus L local heads).
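
A minimal PyTorch-style sketch of this sharing, assuming a soft codeword assignment for differentiability (the class name SharedPQHead, the dimensions, and the assignment scheme are illustrative, not the exact implementation in this repo):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SharedPQHead(nn.Module):
        """One quantization head whose codebooks are shared by text and video embeddings."""
        def __init__(self, dim=1024, n_subcodebooks=32, n_codewords=256):
            super().__init__()
            self.M, self.K = n_subcodebooks, n_codewords
            self.sub_dim = dim // n_subcodebooks
            # a single set of codebooks, used for both modalities
            self.codebooks = nn.Parameter(torch.randn(self.M, self.K, self.sub_dim))

        def forward(self, x):  # x: (batch, dim), either text or video embeddings
            segments = x.view(x.size(0), self.M, self.sub_dim)            # (B, M, D/M)
            # soft assignment over codewords (hard assignment would be used at indexing time)
            logits = torch.einsum('bmd,mkd->bmk', segments, self.codebooks)
            weights = F.softmax(logits, dim=-1)
            quantized = torch.einsum('bmk,mkd->bmd', weights, self.codebooks)
            return quantized.reshape(x.size(0), -1)                       # (B, dim)

    # the same head object is applied to both modalities, so they share codebooks:
    # head = SharedPQHead(); q_text = head(text_emb); q_video = head(video_emb)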

  • In section 3.6, when calculating the Asymmetric Quantized Contrastive Learning (AQ-CL) loss, why did you consider using raw embeddings on one side and quantized embeddings on the other side? Why aren't both the text and video using quantized embeddings? Have you compared this setting?

Because the asymmetric loss helps to optimize ADC-based retrieval directly (see the sketch at the end of this answer), whereas Symmetric Quantized Contrastive Learning (SQ-CL) is better suited to SDC-based retrieval. The concepts of "ADC" and "SDC" are explained in the answer to a later question below.

We did not run SDC-based retrieval experiments in our paper, but the modification is not complicated. Specifically, you can change the following code

https://github.com/gimpong/WWW22-HCQ/blob/43e796a8ee43d7aafb7b0a7a67457ea49be6e8bb/model/loss.py#L213-L216

to

    # blend the quantized-code terms with the raw-feature terms for the global (cls) and local (vlad) heads
    cls_term = code_cls_term * (1 - smoothing_weight) + feat_cls_term * smoothing_weight
    vlad_term = code_vlad_term * (1 - smoothing_weight) + feat_vlad_term * smoothing_weight

Note: I expect SDC to yield somewhat inferior results compared with ADC. 🤔
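
For intuition, here is a minimal, generic sketch of the asymmetric pairing (raw embeddings on one side, quantized embeddings on the other) in an InfoNCE form. It is illustrative only and does not reproduce the exact terms in model/loss.py; the function name and the temperature value are hypothetical:

    import torch
    import torch.nn.functional as F

    def aq_cl_sketch(text_feat, video_feat, text_quant, video_quant, temperature=0.05):
        """Illustrative asymmetric quantized contrastive loss:
        raw embeddings on one side are contrasted with quantized embeddings on the other,
        mirroring ADC-based retrieval (raw query vs. quantized database)."""
        t = F.normalize(text_feat, dim=-1)
        v = F.normalize(video_feat, dim=-1)
        tq = F.normalize(text_quant, dim=-1)
        vq = F.normalize(video_quant, dim=-1)
        labels = torch.arange(t.size(0), device=t.device)
        # raw text queries against quantized video items
        sim_t2v = t @ vq.t() / temperature
        # raw video queries against quantized text items
        sim_v2t = v @ tq.t() / temperature
        return 0.5 * (F.cross_entropy(sim_t2v, labels) + F.cross_entropy(sim_v2t, labels))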

  • In section 3.7, you mentioned "To accelerate the computation, we can set up a query-specific lookup table." Why is it possible to build such a lookup table in advance, when the actual retrieval scenario should have text queries appearing randomly and in real time? How is ad-hoc text-to-video retrieval done in terms of inference and retrieval?

You raise a good in-depth question! Let me explain in detail.

Suppose there are N D-dim item embeddings in the database. Then, for Q queries in total, brute-force search requires Q×N×D operations.

When compressing embeddings with M-sub-codebook, K-codeword product quantization (PQ), we divide each D-dim embedding into M segments. Each segment is a (D/M)-dim sub-embedding that is quantized with its own K-codeword sub-codebook (i.e., dictionary), meaning it is approximated by the nearest codeword (i.e., slot) in that sub-codebook. After quantization, the whole database is stored as M sub-codebooks of size K×(D/M) plus N×M codeword indices, as sketched below.
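
A minimal NumPy sketch of this encoding step (the function name and shapes are illustrative):

    import numpy as np

    def pq_encode(database, codebooks):
        """database: (N, D) embeddings; codebooks: (M, K, D//M) sub-codebooks.
        Returns (N, M) uint8 codeword indices (assuming K <= 256)."""
        N, D = database.shape
        M, K, sub_dim = codebooks.shape
        segments = database.reshape(N, M, sub_dim)   # split each vector into M segments
        codes = np.empty((N, M), dtype=np.uint8)
        for m in range(M):
            # squared Euclidean distance from every segment to every codeword of sub-codebook m
            dists = ((segments[:, m, None, :] - codebooks[m][None, :, :]) ** 2).sum(-1)  # (N, K)
            codes[:, m] = dists.argmin(axis=1)       # nearest codeword index
        return codes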

At inference time, PQ-based search supports two distance/similarity computation strategies, namely symmetric distance computation (SDC) and asymmetric distance computation (ADC). In SDC, the query is quantized as well, so similarities can be read from codeword-to-codeword tables that are precomputed once and shared by all queries. In ADC, the query stays in its raw form and the lookup table is query-specific: it is built when the query arrives, not before. Concretely, once a text query comes in, we compute the similarity between each of its M segments and the K codewords of the corresponding sub-codebook, which costs only K×D operations and yields an M×K lookup table; scoring each database item then reduces to summing the M table entries indexed by its stored codes. Each query therefore costs roughly K×D + N×M operations instead of N×D, so ad-hoc, real-time queries are handled naturally (see the sketch below).

Empirically, ADC is more accurate than SDC for ad-hoc retrieval, since the query side loses no information to quantization, so we adopt ADC by default to balance retrieval relevance and efficiency.
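
And a matching NumPy sketch of ADC retrieval with a query-specific lookup table (again, names are illustrative and inner-product similarity is assumed):

    import numpy as np

    def adc_search(query, codebooks, codes, topk=10):
        """query: (D,) raw query embedding; codebooks: (M, K, D//M); codes: (N, M) codeword indices.
        Builds the query-specific lookup table at query time, then scores all N items by table lookups."""
        M, K, sub_dim = codebooks.shape
        q_segments = query.reshape(M, sub_dim)
        # query-specific lookup table: similarity of each query segment to every codeword (M x K),
        # built only after the query arrives -- costs K*D multiply-adds
        lut = np.einsum('md,mkd->mk', q_segments, codebooks)
        # score each item by summing M looked-up entries (N*M additions instead of N*D multiply-adds)
        scores = lut[np.arange(M)[None, :], codes].sum(axis=1)   # (N,)
        return np.argsort(-scores)[:topk]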

  • Regarding the storage capacity of HCQ, you mentioned "The proposed HCQ represents an instance by 8 256-bit codes." Could you explain in more detail how you arrived at the 8 256-bit codes for each instance?

We chose 8×256-bit codes because there are 1 global head and 7 local heads by default, and for each head we use a 32-sub-codebook, 256-codeword (i.e., M=32 and K=256) PQ module.

It is preferable to fix K=256 so that each codeword index can be encoded with one byte (i.e., log_2(256)=8 bits), which is a common setting in the vector quantization literature. That is to say, the number of bits per head is M×8, i.e., 32×8=256 bits per head, as computed below.
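
To make the arithmetic explicit (a quick back-of-the-envelope check in Python):

    import math

    n_heads = 1 + 7                         # 1 global head + L = 7 local heads
    M, K = 32, 256                          # sub-codebooks per head, codewords per sub-codebook
    bits_per_head = M * int(math.log2(K))   # 32 * 8 = 256 bits
    total_bits = n_heads * bits_per_head    # 8 * 256 = 2048 bits = 256 bytes per instance
    print(bits_per_head, total_bits)        # 256 2048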

We have explored the effect of the number of local heads, L, and of the number of sub-codebooks, M, in Figure 6. The experimental logs have been released in this repository; please refer to our paper and the logs for more details. ☺

RuixiangZhao commented 1 year ago

Thank you very much for your detailed reply, which answered my questions.