navervision / lincir

Official PyTorch implementation of LinCIR: Language-only Training of Zero-shot Composed Image Retrieval (CVPR 2024)

About the inference #3

Closed Pefect96 closed 10 months ago

Pefect96 commented 10 months ago

I think it's very interesting work!

The training process is clear, but there seems to be some ambiguity about the inference. For example, if the pre-trained module \phi receives images directly as input, how is its output combined with the text conditions during inference? Table B.5 shows the results of different prompts; which prompt is used in Tables 2-5?

Looking forward to the author's reply!

SanghyukChun commented 10 months ago

All numbers in the main tables are produced with the default prompt a photo of [$] that [cond]. This is for a fair comparison with previous methods, such as Pic2Word and SEARLE.
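
For reference, here is a minimal, self-contained sketch of how that prompt is used at inference time. The tiny modules below are random stand-ins (not the actual repository code) so the snippet runs end to end; in LinCIR they would be the frozen CLIP encoders and the trained projection module \phi:

```python
import torch
import torch.nn as nn

d_img, d_txt, vocab_size, seq_len = 768, 768, 1000, 77
PLACEHOLDER_ID = 999  # hypothetical token id standing in for the [$] pseudo-token

clip_image_encoder = nn.Linear(3 * 224 * 224, d_img)  # stand-in for the CLIP image tower
phi = nn.Linear(d_img, d_txt)                         # stand-in for the projection module \phi
token_embedding = nn.Embedding(vocab_size, d_txt)     # stand-in for CLIP's token-embedding table
text_tower = nn.Linear(d_txt, d_txt)                  # stand-in for the CLIP text tower

@torch.no_grad()
def query_embedding(image: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
    """Composed query embedding for one (reference image, text condition) pair."""
    pseudo_token = phi(clip_image_encoder(image.flatten(1)))  # plays the role of [$]
    embeds = token_embedding(token_ids)                       # (1, seq_len, d_txt)
    pos = (token_ids == PLACEHOLDER_ID).nonzero()[0, 1]       # position of [$] in the prompt
    embeds[:, pos] = pseudo_token                             # inject the image-derived token
    return text_tower(embeds.mean(dim=1))                     # crude pooling stand-in

image = torch.randn(1, 3, 224, 224)
# token ids for "a photo of [$] that is solid white", with [$] mapped to PLACEHOLDER_ID
token_ids = torch.randint(0, 998, (1, seq_len))
token_ids[0, 4] = PLACEHOLDER_ID
print(query_embedding(image, token_ids).shape)  # torch.Size([1, 768])
```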

Pefect96 commented 10 months ago

Thank you for your reply! Then why is the result in Table 4 different from Table B.5? Are the results in Table B.5 based on ViT-L?

SanghyukChun commented 10 months ago

Thanks for pointing that out! I hadn't noticed it. The results are based on ViT-L, but I presume that different settings are used for Table 4 and B.5: you can check that the LinCIR results are almost the same in Table 4 and B.5, so the gap comes from the prompts and the pre-processing. Please check the comment below. I will double-check the numbers for Pic2Word and SEARLE. Thanks!

SanghyukChun commented 10 months ago

@Pefect96 We just checked what causes the differences. Here are the reasons:

  1. Each FashionIQ query image has two text conditions. For example, here is an example query (see the official code: https://github.com/XiaoxiaoGuo/fashion-iq/tree/master/captions). Therefore, when we evaluate FIQ, we have to merge the two text conditions.
    {
        "target": "B005AD7WZI",
        "candidate": "B00CZ7QJUG",
        "captions": [
            "is solid white",
            "is a lighter color"
        ]
    },
  2. We use the official Pic2Word and SEARLE inference code for Table 5, so each method follows its own protocol (see the first sketch after this list):
    • Pic2Word: make a single embedding using a photo of [$], [cond1] and [cond2]. This embedding is used for the retrieval.
    • SEARLE: make two embeddings using a photo of [$] that [condX] with cond1 and cond2. The averaged embedding is used for the retrieval.
  3. The official SEARLE code uses image pre-processing that differs from the official CLIP pre-processing: SEARLE uses target-pad pre-processing, which makes a non-square image square by padding, whereas ours uses the official CLIP pre-processing, which makes a square image by center cropping (see the second sketch after this list).
  4. In Table B.5, we use the same prompt and the same pre-processing for all methods:
    • Prompt: make two embeddings using a photo of [$] that [condX] with cond1 and cond2, and use the averaged embedding for the retrieval => this changes the numbers for Pic2Word.
    • Image pre-processing: center crop to make a square image => this changes the numbers for SEARLE.
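
Here is a quick sketch of points 1-2: how the two FashionIQ captions are turned into a query embedding under each protocol. The `encode` helper below is a hypothetical stand-in for the full image-plus-prompt pipeline, not the official evaluation code:

```python
import torch

def encode(prompt: str) -> torch.Tensor:
    """Stand-in for the full pipeline that turns a prompt containing [$]
    (together with the reference image) into a CLIP-space query embedding."""
    torch.manual_seed(abs(hash(prompt)) % (2 ** 31))
    return torch.randn(768)

captions = ["is solid white", "is a lighter color"]  # the two FIQ text conditions above

# Pic2Word protocol (main tables): one prompt that merges both captions into a single embedding.
pic2word_query = encode(f"a photo of [$], {captions[0]} and {captions[1]}")

# SEARLE protocol (and the unified Table B.5 protocol for all methods):
# one prompt per caption, then average the two embeddings.
searle_query = torch.stack(
    [encode(f"a photo of [$] that {c}") for c in captions]
).mean(dim=0)
```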
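
And a simplified sketch of the pre-processing difference in point 3. SEARLE's actual target-pad transform has a few more options; this only illustrates pad-to-square vs. center crop:

```python
from PIL import Image
from torchvision import transforms as T
import torchvision.transforms.functional as F

class PadToSquare:
    """Pad the shorter side so a non-square image becomes square (nothing is cropped)."""
    def __call__(self, img: Image.Image) -> Image.Image:
        w, h = img.size
        side = max(w, h)
        pad_w, pad_h = side - w, side - h
        return F.pad(img, [pad_w // 2, pad_h // 2, pad_w - pad_w // 2, pad_h - pad_h // 2])

# SEARLE-style: pad to a square first, then resize; the whole image content is kept.
pad_preprocess = T.Compose([PadToSquare(), T.Resize(224), T.ToTensor()])

# Official CLIP-style (what we use): resize the shorter side and center crop,
# which discards the borders of non-square images.
clip_preprocess = T.Compose([T.Resize(224), T.CenterCrop(224), T.ToTensor()])
```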

Therefore, Table B.5 is slightly different from Table 4: Table B.5 uses the same prompt engineering and the same image pre-processing for every method, whereas Table 4 uses each method's own prompts and pre-processing (the models are the same). This is because the main purpose of Table B.5 is to show the ability to handle various text prompts, not a state-of-the-art comparison.

Thanks for your question. We will include these details in a later revision. Is anything still unclear to you?

Pefect96 commented 10 months ago

Thank you for your reply.