Closed Pefect96 closed 10 months ago
All numbers in the main tables are produced with the default `a photo of [$] that [cond]` prompt. This is for a fair comparison with previous methods, such as Pic2Word and SEARLE.
Thank you for your reply! So why are the results in Table 4 different from Table B.5? Are the results in Table B.5 based on ViT-L?
Thanks for pointing this out! I hadn't noticed it.
The results are based on ViT-L, but I presume that different models are used for Table 4 and Table B.5. You can check that the LinCIR results are almost the same in Table 4 and B.5. The differences come from the prompts and the pre-processing. Please check the comment below.
I will check the numbers for Pic2Word and SEARLE. Thanks!
@Pefect96 We just checked what makes the differences. Here are the reasons:
Each annotation provides two captions per query, e.g.:

```json
{
    "target": "B005AD7WZI",
    "candidate": "B00CZ7QJUG",
    "captions": [
        "is solid white",
        "is a lighter color"
    ]
}
```

- **Prompt:** Pic2Word combines the two captions into a single prompt, `a photo of [$], [cond1] and [cond2]`, and this embedding is used for the retrieval. Ours builds `a photo of [$] that [condX]` separately with `cond1` and with `cond2`, and the averaged embedding is used for the retrieval.
- **Pre-processing:** The other methods use the `target-pad` pre-processing, which makes a non-squared image squared by padding. On the other hand, ours uses the official CLIP pre-processing, which uses a center crop to make a square image.

In Table B.5, every method uses `a photo of [$] that [condX]` with `cond1` and `cond2`, and the averaged embedding is used for the retrieval => this makes a difference for Pic2Word.

Therefore, Table B.5 is a little bit different from Table 4: Table B.5 uses the same prompt engineering and the same image pre-processing for all methods, whereas Table 4 uses each method's own prompts and pre-processing (the models are the same). This is because the main purpose of Table B.5 is to show the ability to handle various text prompts, not to aim at a state-of-the-art comparison.
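As a toy illustration of the two prompt strategies above (not from any official codebase — `encode_text` is a fake stand-in for a real CLIP text encoder, returning a deterministic unit-norm vector):

```python
import hashlib
import math

def encode_text(prompt: str) -> list[float]:
    """Fake text encoder: deterministic 8-dim pseudo-embedding from a hash."""
    h = hashlib.sha256(prompt.encode()).digest()
    v = [b / 255.0 for b in h[:8]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def average(vectors: list[list[float]]) -> list[float]:
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

captions = ["is solid white", "is a lighter color"]

# Pic2Word-style: both captions in one prompt, one embedding for retrieval.
pic2word_prompt = f"a photo of [$], {captions[0]} and {captions[1]}"
pic2word_emb = encode_text(pic2word_prompt)

# Averaged-style (as described above): one prompt per caption,
# then the averaged embedding is used for retrieval.
our_prompts = [f"a photo of [$] that {c}" for c in captions]
our_emb = average([encode_text(p) for p in our_prompts])

print(pic2word_prompt)  # a photo of [$], is solid white and is a lighter color
print(our_prompts)
print(len(our_emb))     # 8
```

With a real encoder, the single combined prompt and the averaged per-caption embeddings generally land at different points in the text embedding space, which is exactly why the two protocols give different retrieval numbers.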
Thanks for your question. We will include these details in a later revision. Is there anything still unclear to you?
Thank you for your reply.
I think it's very interesting work!
The training process is clear, but there seems to be some ambiguity about the inference. For example, if the pre-trained module \phi receives images directly as input, how does it concatenate the output with the conditions during inference? Table B.5 shows the results of different prompts. Which prompt do the authors use in Tables 2-5?
Looking forward to the author's reply!