jianjieluo / SCD-Net

[CVPR23] A cascaded diffusion captioning model with a novel semantic-conditional diffusion process that upgrades the conventional diffusion model with an additional semantic prior.
https://arxiv.org/abs/2212.03099

How to use the cross-modal retrieval model in this task? #3

Sparkle-Q closed this issue 10 months ago

Sparkle-Q commented 1 year ago

I couldn't find the code that retrieves semantically relevant sentences from the training sentence pool using an off-the-shelf cross-modal retrieval model, as described in the paper. Could you please show me where this step is done in the code?

jianjieluo commented 1 year ago

Hi @Sparkle-Q,

Sorry for the late response. We retrieve the semantic prior by computing the CLIP cosine similarity score between the input image and each sentence in the training sentence pool. You can check this repo for further reference.

Best, Jianjie
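
For readers following along, here is a minimal sketch of the retrieval step described in the reply above: rank the sentences in a pool by cosine similarity to an image embedding and keep the top k. This is not the repository's code. The function names (`l2_normalize`, `retrieve_top_k`) and the toy 3-dimensional vectors are made up for illustration; in practice the embeddings would presumably come from CLIP's image and text encoders and the ranking would be vectorised.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve_top_k(image_emb, sentence_embs, sentences, k=1):
    """Return the k (score, sentence) pairs most similar to the image embedding.

    Hypothetical helper: assumes image_emb and every entry of sentence_embs
    live in the same embedding space (e.g. produced by the same CLIP model).
    """
    if len(sentence_embs) != len(sentences):
        raise ValueError("need exactly one embedding per sentence")
    query = l2_normalize(image_emb)
    scored = []
    for sentence, emb in zip(sentences, sentence_embs):
        unit = l2_normalize(emb)
        score = sum(q * s for q, s in zip(query, unit))  # cosine similarity
        scored.append((score, sentence))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data standing in for real embeddings (assumption: illustrative values only).
sentence_pool = [
    "a dog runs on the grass",
    "a plate of food on a table",
    "a man rides a bike",
]
pool_embs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 0.95],
]
image_emb = [0.8, 0.2, 0.1]

top = retrieve_top_k(image_emb, pool_embs, sentence_pool, k=2)
for score, sentence in top:
    print(f"{score:.3f}  {sentence}")
```

With a real model, the sentence embeddings for the whole pool would typically be computed once and cached, so that only the image embedding has to be computed at query time.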

1301358882 commented 2 months ago

Hello, may I ask how the training sentence pool is obtained? Where can I find the code? Thank you very much.