Closed hfutzzw closed 1 year ago
hi hfutzzw: Of course. We first generate a paragraph for each image in COCO dataset. Then we use zero-shot retrieval based on a conventional BERT model and frozen the parameter (without image encoder).
At last, we normalise the output of paragraph and original caption. Compute the similarity score and select the top-K samples.
Marked as resolved for no response.
Hi, thanks for your interesting work. Could you explain why better retrieval result on COCO achieved by the Image2Paragraph method more clearly?