Hi FastComposer team,
Kudos on this insightful and amazing work, and thanks for sharing the code with the community!
In the Dataset Construction part of the paper (Section 5.1), it is mentioned that:
> Finally, we use a greedy matching algorithm to match noun phrases with image segments. We do this by considering the product of the image-text similarity score by the OpenCLIP model (CLIP-ViT-H-14-laion2B-s32B-b79K) and the label-text similarity score by the Sentence-Transformer model (stsb-mpnet-base-v2).
Could you please clarify this step further? If I understand correctly, the OpenCLIP image features of each segment are compared against the OpenCLIP text features of the noun phrases, and the Sentence-Transformer compares the segments' predicted labels with the noun phrases, with the two similarity scores then multiplied. Is that correct?
If so, how is an image segment given as input to the OpenCLIP model? Is the part of the image outside the segment masked out (e.g., set to 0 / black pixels)?
It would be great if you could share the code for this process too.
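In the meantime, here is a minimal sketch of how I currently imagine this step works, just to make the question concrete. The zero-masking of the background, the per-segment class labels, and the exact greedy loop are all assumptions on my part, and every function and variable name below is my own:

```python
# Sketch of my current understanding of the segment <-> noun-phrase matching.
# Assumptions (please correct me if any are wrong):
#   * each segment comes with a binary mask and a predicted class label,
#   * pixels outside the segment mask are zeroed before feeding OpenCLIP,
#   * greedy matching repeatedly takes the highest product score and then
#     removes the matched segment and phrase.

import numpy as np
import torch
from PIL import Image
import open_clip
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", device=device
)
clip_tokenizer = open_clip.get_tokenizer("ViT-H-14")
st_model = SentenceTransformer("stsb-mpnet-base-v2", device=device)


def match_segments_to_phrases(image, masks, labels, noun_phrases):
    """Greedily match image segments to noun phrases.

    image: PIL.Image
    masks: list of HxW boolean arrays (one per segment)
    labels: list of predicted class labels for the segments (str)
    noun_phrases: list of noun phrases from the caption (str)
    Returns a list of (segment_index, phrase_index) pairs.
    """
    img_np = np.array(image)

    # 1) Image-text similarity with OpenCLIP on zero-masked segments.
    segment_tensors = []
    for mask in masks:
        masked = img_np.copy()
        masked[~mask] = 0  # is this how the background is removed?
        segment_tensors.append(clip_preprocess(Image.fromarray(masked)))
    seg_batch = torch.stack(segment_tensors).to(device)
    text_tokens = clip_tokenizer(noun_phrases).to(device)
    with torch.no_grad():
        seg_feat = clip_model.encode_image(seg_batch)
        txt_feat = clip_model.encode_text(text_tokens)
    seg_feat = seg_feat / seg_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    image_text_sim = (seg_feat @ txt_feat.T).cpu()  # (num_segments, num_phrases)

    # 2) Label-text similarity with the Sentence-Transformer.
    label_emb = st_model.encode(labels, convert_to_tensor=True)
    phrase_emb = st_model.encode(noun_phrases, convert_to_tensor=True)
    label_text_sim = util.cos_sim(label_emb, phrase_emb).cpu()

    # 3) Product of the two scores, then greedy matching.
    scores = (image_text_sim * label_text_sim).numpy()
    matches = []
    while np.isfinite(scores).any():
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        matches.append((int(i), int(j)))
        scores[i, :] = -np.inf  # segment i is taken
        scores[:, j] = -np.inf  # phrase j is taken
        if len(matches) == min(len(masks), len(noun_phrases)):
            break
    return matches
```

If the actual pipeline differs (e.g., cropped bounding boxes instead of zero-masked segments, or a different greedy criterion), I'd really appreciate knowing where my sketch goes wrong.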
Thanks a lot!