Hi FastComposer team,
Kudos on this insightful and amazing work, and thanks for sharing the code with the community!
In the Dataset Construction part of the paper (Section 5.1), it is mentioned that:
> Finally, we use a greedy matching algorithm to match noun phrases with image segments. We do this by considering the product of the image-text similarity score by the OpenCLIP model (CLIP-ViT-H-14-laion2B-s32B-b79K) and the label-text similarity score by the Sentence-Transformer model (stsb-mpnet-base-v2).
Could you please clarify this step further? If I understand correctly, the OpenCLIP image features of each segment are compared against the OpenCLIP text features of the noun phrases, and the Sentence-Transformer compares the segments' predicted labels with the noun phrases, with the two similarity scores then multiplied. Is that correct?
If so, how is an image segment given as input to the OpenCLIP model? Is the part of the image outside the segment masked out (e.g., set to 0 / black pixels)?
It would be great if you could share the code for this process too.
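In the meantime, here is a minimal sketch of how I currently imagine this step works, just to make the question concrete. The zero-masking of the background, the per-segment class labels, and the exact greedy loop are all assumptions on my part, and every function and variable name below is my own:

```python
# Sketch of my current understanding of the segment <-> noun-phrase matching.
# Assumptions (please correct me if any are wrong):
#   * each segment comes with a binary mask and a predicted class label,
#   * pixels outside the segment mask are zeroed before feeding OpenCLIP,
#   * greedy matching repeatedly takes the highest product score and then
#     removes the matched segment and phrase.

import numpy as np
import torch
from PIL import Image
import open_clip
from sentence_transformers import SentenceTransformer, util

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k", device=device
)
clip_tokenizer = open_clip.get_tokenizer("ViT-H-14")
st_model = SentenceTransformer("stsb-mpnet-base-v2", device=device)


def match_segments_to_phrases(image, masks, labels, noun_phrases):
    """Greedily match image segments to noun phrases.

    image: PIL.Image
    masks: list of HxW boolean arrays (one per segment)
    labels: list of predicted class labels for the segments (str)
    noun_phrases: list of noun phrases from the caption (str)
    Returns a list of (segment_index, phrase_index) pairs.
    """
    img_np = np.array(image)

    # 1) Image-text similarity with OpenCLIP on zero-masked segments.
    segment_tensors = []
    for mask in masks:
        masked = img_np.copy()
        masked[~mask] = 0  # is this how the background is removed?
        segment_tensors.append(clip_preprocess(Image.fromarray(masked)))
    seg_batch = torch.stack(segment_tensors).to(device)
    text_tokens = clip_tokenizer(noun_phrases).to(device)
    with torch.no_grad():
        seg_feat = clip_model.encode_image(seg_batch)
        txt_feat = clip_model.encode_text(text_tokens)
    seg_feat = seg_feat / seg_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    image_text_sim = (seg_feat @ txt_feat.T).cpu()  # (num_segments, num_phrases)

    # 2) Label-text similarity with the Sentence-Transformer.
    label_emb = st_model.encode(labels, convert_to_tensor=True)
    phrase_emb = st_model.encode(noun_phrases, convert_to_tensor=True)
    label_text_sim = util.cos_sim(label_emb, phrase_emb).cpu()

    # 3) Product of the two scores, then greedy matching.
    scores = (image_text_sim * label_text_sim).numpy()
    matches = []
    while np.isfinite(scores).any():
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        matches.append((int(i), int(j)))
        scores[i, :] = -np.inf  # segment i is taken
        scores[:, j] = -np.inf  # phrase j is taken
        if len(matches) == min(len(masks), len(noun_phrases)):
            break
    return matches
```

If the actual pipeline differs (e.g., cropped bounding boxes instead of zero-masked segments, or a different greedy criterion), I'd really appreciate knowing where my sketch goes wrong.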
Thanks a lot!