Hi itruonghai, thanks for your interest in our work, and sorry for the late reply.
Q1: Exactly. We can calculate the similarity of the text token over all vision tokens (e.g., 196) first, then select the top-k highest-valued matched pairs. I provide an example in Visualization.
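As a rough sketch of this step (the tensor names and sizes here are illustrative, assuming L2-normalised embeddings from the model):

```python
import torch

# Illustrative inputs: one text-token embedding [D] and 196 vision-patch
# embeddings [196, D], L2-normalised so a dot product is a cosine similarity
text_emb = torch.randn(512)
text_emb = text_emb / text_emb.norm()
vision_embs = torch.randn(196, 512)
vision_embs = vision_embs / vision_embs.norm(dim=-1, keepdim=True)

# Similarity of the text token over all vision tokens, then the top-k matched pairs
sims = vision_embs @ text_emb          # [196]
topk_vals, topk_idx = sims.topk(k=5)   # the 5 best-matching patch indices
```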
Q2: Yes. You need to implement an average pooling layer to select masks; more details follow Q1 (see the sketch below).
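As a rough sketch (the `masks` tensor here is a hypothetical placeholder, and it reuses `sims` from the Q1 sketch): average-pool the per-patch similarities inside each candidate mask and keep the highest-scoring one:

```python
# Hypothetical candidate masks: [M, 196], one binary patch mask per region
masks = (torch.rand(10, 196) > 0.5).float()

# Average-pool the similarities within each mask, then select the best mask
pooled = (masks * sims).sum(dim=1) / masks.sum(dim=1).clamp(min=1)  # [M]
best_mask = masks[pooled.argmax()]
```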
Q3: Note that a softmax is applied after the logit values to normalise over multiple text captions, e.g. `probs = logits_per_image.softmax(dim=-1).cpu().numpy()`. Follow the example in CLIP for more details.
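For reference, here is a minimal sketch of that CLIP pattern (the image path and captions are placeholders):

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("demo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a dog", "a cat", "a car"]).to(device)

with torch.no_grad():
    # logits_per_image has shape [num_images, num_captions]
    logits_per_image, logits_per_text = model(image, text)
    # softmax over the caption axis turns the logits into probabilities
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()
```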
Thank you @FingerRec for answering my question. I'm still working on Q1 and Q2. However, regarding Q3, I just want to check whether the embeddings of the text and image are good enough, so I think I do not need to use the softmax as you said above. Could you please check Q3 again, or provide an example for us to use? Thank you.
Hi itruonghai, computing the similarity for an image-text pair is quite simple, as below:
```python
import torch
from models.blip_itm import blip_itm

def process_one_sample(model, image, caption):
    with torch.no_grad():
        # ITM head: binary match / no-match classifier; keep the "match" probability
        itm_output = model(image, caption, match_head='itm')
        itm_score = torch.nn.functional.softmax(itm_output, dim=1)[:, 1]
        # ITC head: contrastive similarity between the image and text embeddings
        itc_score = model(image, caption, match_head='itc')
    return itm_score, itc_score

# model_url, image_size, device, img_path, transform and caption follow the BLIP demo
model = blip_itm(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device=device)
image = load_demo_image(img_path, transform)

itm_score, itc_score = process_one_sample(model, image, caption)
# print(itm_score, itc_score)
itm_score = str(itm_score.detach().cpu().numpy()[0])     # tensor to str
itc_score = str(itc_score.detach().cpu().numpy()[0][0])  # tensor to str
```
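Note that the ITM score comes from a binary match/no-match classifier over the fused image-text features (hence the softmax and taking column 1), while the ITC score is the contrastive similarity between the unimodal embeddings, so the two values are on different scales.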
Hi @FingerRec, @panzhous
Thank you for your great work. This work can have a significant impact in the VLP field. I want to ask these questions regarding this work: