sail-sg / ptp

[CVPR2023] The code for 《Position-guided Text Prompt for Vision-Language Pre-training》
https://arxiv.org/abs/2212.09737
Apache License 2.0

Utilize the positional information from PTP #2

Closed: itruonghai closed this issue 1 year ago

itruonghai commented 1 year ago

Hi @FingerRec, @panzhous

Thank you for your great work. This work can have a significant impact on the VLP field. I would like to ask a few questions regarding this work:

  1. Given this motivation image and a caption (e.g. "There is a dog on the left"), can your model localize the dog's position or predict the dog's mask?
  2. Given the dog's mask (the top-right image) and a caption (e.g. "There is a dog on the left"), can we calculate the cosine similarity between that mask and the caption (the score should be high if the caption refers to that mask, and low otherwise)?
  3. I tried to use the model to calculate the similarity between an image and a text, but the results are not as good as I expected. I do not know whether I did anything wrong. You can check the code here. Here is the result when I run the model with an image of an elephant.
FingerRec commented 1 year ago

Hi itruonghai, thanks for your interest in our work and sorry for the late reply.

Q1: Exactly. We can first calculate the similarity of the text token against all vision tokens (e.g. 196), then select the top-k highest-valued matched pairs. I provide an example in Visualization.
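
A minimal sketch of that idea, assuming the text-token embedding and the 196 vision-token embeddings have already been extracted from the model (the tensor names and the 14x14 grid mapping below are illustrative placeholders, not part of the repo's API):

import torch
import torch.nn.functional as F

def localize_text_token(text_token_embed, vision_token_embeds, top_k=10, grid_size=14):
    # text_token_embed:    (D,) embedding of e.g. the "dog" token (assumed precomputed)
    # vision_token_embeds: (N, D) embeddings of the N = 196 image patches (assumed precomputed)
    # Cosine similarity of the text token against every vision token.
    sims = F.cosine_similarity(text_token_embed.unsqueeze(0), vision_token_embeds, dim=-1)  # (N,)
    top_vals, top_idx = sims.topk(top_k)
    # For a 224x224 image with 16x16 patches, the 196 tokens form a 14x14 grid,
    # so each token index maps to a coarse (row, col) position in the image.
    rows, cols = top_idx // grid_size, top_idx % grid_size
    return top_vals, list(zip(rows.tolist(), cols.tolist()))

Keeping the top-k patches (or thresholding their similarities) then gives a coarse localization of the queried object.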

Q2: Yes. You need to implement an average pooling layer to aggregate the tokens selected by the mask. For more details, follow Q1.
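
A minimal sketch of that masked average pooling, again assuming the patch embeddings, a patch-level boolean mask, and a pooled caption embedding are already available (all names below are placeholders):

import torch
import torch.nn.functional as F

def masked_region_score(vision_token_embeds, patch_mask, caption_embed):
    # vision_token_embeds: (N, D) patch embeddings (N = 196 for a 224x224 image with 16x16 patches)
    # patch_mask:          (N,) boolean tensor, True for patches covered by the object mask
    # caption_embed:       (D,) pooled text embedding of the caption
    # Average pool only the patch embeddings that fall inside the mask.
    region_embed = vision_token_embeds[patch_mask].mean(dim=0)  # (D,)
    # Cosine similarity between the pooled region and the caption embedding.
    return F.cosine_similarity(region_embed, caption_embed, dim=0)

The score should come out higher when the caption actually refers to the masked region than when it does not.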

Q3: Notice that a softmax should be applied after the logit values to normalise over multiple text captions, e.g. probs = logits_per_image.softmax(dim=-1).cpu().numpy(). Follow the example in CLIP for more details.
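
A minimal sketch of that normalisation step, with a placeholder logits tensor standing in for the model output over several candidate captions:

import torch

# Similarity logits for one image against three candidate captions (placeholder values).
logits_per_image = torch.tensor([[24.0, 19.5, 17.2]])
# Softmax turns the raw logits into probabilities over the candidate captions.
probs = logits_per_image.softmax(dim=-1).cpu().numpy()
print(probs)  # probabilities sum to 1 across the captions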

itruonghai commented 1 year ago

Thank you @FingerRec for answering my questions. I'm still working on Q1 and Q2. However, regarding Q3, I just want to check whether the text and image embeddings are good enough or not, so I think I do not need to use the softmax as you said above. Could you please check it again (Q3), or provide an example of how to use it? Thank you.

FingerRec commented 1 year ago

Hi itruonghai, computing the similarity of an image-text pair is quite simple, as below:

import torch
from models.blip_itm import blip_itm


def process_one_sample(model, image, caption):
    # Compute the image-text matching (ITM) and image-text contrastive (ITC) scores.
    with torch.no_grad():
        itm_output = model(image, caption, match_head='itm')
        # The ITM head outputs two-way logits; keep the probability of the "matched" class.
        itm_score = torch.nn.functional.softmax(itm_output, dim=1)[:, 1]
        itc_score = model(image, caption, match_head='itc')
    return itm_score, itc_score


# model_url, image_size, device, img_path, transform, caption and load_demo_image
# are assumed to be defined beforehand (e.g. following the BLIP demo setup).
model = blip_itm(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device=device)

image = load_demo_image(img_path, transform)
itm_score, itc_score = process_one_sample(model, image, caption)
# print(itm_score, itc_score)
itm_score = str(itm_score.detach().cpu().numpy()[0])     # tensor to str
itc_score = str(itc_score.detach().cpu().numpy()[0][0])  # tensor to str
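
For reference: the returned itm_score is the softmax probability that the image and caption match (the ITM head is a two-way matched/not-matched classifier), while itc_score is the raw image-text contrastive similarity, which is mainly useful for ranking several candidate captions or images against each other rather than as an absolute score.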