salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

[itc vs itm] which is a better way to implement vector based semantic search for images? #245

Open asrlhhh opened 1 year ago

asrlhhh commented 1 year ago

In the blip2_feature_extractor notebook, I observed that the model is capable of embedding both textual and visual content into normalized vectors, enabling the comparison of cosine similarity between text and image vectors. Utilizing this functionality, I implemented a basic semantic search for text queries within a small image library using the itc method. While the search yields accurate results in certain instances, there are times when the outcomes are way off, particularly for basic queries such as "human" and "face."

I am considering the possibility of using the itm method as an alternative to itc. However, I am aware that itm may involve more computationally intensive deep learning operations at runtime for each (text, image) pair. I am curious to know if itm is a more appropriate approach for achieving the desired accuracy in semantic search results.
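
For reference, my ITC-based search follows the feature-extractor notebook roughly like the sketch below (a minimal sketch: the image paths and the query are placeholders, while the extract_features / image_embeds_proj / text_embeds_proj usage is taken from the notebook):

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# BLIP-2 feature extractor from the notebook (projects images and text into a shared space).
model, vis_processors, txt_processors = load_model_and_preprocess(
    "blip2_feature_extractor", "pretrain", device=device, is_eval=True
)

# Embed the text query once; text_embeds_proj[:, 0, :] is the normalized CLS embedding.
query = txt_processors["eval"]("human")
text_feat = model.extract_features({"text_input": [query]}, mode="text").text_embeds_proj[:, 0, :]

# Score each image by the max cosine similarity over its 32 query-token embeddings (ITC).
scores = []
for path in ["img_0.jpeg", "img_1.jpeg"]:  # placeholder paths for the image library
    image = vis_processors["eval"](Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    img_feat = model.extract_features({"image": image}, mode="image").image_embeds_proj[0]
    scores.append((path, (img_feat @ text_feat.t()).max().item()))

# Rank images by ITC score, highest first.
for path, score in sorted(scores, key=lambda x: x[1], reverse=True):
    print(f"{score:.4f}  {path}")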

The search results (the first two are correct, but the last two are off): [four result images attached]

dxli94 commented 1 year ago

You are correct that ITC is more efficient.

One option is to combine ITC and ITM - first select a pool of top image candidates using ITC and only do ITM within the pool.

That said, I'd expect ITC to work well for general-purpose queries. You may want to examine your search implementation before going further.
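
Something along these lines could work (a rough, untested sketch; it assumes model, vis_processors and txt are the blip2_image_text_matching model, its image processors and the processed query, raw_images is a list of (path, PIL image) pairs, and k is an arbitrary pool size):

import torch

def itc_then_itm_search(model, vis_processors, txt, raw_images, device, k=20):
    """Two-stage search: cheap ITC pre-filter, then ITM re-rank of the top-k pool."""
    with torch.no_grad():
        # Stage 1: score every image with the ITC head (cosine similarity).
        itc_ranked = []
        for path, pil_img in raw_images:
            img = vis_processors["eval"](pil_img).unsqueeze(0).to(device)
            itc = model({"image": img, "text_input": txt}, match_head="itc")
            itc_ranked.append((path, pil_img, itc.item()))
        pool = sorted(itc_ranked, key=lambda x: x[2], reverse=True)[:k]

        # Stage 2: re-rank only the candidate pool with the heavier ITM head.
        reranked = []
        for path, pil_img, _ in pool:
            img = vis_processors["eval"](pil_img).unsqueeze(0).to(device)
            itm_logits = model({"image": img, "text_input": txt}, match_head="itm")
            p_match = torch.nn.functional.softmax(itm_logits, dim=1)[:, 1].item()
            reranked.append((path, p_match))

    return sorted(reranked, key=lambda x: x[1], reverse=True)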

asrlhhh commented 1 year ago

Thanks for the feedback and suggestions. I ran a local test on my sample images using a very simple script adapted from the official notebook, shown below:

import torch
from PIL import Image
import os
from lavis.models import load_model_and_preprocess

test_folder = "test_folder"

# collect all .jpeg image paths in the test folder
images = []
for file in os.listdir(test_folder):
    if file.endswith(".jpeg"):
        images.append(os.path.join(test_folder, file))

# keep (path, PIL image) pairs for each file
raw_images = [(path, Image.open(path).convert("RGB")) for path in images]
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

caption = "face"
model, vis_processors, text_processors = load_model_and_preprocess("blip2_image_text_matching", "pretrain", device=device, is_eval=True)
txt = text_processors["eval"](caption)

for path, pil_img in raw_images:
    print(path)
    img = vis_processors["eval"](pil_img).unsqueeze(0).to(device)
    # ITM head: binary match/no-match logits, converted to a match probability
    itm_output = model({"image": img, "text_input": txt}, match_head="itm")
    itm_scores = torch.nn.functional.softmax(itm_output, dim=1)
    print(f'ITM: The image and text are matched with a probability of {itm_scores[:, 1].item():.3%}')
    # ITC head: cosine similarity between the projected image and text features
    itc_score = model({"image": img, "text_input": txt}, match_head='itc')
    print('ITC: The image feature and text feature have a cosine similarity of %.4f' % itc_score)

The result is that, whether I use ITC or ITM, for simple queries like "face" the flower or building images always get the highest scores. Is this expected? The link to the test_folder is here:

https://drive.google.com/drive/folders/1PjWVpS2rKQLF9Uzp8td6b_dFUppFHkQ5?usp=sharing