mertyg / vision-language-models-are-bows

Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" Oral @ ICLR 2023
MIT License

about the performance of original CLIP #32

Closed hiker-lw closed 5 months ago

hiker-lw commented 8 months ago

hello, thanks for your great work! I would like to know whether the authors have tested the performance of loading the original OpenAI CLIP weights through the open_clip code. I have noticed that this performance is inconsistent with the original OpenAI CLIP code, especially for ARO Relation (ARO Relation 0.66, ARO Attribute 0.65).

vinid commented 8 months ago

Hello @hiker-lw!

Thanks for your interest! Would you be able to share your code? Might be easier for us to take a look at that!

hiker-lw commented 8 months ago

Thanks for your swift reply! I ran the test before training NegCLIP: I used open_clip's create_model_and_transforms function to load OpenAI's pretrained weights and then ran the standard evaluation with open_clip's code, which was hardly modified. By the fifth training epoch, NegCLIP's performance was very close to the numbers reported in your paper. Could you please check whether you see the same behavior on your end? Thanks very much!

vinid commented 8 months ago

Our code loads the original CLIP model, so I'd have guessed the performance would be similar. I have no clue why loading the model with open_clip's API might change the performance.

If you can share your code (the end-to-end pipeline from loading to results), I can take a look and run it.

hiker-lw commented 8 months ago

Thanks for your reply! Here is a snippet of my testing code.

import torch
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import os
import json
import logging
import numpy as np
from tqdm import tqdm
import pandas as pd
from PIL import Image
from easydict import EasyDict as edict

import open_clip

def evaluate_aro_attribute(model_path, pretrained_path, device):
    model, _, image_preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained=pretrained_path, device=device)
    model = model.eval()
    model = CLIPWrapper(model, device=device)

    # evaluate on ARO Attribute
    dataset_dir = "../datasets/ARO_Relation_dataset"
    vga_dataset = VG_Attribution(image_preprocess=image_preprocess, root_dir=dataset_dir)
    vga_loader = DataLoader(vga_dataset, batch_size=1024, shuffle=False)
    vga_scores = model.get_retrieval_scores_batched(vga_loader)
    vga_records = vga_dataset.evaluate_scores(vga_scores)
    df = pd.DataFrame(vga_records)
    print(df)
    df = df.round({'Accuracy': 4})
    print(f"VG-Attribution Macro Accuracy: {df.Accuracy.mean():.3f}")

def evaluate_aro_relation(model_path, pretrained_path, device):
    model, _, image_preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained=pretrained_path, device=device)
    model = model.eval()
    model = CLIPWrapper(model, device=device)

    # evaluate on ARO Relation
    dataset_dir = "../datasets/ARO_Relation_dataset"
    vgr_dataset = VG_Relation(image_preprocess=image_preprocess, root_dir=dataset_dir)
    vgr_loader = DataLoader(vgr_dataset, batch_size=1024, shuffle=False)
    vgr_scores = model.get_retrieval_scores_batched(vgr_loader)
    vgr_records = vgr_dataset.evaluate_scores(vgr_scores)
    symmetric = ['leaning against','pulled by','pulling','adjusting', 'attached to', 'between', 'bigger than', 'biting', 'boarding', 'brushing', 'chewing', 'cleaning', 'climbing', 'close to', 'coming from', 'coming out of', 'contain', 'crossing', 'dragging', 'draped over', 'drinking', 'drinking from', 'driving', 'driving down', 'driving on', 'eating from', 'eating in', 'enclosing', 'exiting', 'facing', 'filled with', 'floating in', 'floating on', 'flying', 'flying above', 'flying in', 'flying over', 'flying through', 'full of', 'going down', 'going into', 'going through', 'grazing in', 'growing in', 'growing on', 'guiding', 'hanging from', 'hanging in', 'hanging off', 'hanging over', 'higher than', 'holding onto', 'hugging', 'in between', 'jumping off', 'jumping on', 'jumping over', 'kept in', 'larger than', 'leading', 'leaning over', 'leaving', 'licking', 'longer than', 'looking in', 'looking into', 'looking out', 'looking over', 'looking through', 'lying next to', 'lying on top of', 'making', 'mixed with', 'mounted on', 'moving', 'on the back of', 'on the edge of', 'on the front of', 'on the other side of', 'opening', 'painted on', 'parked at', 'parked beside', 'parked by', 'parked in', 'parked in front of', 'parked near', 'parked next to', 'perched on', 'petting', 'piled on', 'playing', 'playing in', 'playing on', 'playing with', 'pouring', 'reaching for', 'reading', 'reflected on', 'riding on', 'running in', 'running on', 'running through', 'seen through', 'sitting behind', 'sitting beside', 'sitting by', 'sitting in front of', 'sitting near', 'sitting next to', 'sitting under', 'skiing down', 'skiing on', 'sleeping in', 'sleeping on', 'smiling at', 'sniffing', 'splashing', 'sprinkled on', 'stacked on', 'standing against', 'standing around', 'standing behind', 'standing beside', 'standing in front of', 'standing near', 'standing next to', 'staring at', 'stuck in', 'surrounding', 'swimming in', 'swinging', 'talking to', 'topped with', 'touching', 'traveling down', 'traveling on', 'tying', 'typing on', 'underneath', 'wading in', 'waiting for', 'walking across', 'walking by', 'walking down', 'walking next to', 'walking through', 'working in', 'working on', 'worn on', 'wrapped around', 'wrapped in', 'by', 'of', 'near', 'next to', 'with', 'beside', 'on the side of', 'around']
    df = pd.DataFrame(vgr_records)
    df = df[~df.Relation.isin(symmetric)]
    df = df.round({'Accuracy': 4})
    print(f"VG-Relation Macro Accuracy: {df.Accuracy.mean():.3f}")

class CLIPWrapper:
    def __init__(self, model, device):
        self.model = model
        self.device = device

    @torch.no_grad()
    def get_retrieval_scores_batched(self, joint_loader):
        """Computes the scores for each image_option / caption_option pair in the joint loader.

        Args:
            joint_loader (DataLoader): batches have "image_options" and "caption_options" fields.
            "image_options" is a list of images, and "caption_options" is a list of captions.

        Returns:
            all_scores: A numpy array containing the scores of the shape NxKxL,
            where N is the number of test cases, K is the number of image options per the test case,
            and L is the number of caption options per the test case.
        """
        scores = []
        tqdm_loader = tqdm(joint_loader)
        tqdm_loader.set_description("Computing retrieval scores")
        for batch in tqdm_loader:
            image_options = []
            for i_option in batch["image_options"]:
                image_embeddings = self.model.encode_image(i_option.to(self.device)).cpu().numpy() # B x D
                image_embeddings = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True) # B x D
                image_options.append(np.expand_dims(image_embeddings, axis=1))

            caption_options = []
            for c_option in batch["caption_options"]:
                caption_tokenized = torch.cat([open_clip.tokenize(c) for c in c_option])
                caption_embeddings = self.model.encode_text(caption_tokenized.to(self.device)).cpu().numpy() # B x D
                caption_embeddings = caption_embeddings / np.linalg.norm(caption_embeddings, axis=1, keepdims=True) # B x D
                caption_options.append(np.expand_dims(caption_embeddings, axis=1))

            image_options = np.concatenate(image_options, axis=1) # B x K x D
            caption_options = np.concatenate(caption_options, axis=1) # B x L x D
            batch_scores = np.einsum("nkd,nld->nkl", image_options, caption_options) # B x K x L
            scores.append(batch_scores)

        all_scores = np.concatenate(scores, axis=0) # N x K x L
        return all_scores

class VG_Attribution(Dataset):
    def __init__(self, image_preprocess, text_perturb_fn=None, image_perturb_fn=None, root_dir="", download=False):
        '''
        image_preprocess: a function that takes in a PIL image and returns a tensor.
        text_perturb_fn: Not used for this dataset. Just for compatibility with other datasets.
        image_perturb_fn: Not used for this dataset. Just for compatibility with other datasets.
        root_dir: Directory for the VG-A dataset.
        '''
        self.root_dir = root_dir
        annotation_file = os.path.join(root_dir, "visual_genome_attribution.json")
        image_dir = os.path.join(root_dir, "images")

        with open(annotation_file, "r") as f:
            self.dataset = json.load(f)

        for item in self.dataset:
            item["image_path"] = os.path.join(image_dir, item["image_path"])

        # Set of attributes in each test case
        self.all_attributes = [f"{item['attributes'][0]}_{item['attributes'][1]}" for item in self.dataset]
        self.image_preprocess = image_preprocess

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        test_case = self.dataset[index]
        image = Image.open(test_case["image_path"]).convert('RGB')
        # Get the bounding box that contains the relation. This is to remove the irrelevant details in the scene.
        image = image.crop((test_case["bbox_x"], test_case["bbox_y"], test_case["bbox_x"] + test_case["bbox_w"], test_case["bbox_y"] + test_case["bbox_h"]))

        if self.image_preprocess is not None:
            image = self.image_preprocess(image)

        # Each test case has a correct and incorrect caption.
        true_caption = test_case["true_caption"]
        false_caption = test_case["false_caption"]
        item = edict({"image_options": [image], "caption_options": [false_caption, true_caption]})
        # item = edict({"image_options": [image], "caption_options": [f"a photo of {test_case['attributes'][1]} {test_case['obj1_name']}", f"a photo of {test_case['attributes'][0]} {test_case['obj2_name']}", false_caption, \
        #                                                                 f"a photo of {test_case['attributes'][0]} {test_case['obj1_name']}", f"a photo of {test_case['attributes'][1]} {test_case['obj2_name']}", true_caption]})
        return item

    def evaluate_scores(self, scores):
        """
        Scores: N x 1 x 2, i.e. first caption is the perturbed one, second is the positive one
        """
        if isinstance(scores, tuple):
            scores_i2t = scores[1]
            scores_t2i = scores[0] 
        else:
            scores_t2i = scores
            scores_i2t = scores

        preds = np.argmax(np.squeeze(scores_i2t, axis=1), axis=-1)
        correct_mask = (preds == 1)
        result_records = []
        all_attributes = np.array(self.all_attributes)
        for attr in np.unique(all_attributes):
            attr_mask = (all_attributes == attr)
            if attr_mask.sum() < 25:
                continue
            result_records.append({
                "Attributes": attr,
                "Accuracy": correct_mask[attr_mask].mean(),
                "Count": attr_mask.sum(),
                "Dataset": "Visual Genome Attribution"
            })
        return result_records

class VG_Relation(Dataset):
    def __init__(self, image_preprocess, root_dir):
        '''
        image_preprocess: a function that takes in a PIL image and returns a tensor.
        root_dir: Directory for the VG-R dataset.
        '''
        self.root_dir = root_dir
        annotation_file = os.path.join(root_dir, "visual_genome_relation.json")
        image_dir = os.path.join(root_dir, "images")

        with open(annotation_file, "r") as f:
            self.dataset = json.load(f)

        self.all_relations = list()
        self.all_image_path =list()
        self.all_true_caption =list()
        self.all_false_caption =list()
        for item in self.dataset:
            item["image_path"] = os.path.join(image_dir, item["image_path"])
            self.all_relations.append(item["relation_name"])
            self.all_image_path.append(item["image_path"])
            self.all_true_caption.append(item["true_caption"])
            self.all_false_caption.append(item["false_caption"])

        self.image_preprocess = image_preprocess

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        test_case = self.dataset[index]
        image = Image.open(test_case["image_path"]).convert('RGB')
        # Get the bounding box that contains the relation. This is to remove the irrelevant details in the scene.
        image = image.crop((test_case["bbox_x"], test_case["bbox_y"], test_case["bbox_x"] + test_case["bbox_w"], test_case["bbox_y"] + test_case["bbox_h"]))

        if self.image_preprocess is not None:
            image = self.image_preprocess(image)

        # Each test case has a correct and incorrect caption.
        true_caption = test_case["true_caption"]
        false_caption = test_case["false_caption"]
        item = edict({"image_options": [image], "caption_options": [false_caption, true_caption]})
        return item

    def evaluate_scores(self, scores):
        """
        Scores: N x 1 x 2, i.e. first caption is the perturbed one, second is the positive one
        """
        if isinstance(scores, tuple):
            scores_i2t = scores[1]
            scores_t2i = scores[0] 
        else:
            scores_t2i = scores
            scores_i2t = scores

        metrics = {"Accuracy": None}
        preds = np.argmax(np.squeeze(scores_i2t, axis=1), axis=-1)
        correct_mask = (preds == 1)
        metrics["Accuracy"] = np.mean(correct_mask)

        all_relations = np.array(self.all_relations)

        result_records = []
        # Log the accuracy of all relations
        for relation in np.unique(all_relations):
            relation_mask = (all_relations == relation)
            if relation_mask.sum() == 0:
                continue
            result_records.append({
                "Relation": relation,
                "Accuracy": correct_mask[relation_mask].mean(),
                "Count": relation_mask.sum(),
                "Dataset": "Visual Genome Relation"
            })
        return result_records

if __name__ == '__main__':
    model_path = "ViT-B/32"
    device = "cuda" if torch.cuda.is_available() else "cpu"
    pretrained_path = "openai"
    model, _, image_preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained=pretrained_path, device=device)
    CLIP_part_state_dict = {"state_dict": model.state_dict()}
    torch.save(CLIP_part_state_dict, "./clip_loading_using_open_clip_api.pt")
    clip_pretrained_path = "./clip_loading_using_open_clip_api.pt"
    evaluate_aro_relation(model_path, clip_pretrained_path, device)
    evaluate_aro_attribute(model_path, clip_pretrained_path, device)

vinid commented 8 months ago

How are you computing the final scores?

hiker-lw commented 8 months ago

hello, the final scores are computed by the evaluate_scores function, following the code you released.

vinid commented 8 months ago

But that function is returning a list of results, right? How did you get the 0.66?

hiker-lw commented 8 months ago

Thanks for the reply! That function does indeed return a list of results; the following code in the evaluate_aro_relation function then computes the final number.

vga_records = vga_dataset.evaluate_scores(vga_scores)
df = pd.DataFrame(vga_records)
print(df)
df = df.round({'Accuracy': 4})
print(f"VG-Attribution Macro Accuracy: {df.Accuracy.mean():.3f}")

vinid commented 8 months ago

Thanks! I will look and hopefully let you know if I find anything weird!

hiker-lw commented 8 months ago

Oh, thank you sincerely for your kind help!

ytaek-oh commented 7 months ago

Hello, @hiker-lw

The parameter precisions of the two models are different. The original OpenAI CLIP is fp16 by default, while open_clip models are fp32. When the floating-point precisions are matched, I get identical results.

For example, in the VG-Relation case:

# ViT-B-32 initialized from open-clip (precision: fp_16)
macro_acc: 0.5889015395145901, micro_acc: 0.5063708902535823

# ViT-B/32 initialized from the original OpenAI CLIP (precision: fp_16) 
macro_acc: 0.5889015395145901, micro_acc: 0.5063708902535823

# ViT-B-32 initialized from open-clip (precision: **fp_32**)
macro_acc: 0.5969155559277846, micro_acc: 0.511216944479258

When initializing CLIP from the open_clip library, you can pass precision="fp16" to open_clip.create_model_and_transforms. In that case, you have to manually cast the input image tensors to torch.float16, as I found that open_clip does not do this for you.
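
For reference, here is a minimal sketch of that setup (assuming a CUDA device; the random batch is only a stand-in for the preprocessed ARO images):

import torch
import open_clip

device = "cuda"  # fp16 inference assumes a GPU
# Load the OpenAI weights through open_clip in fp16, matching the original CLIP precision.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-B-32', pretrained='openai', device=device, precision='fp16'
)
model.eval()

# open_clip casts the text inputs internally, but image tensors must be cast by hand.
images = torch.randn(4, 3, 224, 224, device=device).to(torch.float16)  # stand-in batch
with torch.no_grad():
    feats = model.encode_image(images)
print(feats.dtype)  # torch.float16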

As a note, I calculated the accuracy for VG-Relation as discussed in issue #9. To summarize:

# Code snippet of evaluate_scores method in VG_Relation Dataset class 
for relation in np.unique(all_relations):
    relation_mask = (all_relations == relation)
    if relation_mask.sum() == 0:
        continue
    result_records.append(
        {
            "Relation": relation,
            "Accuracy": float(correct_mask[relation_mask].mean()),
            "Count": float(relation_mask.sum()),
            "Dataset": "Visual Genome Relation"
        }
    )

symmetric = ['adjusting', ...]

# macro average
df = pd.DataFrame(result_records)
df = df[df["Count"] > 9]  # why?
df = df[~df.Relation.isin(symmetric)]
accuracy = float(df.Accuracy.mean())
return {"macro_acc": accuracy, "micro_acc": np.mean(correct_mask)}, result_records

hiker-lw commented 7 months ago

oh, thank you for your clarification!

hiker-lw commented 7 months ago

hello, thanks for your helpful reply! Could you provide the modified code snippet for casting the tensors to torch.float16? I found that merely casting the image tensor to torch.float16 raises a dtype error during the self-attention computation.

hiker-lw commented 7 months ago

But when I load the original OpenAI CLIP ViT-B-32 weights into the open_clip code with the default precision (torch.float32), I get an ARO-Relation accuracy of 0.66. This result can be reproduced with the code snippet I posted earlier in this thread.

ytaek-oh commented 7 months ago

When the open_clip model is initialized with fp16, for example:

# ./model_zoo/__init__.py
MODELS = [
    # openai
    'openai:RN50',
    'openai:ViT-B-32',
   ...
]  # open_clip CLIP model names; {pretrained}:{arch_name}

def get_model(model_name, device, root_dir=CACHE_DIR, apply_fp16=False):
    if "openai-clip" in model_name:  # original open-ai CLIP
        root_dir = join_path(root_dir, "openai-clip")
        variant = model_name.split(":")[1]
        model, image_preprocess = clip.load(variant, device=device, download_root=root_dir)
        model = model.eval()
        clip_model = CLIPWrapper(model, device)
        return clip_model, image_preprocess

    elif model_name in MODELS:  # open_clip models
        tag, model_name = model_name.split(":")
        cache_dir = get_cache_dir(model_name, tag)
        logger.info(f"Loading {tag}:{model_name} model from {cache_dir}..")
        precision = "fp16" if apply_fp16 else "fp32"
        model, _, image_preprocess = open_clip.create_model_and_transforms(
            model_name=model_name,
            pretrained=tag,
            cache_dir=cache_dir,
            device=device,
            precision=precision
        )
        # tokenizer = open_clip.get_tokenizer(model_name)

        model = model.eval()
        return CLIPWrapper(model.eval(), device, precision=precision), image_preprocess

    elif "blip" in model_name:
        ...

I also modified the get_retrieval_scores_batched method of the CLIPWrapper class.

class CLIPWrapper:

    def __init__(self, model, device, precision=None):
        self.model = model
        self.device = device
        self.precision = precision  # <-- here

    @torch.no_grad()
    def get_retrieval_scores_batched(self, joint_loader):
        scores = []
        tqdm_loader = tqdm(joint_loader)
        tqdm_loader.set_description("Computing retrieval scores")
        for batch in tqdm_loader:
            image_options = []
            for i_option in batch["image_options"]:
                input_image = i_option.to(self.device)
                if self.precision == "fp16":
                    input_image = input_image.to(torch.float16)
                image_embeddings = self.model.encode_image(input_image).cpu().numpy()  # B x D
                image_embeddings = image_embeddings / np.linalg.norm(
                    image_embeddings, axis=1, keepdims=True
                )  # B x D
                image_options.append(np.expand_dims(image_embeddings, axis=1))
        ...

When you check the open_clip code, the cast_dtype variable matches the dtype of the input tensors to the dtype of the model parameters, but it only does this for the text inputs. So I manually convert the image tensor to torch.float16. That was all it took for me.
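
An equivalent and slightly more general way to do the cast is to match the image batch to whatever dtype the vision tower's parameters use (just a sketch, assuming the wrapped model exposes its vision tower as model.visual, as open_clip's CLIP does):

# Inside get_retrieval_scores_batched, instead of hard-coding torch.float16:
param_dtype = next(self.model.visual.parameters()).dtype  # torch.float16 or torch.float32
input_image = i_option.to(self.device, dtype=param_dtype)
image_embeddings = self.model.encode_image(input_image).cpu().numpy()  # B x D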

As additional info, my local environment runs in a Docker container based on nvidia/cuda:11.7.1-cudnn8-devel-ubuntu20.04, with Python 3.8.10 and pytorch==2.0.1. Also, the code for the ARO evaluation pipeline (i.e., the dataset classes and the eval code) is largely unchanged from the official implementation.

hiker-lw commented 7 months ago

Thanks for your swift reply! When I load ViT-B-32 directly with open_clip's API, the ARO-Relation accuracy is indeed around 0.59 whether I use fp16 or fp32 (so maybe it is not a precision problem?). But if I first save the model (using torch.save()) after loading the OpenAI weights through open_clip's API, and then load the saved checkpoint with open_clip's API, the accuracy is around 0.65. It is quite strange, so we are not sure which one reflects the actual performance of ViT-B-32. You can reproduce this with the code I posted earlier in this thread. Thanks again very much!
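
For what it's worth, here is a quick sketch that diffs the two loadings directly (it assumes the checkpoint path from the code above and runs on CPU); it should show whether the save/reload round trip itself changes any weights:

import torch
import open_clip

device = "cpu"
# Model loaded directly from the OpenAI weights via open_clip.
model_direct, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai', device=device)
# Model loaded from the checkpoint saved with torch.save in the snippet above.
ckpt_path = "./clip_loading_using_open_clip_api.pt"
model_reload, _, _ = open_clip.create_model_and_transforms('ViT-B-32', pretrained=ckpt_path, device=device)

sd_a, sd_b = model_direct.state_dict(), model_reload.state_dict()
assert sd_a.keys() == sd_b.keys()  # same parameter names expected
for name in sd_a:
    if not torch.equal(sd_a[name], sd_b[name]):
        print(name, sd_a[name].dtype, sd_b[name].dtype,
              (sd_a[name].float() - sd_b[name].float()).abs().max().item())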