salesforce / BLIP

PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
BSD 3-Clause "New" or "Revised" License

Get inconsistent result from HuggingFace BLIP and local inference #57

Closed MikeMACintosh closed 2 years ago

MikeMACintosh commented 2 years ago

Hi, thanks for your cool model :) I have run into a problem: when I run inference on HuggingFace I get good results, but when I test BLIP on my local machine I get poor ones. For example, for this image on HuggingFace (https://huggingface.co/spaces/Salesforce/BLIP) I get: 'caption: an orange and white fire hydrant on a field' and 'caption: a red and white fire hydrant sitting in the grass' for nucleus sampling and beam search respectively.

But local inference gave me:

Beam search, ViT-base: "A gray and white background with circles."
Nucleus sampling, ViT-base: "The gray background with circles is shown in this image."
Beam search, ViT-large: "An image of a group of pyramids on a gray background."
Nucleus sampling, ViT-large: "A group of white pyramids in grey and black."

And this happens many times; this is just one example. Maybe my image preprocessing is incorrect and I am doing something wrong. Here is a code example of how I process the image and get the caption:

import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip import blip_decoder

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_url_base = 'path_to_base_model'
model_url_large = 'path_to_large_model'
model_base = blip_decoder(pretrained=model_url_base, vit='base', image_size=100)
model_large = blip_decoder(pretrained=model_url_large, vit='large', image_size=100)

model_base.eval()
model_base = model_base.to(device)
model_large.eval()
model_large = model_large.to(device)

image_size = 100
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
])
img = transform(sample).unsqueeze(0).to(device)  # sample is a np.ndarray image

caption_bs_base = model_base.generate(img, sample=False, num_beams=7, max_length=16, min_length=5)  # beam search
caption_ns_base = model_base.generate(img, sample=True, max_length=16, min_length=5)  # nucleus sampling
caption_bs_large = model_large.generate(img, sample=False, num_beams=7, max_length=16, min_length=5)  # beam search
caption_ns_large = model_large.generate(img, sample=True, max_length=16, min_length=5)  # nucleus sampling

I played around with InterpolationMode and also tried skipping transforms.Normalize, but it did not help. Can you help me, please? I am totally confused.

woctezuma commented 2 years ago

You can see the code of the web demo at https://huggingface.co/spaces/Salesforce/BLIP/blob/main/app.py

The code is basically:

import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip import blip_decoder

image_size = 384
transform = transforms.Compose(
    [
        transforms.Resize(
            (image_size, image_size), interpolation=InterpolationMode.BICUBIC
        ),
        transforms.ToTensor(),
        transforms.Normalize(
            (0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)
        ),
    ]
)

model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = blip_decoder(pretrained=model_url, image_size=384, vit="large")
model.eval()
model = model.to(device)

def inference(raw_image):
    image = transform(raw_image).unsqueeze(0).to(device)
    with torch.no_grad():
        caption = model.generate(
            image, sample=True, top_p=0.9, max_length=20, min_length=5
        )
        return "caption: " + caption[0]
MikeMACintosh commented 2 years ago

@woctezuma Thanks, that's pretty much the same as my code, but I want to run inference on small images (around 100-150 px); when I set image_size=384 everything works fine. I would like to finally understand how the image size should be chosen. If I understand correctly, do we pick image_size according to the images we want to caption, or is it a ViT hyperparameter that depends on ViT base/large? I didn't find an answer to my question in the source code.

woctezuma commented 2 years ago

I imagine the BLIP decoder was trained with images of size 384.

https://github.com/salesforce/BLIP/blob/a176f1e9cc5a232d2cc6e21b77d2c7e18ceb3c37/models/blip.py#L78-L86

See the two configs mentioned in the README.

Image-Text Captioning

Download COCO and NoCaps datasets from the original websites, and set image_root in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.

https://github.com/salesforce/BLIP/blob/21aaf5d67bda30412e7a3060ca79a652491e0575/configs/caption_coco.yaml#L21

https://github.com/salesforce/BLIP/blob/21aaf5d67bda30412e7a3060ca79a652491e0575/configs/nocaps.yaml#L10
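
So, as far as I can tell, the fix is to keep image_size=384 (matching the pre-trained checkpoint) and let the Resize transform upscale your small 100-150 px images, rather than shrinking the model input to 100. A minimal sketch, reusing transform, model and device from the demo code above (the file name is a placeholder):

import numpy as np
from PIL import Image

sample = np.asarray(Image.open("small_image.jpg").convert("RGB"))  # e.g. a 100x150 px image

# keep image_size=384: the captioning checkpoint was trained at 384x384,
# so the transform upscales the small image instead of the model running at 100
img = transform(Image.fromarray(sample)).unsqueeze(0).to(device)

with torch.no_grad():
    caption = model.generate(img, sample=False, num_beams=3, max_length=20, min_length=5)
print("caption: " + caption[0])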

LiJunnan1992 commented 2 years ago

Thanks @woctezuma! Yes, the decoder is trained with image_size=384.