You can see the code of the web demo at https://huggingface.co/spaces/Salesforce/BLIP/blob/main/app.py
The code is basically:
```python
import torch
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode

from models.blip import blip_decoder

image_size = 384
transform = transforms.Compose(
    [
        transforms.Resize(
            (image_size, image_size), interpolation=InterpolationMode.BICUBIC
        ),
        transforms.ToTensor(),
        transforms.Normalize(
            (0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)
        ),
    ]
)

model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = blip_decoder(pretrained=model_url, image_size=384, vit="large")
model.eval()
model = model.to(device)


def inference(raw_image):
    image = transform(raw_image).unsqueeze(0).to(device)
    with torch.no_grad():
        caption = model.generate(
            image, sample=True, top_p=0.9, max_length=20, min_length=5
        )
    return "caption: " + caption[0]
```
@woctezuma Thanks, that's pretty much the same as my code, but I want to run inference on small images (around 100-150 px), whereas with image_size=384 everything works fine. I would like to finally understand how image_size is meant to be chosen. If I understand correctly, do we pick image_size according to the images we want to caption, or is it a ViT hyperparameter that depends on the ViT base/large variant? I didn't find an answer to my question in the source code.
I imagine the BLIP decoder was trained with images of size 384.
See the two configs mentioned in the README.
> Image-Text Captioning
> Download COCO and NoCaps datasets from the original websites, and set image_root in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
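If you want to double-check the training resolution locally, you can read it from those configs; a small sketch, assuming the key is named image_size as in the BLIP repo configs:

```python
import yaml

# Read the captioning config and print the image size the model was trained/evaluated with.
# The key name "image_size" is an assumption based on the BLIP repo configs.
with open("configs/caption_coco.yaml") as f:
    config = yaml.safe_load(f)
print(config.get("image_size"))  # expected: 384
```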
Thanks @woctezuma! Yes, the decoder is trained with image_size=384.
Hi, thanks for your cool model :) I ran into a problem: when I run inference on HuggingFace I get good results, but when I test BLIP on my local machine I run into trouble. For example, for this image the HuggingFace demo (https://huggingface.co/spaces/Salesforce/BLIP) gives me 'caption: an orange and white fire hydrant on a field' and 'caption: a red and white fire hydrant sitting in the grass' for nucleus sampling and beam search respectively.
But local inference gives me:
- Beam search, ViT base: "A gray and white background with circles."
- Nucleus sampling, ViT base: "The gray background with circles is shown in this image."
- Beam search, ViT large: "An image of a group of pyramids on a gray background."
- Nucleus sampling, ViT large: "A group of white pyramids in grey and black."

And this happens over and over; this is just one example. Maybe my image preprocessing is not correct and I am doing something wrong. Here is a code example of how I process the image and get the caption:
```python
model_url_base = 'path_to_base_model'
model_url_large = 'path_to_large_model'

model_base = blip_decoder(pretrained=model_url_base, vit='base', image_size=100)
model_large = blip_decoder(pretrained=model_url_large, vit='large', image_size=100)

model_base.eval()
model_base = model_base.to(device)
model_large.eval()
model_large = model_large.to(device)

image_size = 100
transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
])
img = transform(sample).unsqueeze(0).to(device)  # sample is an np.ndarray

caption_bs_base = model_base.generate(img, sample=False, num_beams=7, max_length=16, min_length=5)  # beam search
caption_ns_base = model_base.generate(img, sample=True, max_length=16, min_length=5)  # nucleus sampling
caption_bs_large = model_large.generate(img, sample=False, num_beams=7, max_length=16, min_length=5)  # beam search
caption_ns_large = model_large.generate(img, sample=True, max_length=16, min_length=5)  # nucleus sampling
```
I played around with InterpolationMode and tried skipping transforms.Normalize, but it didn't help. Can you help me, please? I'm totally confused.
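Given the answers above (the released captioning checkpoints were trained with image_size=384), a likely fix is to keep 384 both in blip_decoder and in the transform, and let Resize upscale the small inputs. A minimal sketch of the adjusted snippet, with the model paths left as placeholders:

```python
image_size = 384  # match the resolution the captioning checkpoints were trained with

model_base = blip_decoder(pretrained=model_url_base, vit='base', image_size=image_size)
model_large = blip_decoder(pretrained=model_url_large, vit='large', image_size=image_size)

transform = transforms.Compose([
    transforms.ToPILImage(),  # sample is an np.ndarray
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))
])
img = transform(sample).unsqueeze(0).to(device)  # small images are simply upscaled to 384x384
```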