Hi @NielsRogge, running:
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_4bit=True, device_map={"": 0})
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())
gives me the correct output:
</s>a woman sitting on the beach with a dog
That is on transformers main + latest bitsandbytes. Can you try to run that script and let me know what you get?
Hello @younesbelkada, I ran into a similar problem. I tried your code and it outputs the same text, but when I try VQA it again outputs nothing. I've only added one question to the processor, as below.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
cache_dir = "/p/yufeng/.cache"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b",cache_dir=cache_dir)
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", cache_dir=cache_dir,
load_in_4bit=True, device_map={"": 0})
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "What is in the picture?"
inputs = processor(raw_image, text=question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())
Could you take a look at this problem? I truly appreciate it.
Yes, sorry, I linked the wrong code snippet. You get an empty response when passing a text:
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
And if I print the output,
out = model.generate(**inputs)
print(out)
I always get the same tokens:
tensor([[ 2, 50118]], device='cuda:0')
no matter what image or text I input, so it seems the model can't process the inputs.
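For what it's worth, you can decode those two ids one by one to confirm the generation really is empty. This is a quick check reusing the processor from the snippet above; id 2 is OPT's </s> token and 50118 appears to decode to a newline:
# Inspect the generated ids individually; reuses `processor` from the snippet above.
for token_id in [2, 50118]:
    print(token_id, repr(processor.tokenizer.decode([token_id])))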
Hi @NielsRogge @chrisgao99 I just ran:
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map={"": 0})
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())
and got:
</s>a woman sitting on the beach with a dog
This is with the latest bnb; I am using an NVIDIA A100 GPU.
@younesbelkada yes, that's because you're not passing a text to the processor, hence no text is being passed to the model. The bug only happens when passing an image + text.
Hi all,
I can reproduce this too. But if I start making changes to the text or the generation strategy, I do start to get other results:
- Default: question = "how many dogs are in the picture?", output: [2, 50118] = ""
- Prompt change: question = "how many dogs are in the picture? answer:", output: [2, 112, 50118] = " 1"
- min_length=10: question = "how many dogs are in the picture?", output: [2, 111, 2335, 1058, 50118] = " - dog training"
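For reference, the min_length variant above was run roughly like this (a sketch reusing processor, model and raw_image from the earlier snippets):
question = "how many dogs are in the picture?"
inputs = processor(raw_image, text=question, return_tensors="pt").to("cuda", torch.float16)
# forcing a minimum length pushes the model past the immediate newline token
out = model.generate(**inputs, min_length=10)
print(out)
print(processor.decode(out[0], skip_special_tokens=True).strip())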
Hi all, I think the prompt template for VQA is as below:
question = "Question: how many dogs are in the picture? Answer:"
See the current documentation here.
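In practice this only changes the prompt string in the snippets above, for example (sketch):
question = "Question: how many dogs are in the picture? Answer:"
inputs = processor(raw_image, text=question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())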
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Gently pinging @younesbelkada here
@NielsRogge @younesbelkada @ecekt I also got an empty response using fp16 at first. Using the template, it works fine though. The model also seems to be sensitive to capitalization in the template:
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map=0)

# load the same demo image as in the snippets above
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

def ask_question(prompt):
    inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda:0", torch.float16)
    out = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(out[0], skip_special_tokens=True).strip()
ask_question("Is there a woman in this picture?") -> ''
ask_question("Question: Is there a woman in this picture? Answer:") -> 'Yes, there is a woman in this picture.'
ask_question("Question: Is there a woman in this picture? **a**nswer:") -> 'no, there is no woman in this picture'
Hi everyone, as pointed out by @tlpss & @ecekt, I don't think there is an issue here. I didn't flag any regression between transformers versions; I was able to reproduce the empty string issue across transformers == 4.30.0 and 4.41.0. Make sure to follow the correct VQA format when prompting BLIP2 for visual question answering.
Closing the issue!
System Info
Transformers v4.40.dev
Who can help?
@younesbelkada
Reproduction
As reported here: https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/26, the 4 and 8 bit versions of BLIP-2 return an empty string (or only special tokens) when decoding.
Here's how to reproduce:
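The reproduction appears to be the image + question snippet from the thread above (8-bit load, with a plain question rather than the "Question: ... Answer:" template):
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())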
Expected behavior
Should return an answer similar to the full/half-precision model.