Hi @NielsRogge, running:
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_4bit=True, device_map={"": 0})
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())
gives me the correct output:
</s>a woman sitting on the beach with a dog
That is on transformers main + latest bitsandbytes. Can you try to run that script and let me know what you get?
Hello @younesbelkada, I ran into a similar problem. I tried your code and it outputs the same text, but when I try VQA it again outputs nothing. I've only added one question to the processor, as below.
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
cache_dir = "/p/yufeng/.cache"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b",cache_dir=cache_dir)
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", cache_dir=cache_dir,
load_in_4bit=True, device_map={"": 0})
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "What is in the picture?"
inputs = processor(raw_image, text=question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())
Could you take a look at this problem? I truly appreciate it.
Yes, sorry, I linked the wrong code snippet. You get an empty response when passing a text:
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())
And if I print the output,
out = model.generate(**inputs)
print(out)
I always get the same tokens:
tensor([[ 2, 50118]], device='cuda:0')
no matter what image or text I input, so it seems the model can't process the inputs.
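For what it's worth, you can decode those two ids one by one to confirm the generation really is empty. This is a quick check reusing the processor from the snippet above; id 2 is OPT's </s> token and 50118 appears to decode to a newline:
# Inspect the generated ids individually; reuses `processor` from the snippet above.
for token_id in [2, 50118]:
    print(token_id, repr(processor.tokenizer.decode([token_id])))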
Hi @NielsRogge @chrisgao99 I just ran:
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map={"": 0})
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=False).strip())
and got:
</s>a woman sitting on the beach with a dog
This is with the latest bnb; I am using an NVIDIA A100 GPU.
@younesbelkada yes, that's because you're not passing a text to the processor, hence no text is being passed to the model. The bug only happens when passing an image + text.
Hi all,
I can reproduce this too. But if I start making changes to the text or the generation strategy, I do start to get other results:
- Default: question = "how many dogs are in the picture?", output: [2, 50118] = ""
- Prompt change: question = "how many dogs are in the picture? answer:", output: [2, 112, 50118] = " 1"
- min_length=10: question = "how many dogs are in the picture?", output: [2, 111, 2335, 1058, 50118] = " - dog training"
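For reference, the min_length variant above was run roughly like this (a sketch reusing processor, model and raw_image from the earlier snippets):
question = "how many dogs are in the picture?"
inputs = processor(raw_image, text=question, return_tensors="pt").to("cuda", torch.float16)
# forcing a minimum length pushes the model past the immediate newline token
out = model.generate(**inputs, min_length=10)
print(out)
print(processor.decode(out[0], skip_special_tokens=True).strip())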
Hi all, I think the prompt template for VQA is as below:
question = "Question: how many dogs are in the picture? Answer:"
See the current documentation here.
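In practice this only changes the prompt string in the snippets above, for example (sketch):
question = "Question: how many dogs are in the picture? Answer:"
inputs = processor(raw_image, text=question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())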
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Gently pinging @younesbelkada here
@NielsRogge @younesbelkada @ecekt I also got an empty response using fp16 at first. Using the template, it works fine though. The model also seems to be sensitive to capitalization in the template:
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map=0)

# load the same demo image as in the snippets above
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

def ask_question(prompt):
    inputs = processor(raw_image, prompt, return_tensors="pt").to("cuda:0", torch.float16)
    out = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(out[0], skip_special_tokens=True).strip()
ask_question("Is there a woman in this picture?") -> ''
ask_question("Question: Is there a woman in this picture? Answer:") -> 'Yes, there is a woman in this picture.'
ask_question("Question: Is there a woman in this picture? **a**nswer:") -> 'no, there is no woman in this picture'
Hi everyone, as pointed out by @tlpss & @ecekt, I don't think there is an issue here. I didn't flag any regression between transformers versions; I was able to reproduce the empty string issue across transformers == 4.30.0 and 4.41.0. Make sure to follow the correct VQA format when prompting BLIP2 for visual question answering.
Closing the issue!
System Info
Transformers v4.40.dev
Who can help?
@younesbelkada
Reproduction
As reported here: https://huggingface.co/Salesforce/blip2-opt-2.7b/discussions/26, the 4 and 8 bit versions of BLIP-2 return an empty string (or only special tokens) when decoding.
Here's how to reproduce:
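The reproduction appears to be the image + question snippet from the thread above (8-bit load, with a plain question rather than the "Question: ... Answer:" template):
# pip install accelerate bitsandbytes
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", load_in_8bit=True, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

question = "how many dogs are in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True).strip())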
Expected behavior
Should return an answer similar to the full/half-precision model.