yuanze1024 opened this issue 1 year ago
Can I know your transformers version?
transformers==4.30.2. I know it is not the exact version you suggest, but the suggested one seems incompatible with some of the other environments I use...
@NielsRogge Could you provide some suggestions? Thanks!
Hi,
Thanks for reporting. I've re-run the conversion script (to convert the LAVIS checkpoints to the HF format), and I've noticed setting layer_norm_eps=1e-6 (instead of 1e-5 as it is now) results in logits that match up to a tolerance of 1e-4:
Running
git clone https://github.com/huggingface/transformers.git
cd transformers
python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-flan-t5-xl"
results in:
First values of original logits: tensor([[-41.5262, -4.4239, -8.9971],
[-47.3791, -5.8891, -1.7391]], device='cuda:2')
First values of HF logits: tensor([[-41.5262, -4.4239, -8.9971],
[-47.3791, -5.8891, -1.7392]], device='cuda:1')
Looks ok!
Generating a caption...
Original generation: ['marina bay sands, singapore']
HF generation: ['marina bay sands, singapore']
So this only passes if I update the layer_norm_eps of the vision encoder's config.
@yuanze1024 can you try updating model.config.vision_config.layer_norm_eps = 1e-6 after instantiating the model, but before generating?
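For reference, here is a minimal sketch of what I mean (using the xl checkpoint as an example; since the vision encoder's LayerNorm modules are built from the config when the model is instantiated, the loop that propagates the new eps to the modules themselves is my own addition, just to be safe):
import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
# update the config value
model.config.vision_config.layer_norm_eps = 1e-6
# also propagate it to the already-instantiated LayerNorm modules of the vision encoder
for module in model.vision_model.modules():
    if isinstance(module, torch.nn.LayerNorm):
        module.eps = 1e-6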
Update: I've also tested it using your image and the following prompts:
Original generation: ['a 3d model of a wicker chair on a black background']
HF generation: ['a 3d model of a wicker chair on a black background']
Original generation: ['chair']
HF generation: ['chair']
Original generation: ['it is a wicker chair']
HF generation: ['it is a wicker chair']
However note that all of this is with the conversion script. I will now double check whether the same can be done with the HF models which are available on the hub.
And by the way, you can't use torch.float16 for Flan-T5 checkpoints, as those were pre-trained using bfloat16. You can cast to bfloat16, though:
from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.bfloat16
)
model.to(device)
url = "https://user-images.githubusercontent.com/50018861/252267123-a49ec5be-d964-4760-9ef5-3f006a353720.png"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "Question: what is the structure and geometry of this chair?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.bfloat16)
generated_ids = model.generate(
    **inputs,
    do_sample=True,
    num_beams=5,
    max_length=30,
    min_length=1,
    top_p=0.9,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)
Hey @NielsRogge, thank you for your help, and forgive me for not replying sooner.
To be clear, I'm using the HF checkpoint here (https://huggingface.co/Salesforce/blip2-flan-t5-xxl/tree/main) and the LAVIS checkpoint https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth, not converting it on my own. I will do so if necessary...
I've tried your suggestion model.config.vision_config.layer_norm_eps = 1e-6, and found that the captions generated by LAVIS and HF are the same when I set do_sample=False, at least under some conditions. For example, I tried the merlion image, the woman-with-a-dog image and one more image, and the captions generated by the two models are the same.
However, things are different when I try other pictures, such as the one attached here, which is obviously the proto image of the former one. The captions are different.
What's more, when I use a batch size of 2 and input the screenshot image and the proto image together, the captions are as follows. huggingface:
# screenshot image
a chair with a backrest and a seat with a backrest and a seat with a backrest
a chair with a backrest and a seat with a backrest and a seat with a backrest and armrests
a chair with a backrest and a seat with a backrest and a seat with a backrest and a seat
a chair with a backrest and a seat with a backrest and armrests and a seat with a backrest
a chair with a backrest and a seat with a backrest and armrests
# proto image
it has a backrest and a seat with a backrest and a seat with a backrest
the sofa is a two seater with a backrest and armrests and a seat with a backrest and armrests
it has a backrest and a seat with a backrest and a seat with a backrest and a seat with
it has a backrest and a seat with a backrest and a seat with a backrest and a seat
it has a backrest and a seat with a backrest and a seat with a backrest and a footrest
lavis:
# screenshot image
a chair with a backrest and a seat with a backrest and a seat with a backrest
a chair with a backrest and a seat with a backrest and a seat with a backrest and a seat
a chair with a backrest and a seat with a backrest and a seat with a backrest and armrests
a chair with a backrest and a seat with a backrest and armrests
a chair with a backrest and a seat with a backrest and armrests and a seat with a backrest
# proto image
a wicker is a type of furniture made of woven cane or rattan
a wicker is a type of furniture that is made of woven rattan
a wicker is a type of furniture made from woven rattan
a wicker is a type of furniture made of woven rattan
a wicker is a type of furniture that is made of woven rattan or wicker
It can be seen that the screenshot image's captions are affected by batching it together with the proto image, let alone the proto image's own captions.
All in all, I don't think the HF model behaves the same as the LAVIS one. This phenomenon confuses me a lot. If you have any ideas, please let me know. Thank you very much.
BTW, I am so concerned about consistency because I want to save as much inference time as possible while maintaining consistent or similar inference results; graphics cards are really expensive. So another question is: why does the HF model seem to be approximately three times faster than the LAVIS model? And if the aforementioned differences are resolved, will HF still be faster?
Hi,
I've also checked sampling (providing use_nucleus_sampling=True in LAVIS and do_sample=True in HF Transformers), and using the same seed I'm getting the same results, at least with my conversion script. I've pushed a new checkpoint here: https://huggingface.co/nielsr/blip2-flan-t5-xl. Could you try comparing this checkpoint with the original LAVIS one?
Also note that LAVIS uses different dtypes for the various building blocks of BLIP-2 (it autocasts the vision encoder to torch.float16, and T5 to torch.bfloat16). So when comparing the two checkpoints, I forked the LAVIS repo and removed all half-precision casting to make sure I could compare both in float32.
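For reference, the float32 comparison I have in mind looks roughly like this (a sketch, not my exact conversion script: it assumes the forked LAVIS repo without half-precision casts is installed and reuses the Salesforce processor; the LAVIS model_type and generation settings are just illustrative):
import torch
import requests
from PIL import Image
from lavis.models import load_model_and_preprocess
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
url = "https://user-images.githubusercontent.com/50018861/252267123-a49ec5be-d964-4760-9ef5-3f006a353720.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# LAVIS side, kept in float32 (forked repo with the autocasts removed)
lavis_model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)
lavis_image = vis_processors["eval"](image).unsqueeze(0).to(device)
print(lavis_model.generate({"image": lavis_image}, num_beams=5, max_length=30))

# HF side, left in the default float32
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
hf_model = Blip2ForConditionalGeneration.from_pretrained("nielsr/blip2-flan-t5-xl").to(device)
inputs = processor(images=image, return_tensors="pt").to(device)
generated_ids = hf_model.generate(**inputs, num_beams=5, max_length=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))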
@NielsRogge Can I ask you how to set the random seed?
import torch
torch.manual_seed(0)
import random
random.seed(0)
import numpy as np
np.random.seed(0)
Is that OK?
I'm using:
from transformers import set_seed
set_seed(42)
The script I'm using is here: https://github.com/NielsRogge/transformers/blob/improve_blip2/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py.
So to reproduce you can do:
pip install -U git+https://github.com/nielsrogge/LAVIS.git@blip2_float32
git clone -b improve_blip2 https://github.com/NielsRogge/transformers.git
cd transformers
python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-flan-t5-xl"
Just bumping this, since it seems to affect the correctness of the current HF BLIP-2 behavior.
@NielsRogge - to clarify: if we're trying to use BLIP-2 from HF (with mixed precision), is the current checkpoint/HF code incorrect (e.g., model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16))? This is the code snippet from the model page (https://huggingface.co/Salesforce/blip2-flan-t5-xxl#running-the-model-on-gpu).
Furthermore, if LAVIS uses bfloat16 only for T5 but loads the vision backbone in float16 precision, then isn't the call Blip2ForConditionalGeneration.from_pretrained("Salesforce/...", ..., torch_dtype=torch.bfloat16) by itself problematic (if it moves the vision backbone to BF16)?
In the meantime, if I want to play it as safe as possible when replicating BLIP-2 results -- what's the recommended workflow for using BLIP-2 checkpoints? Should I just use LAVIS directly (@LiJunnan1992)?
So to clarify: LAVIS uses different dtypes for the building blocks of BLIP-2 (torch.float16 for the vision encoder, torch.float32 for the Q-Former, and then torch.float16 or torch.bfloat16 depending on whether OPT or Flan-T5 is used), so by default you won't get equivalent results to HF Transformers, as from_pretrained loads everything in torch.float32 - unless you specify a torch_dtype. I had to explicitly fork the LAVIS repo and remove all casts to dtypes other than float32 in order to compare apples to apples. So @siddk, yes, you're right that specifying torch_dtype=torch.float16 will cast all parameters to that dtype - which is not equivalent to LAVIS. However, apparently both torch.float16 and torch.bfloat16 are supposed to work fine for T5 checkpoints (see https://github.com/huggingface/transformers/issues/20287 for details).
Original generation: ['a yellow sofa with white legs on a black background']
HF generation: ['a yellow couch with white legs on a black background']
Also, the first message in this thread does not test equivalent results, because the OP uses different generation settings (beam search for LAVIS and greedy decoding for HF) for getting the "object" in the image. Moreover, they use different dtypes for the two implementations, which further explains the differences.
TLDR: make sure to compare apples to apples (same generation settings + dtypes)
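To make the dtype point concrete, here's a quick check (just a sketch): passing torch_dtype casts every component uniformly, which is not the per-component layout LAVIS uses:
import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.float16
)
# all three building blocks end up in float16, unlike the LAVIS layout described above
print({p.dtype for p in model.vision_model.parameters()})    # {torch.float16}
print({p.dtype for p in model.qformer.parameters()})         # {torch.float16}
print({p.dtype for p in model.language_model.parameters()})  # {torch.float16}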
Thanks so much @NielsRogge -- this is incredibly helpful. Interesting that T5 supposedly works for both FP16 and BF16... do you know if it's safe to run the vision backbone in BF16 precision (assuming you load in FP32 first, then cast to BF16)?
Or is that just a model-specific unknown?
@siddk per this answer: it looks like it's always fine to go from float32/float16 to bfloat16 (but not the other way around). And bfloat16 seems to be a lot more stable for training compared to float16
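A minimal sketch of that (assuming you load in float32 first and then cast everything, vision backbone included, down to bfloat16):
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
# load in float32 (the default), then cast the whole model down to bfloat16
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = model.to(torch.bfloat16).to("cuda")
# remember to cast the pixel values to bfloat16 as well, e.g.
# inputs = processor(images=image, return_tensors="pt").to("cuda", torch.bfloat16)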
Hi, thank you for your excellent work. I'm facing a problem using BLIP-2 (inference only) to generate captions, and I think you may have some clues about it.
Background
I'm trying Cap3D, which uses BLIP-2 as one of its components. In their code, they use this LAVIS implementation to generate captions for images rendered from a 3D model one at a time, which in my opinion is very slow. So I tried to generate captions for a batch of images at the same time.
Problems
For the LAVIS implementation, they chose pretrain_flant5xxl. For the HuggingFace implementation, I chose Salesforce/blip2-flan-t5-xxl, which I think should correspond to the former. So I guess they were trained in the same way and should share similar performance. However, I found that the LAVIS implementation is about 3x slower than the HuggingFace released model, while the LAVIS one generates captions of better quality. Aren't they the same?
How to reproduce it
LAVIS:
HuggingFace
https://github.com/salesforce/LAVIS/blob/f3212e7d57bf3bb6635f8ae0461f167103cee2b4/lavis/models/blip2_models/blip2_t5.py#L159-L171
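Roughly, the batched captioning on the HF side looks like the sketch below (the file paths, dtype and generation settings here are illustrative placeholders, not my exact code):
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.bfloat16
).to(device)

# placeholder paths for the rendered views of one 3D model
paths = ["renders/view0.png", "renders/view1.png", "renders/view2.png"]
images = [Image.open(p).convert("RGB") for p in paths]
inputs = processor(images=images, return_tensors="pt").to(device, torch.bfloat16)
generated_ids = model.generate(**inputs, num_beams=5, max_length=30)
captions = [c.strip() for c in processor.batch_decode(generated_ids, skip_special_tokens=True)]
print(captions)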
Results
LAVIS
HuggingFace
We can see that the time_generate1 values are quite similar, but the time_generate2 values are different. The example image is:
Environment
Ubuntu 20.04.4, Python 3.8.13, torch 1.13.0a0+d321be6, using a single A100 (actually A800) in my experiment.
I also tried wrapping the HF generate() in with torch.cuda.amp.autocast(dtype=torch.bfloat16): to align HF with LAVIS, and it is a bit slower (from 39s to 50s), but not slow enough to explain the difference. Please let me know if any information is missing for debugging. It would be appreciated if you could help me find out the reason for the difference.
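Concretely, the autocast attempt was along these lines (a sketch; model and inputs as in the batched snippet above, but loaded in the default float32 so that only the forward pass runs under bfloat16 autocast):
# weights stay in float32; only the forward computation is autocast to bfloat16
model_fp32 = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl").to(device)
inputs_fp32 = processor(images=images, return_tensors="pt").to(device)

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    generated_ids = model_fp32.generate(**inputs_fp32, num_beams=5, max_length=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))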