salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

The BLIP-2 implementation difference between this repo and HuggingFace #418

Open yuanze1024 opened 1 year ago

yuanze1024 commented 1 year ago

Hi, thank you for your excellent work. I'm facing a problem using BLIP-2 (inference only) to generate captions, and I think you may have some clues about it.

Background

I'm trying Cap3D, which uses BLIP-2 as a component. In their code, they use this LAVIS implementation to generate captions for images rendered from a 3D model, one image at a time, which in my opinion is very slow. So I tried to generate captions for a batch of images at once.

Problems

For the LAVIS implementation, they chose pretrain_flant5xxl. For the HuggingFace implementation, I chose Salesforce/blip2-flan-t5-xxl, which I think should correspond to the former. So I assumed they were trained the same way and should perform similarly. However, I found that the LAVIS implementation is about 3x slower than the released HuggingFace model, while the LAVIS one generates captions of better quality. Aren't they the same?

How to reproduce it

LAVIS:

from time import time
from pprint import pprint

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

NUM_SENTENCE = 5
BATCH_SIZE = 5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
time_generate1, time_generate2 = 0.0, 0.0

model, vis_processors, _ = load_model_and_preprocess(name='blip2_t5', model_type='pretrain_flant5xxl', is_eval=True, device=device)

def generate_coarse_captions(model, vis_processors, images_path_list) -> list:
    """
    Generate coarse captions batch by batch.
    Note: as written, only the captions of the last batch are returned (NUM_SENTENCE per image).
    """
    global time_generate1, time_generate2
    prompt = "Question: what object is in this image? Answer:"
    full_prompt = "Question: what is the structure and geometry of this %s?"
    while len(images_path_list) != 0:
        if len(images_path_list) > BATCH_SIZE:
            tmp_img_path_list = images_path_list[:BATCH_SIZE]
            images_path_list = images_path_list[BATCH_SIZE:]
        else:
            tmp_img_path_list = images_path_list
            images_path_list = []
        image_list = [Image.open(image_path).convert('RGB') for image_path in tmp_img_path_list]
        image_list = [vis_processors["eval"](image).unsqueeze(0).to(device) for image in image_list]
        batch_image = torch.cat(image_list, dim=0)
        tic = time()
        # first generate the object as a part of the prompt
        object_list = model.generate({"image": batch_image, "prompt": [prompt for _ in range(batch_image.shape[0])]}, max_length=5) 
        time_generate1 += time() - tic
        print(len(object_list))
        tic = time()
        # generate the details of this ${object}
        x = model.generate({"image": batch_image, "prompt": [full_prompt % object for object in object_list]}, use_nucleus_sampling=True, num_captions=5)
        time_generate2 += time() - tic
        pprint(x)
    return x

coarse_caption_list = generate_coarse_captions(model, vis_processors, images_path_list)

HuggingFace:

import os
from time import time

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

NUM_SENTENCE = 5
BATCH_SIZE = 5
device = "cuda" if torch.cuda.is_available() else "cpu"
# same prompts as in the LAVIS snippet above
prompt = "Question: what object is in this image? Answer:"
full_prompt = "Question: what is the structure and geometry of this %s?"
time_img_process, time_generate1, time_generate2 = 0.0, 0.0, 0.0
# imgs_dict maps a directory path to the list of image file names in that directory

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")

model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    torch_dtype=torch.float16,
    device_map="auto"
)

while len(imgs_dict) != 0:
    cnt = 0
    img_list = []
    path_list = []
    # img_prefix_list = []
    while len(imgs_dict) != 0 and cnt < BATCH_SIZE:
        cnt += 1
        path, img_name_list = imgs_dict.popitem()
        path_list.append(path)
        img_list.extend([Image.open(os.path.join(path, img_name)).convert('RGB') for img_name in img_name_list])

    tik = time()
    inputs = processor(img_list, [prompt for _ in range(len(img_list))], return_tensors="pt").to(device, torch.float16)
    time_img_process += (time() - tik)

    object_list = []
    tik = time()
    out = model.generate(**inputs)
    object_list = processor.batch_decode(out, skip_special_tokens=True)
    time_generate1 += (time() - tik)

    tik = time()
    inputs = processor(img_list, [full_prompt % object_list[_] for _ in range(len(object_list))],
                       return_tensors="pt", padding=True).to(device, torch.float16)
    time_img_process += (time() - tik)
    caption_list = []
    tik = time()
    # I just copied the default generation config as written in lavis/models/blip2_models/blip2_t5.py#L159, except for num_return_sequences and do_sample
    out = model.generate(**inputs, 
                         do_sample=True,
                         top_p=0.9,
                         temperature=1,
                         num_beams=5,
                         max_new_tokens=30,
                         min_length=1,
                         repetition_penalty=1.0,
                         length_penalty=1.0,
                         num_return_sequences=NUM_SENTENCE, 
                         )
    caption_list = processor.batch_decode(out, skip_special_tokens=True)
    time_generate2 += (time() - tik)

https://github.com/salesforce/LAVIS/blob/f3212e7d57bf3bb6635f8ae0461f167103cee2b4/lavis/models/blip2_models/blip2_t5.py#L159-L171

Results

LAVIS

time_generate1 is 6.1642115116119385
time_generate2 is 145.6894016265869
# and the example result:
['a wicker is a type of furniture that is made of woven rattan or reeds',
 'a wicker is a type of furniture made from woven rattan',
 'a wicker is a three dimensional woven material made of rattan or reeds',
 'a wicker is a woven rattan or wicker-like material made from rattan or wick',
 'a wicker is a woven or braided material made of rattan, reeds, cane']

HuggingFace

time_load_model: 55.04085111618042
time_img_process: 4.281462907791138
time_generate1: 6.973299026489258
time_generate2: 39.95578145980835
# example :
['a slatted backrest with a seat and a backrest with a slatted seat and backrest',
 'a wicker frame with a woven rattan seat and backrest and a woven rattan armrest',
 'a wicker frame with a slatted backrest and armrests',
 'a wicker frame with a seat and a backrest',
 'a slatted backrest with a slatted seat and a slatted armrest']

We can see that time_generate1 is about the same for both, but time_generate2 differs a lot. The example image is:

[example image: rendered wicker chair]

Environment

Ubuntu 20.04.4, Python 3.8.13, torch 1.13.0a0+d321be6, running on a single A100 (actually an A800).


I also tried wrapping the HF generate() call in with torch.cuda.amp.autocast(dtype=torch.bfloat16): to better align the HF and LAVIS setups. It is a bit slower (from ~39 s to ~50 s), but not slow enough to explain the difference. Please let me know if any information is missing for debugging. I would appreciate your help in finding the reason for the difference.
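
For reference, here is roughly how the autocast wrapper is applied (a sketch; the variable names are the ones from the HF snippet above):

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    out = model.generate(**inputs,
                         do_sample=True,
                         top_p=0.9,
                         num_beams=5,
                         max_new_tokens=30,
                         num_return_sequences=NUM_SENTENCE)
caption_list = processor.batch_decode(out, skip_special_tokens=True)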

LiJunnan1992 commented 1 year ago

Can I know your transformers version?

yuanze1024 commented 1 year ago

transformers==4.30.2. I know it is not the exact version you suggest, but that one seems incompatible with some other environments I use...

LiJunnan1992 commented 1 year ago

@NielsRogge Could you provide some suggestions? Thanks!

NielsRogge commented 1 year ago

Hi,

Thanks for reporting. I've re-run the conversion script (to convert the LAVIS checkpoints to the HF format), and I've noticed setting layer_norm_eps=1e-6 (instead of 1e-5 as it is now) results in logits that match up to a tolerance of 1e-4:

Running

git clone https://github.com/huggingface/transformers.git
cd transformers
python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-flan-t5-xl"

results in:

First values of original logits: tensor([[-41.5262,  -4.4239,  -8.9971],
        [-47.3791,  -5.8891,  -1.7391]], device='cuda:2')
First values of HF logits: tensor([[-41.5262,  -4.4239,  -8.9971],
        [-47.3791,  -5.8891,  -1.7392]], device='cuda:1')
Looks ok!
Generating a caption...
Original generation: ['marina bay sands, singapore']
HF generation: ['marina bay sands, singapore']

So this only passes if I update the layer_norm_eps of the vision encoder's config.

@yuanze1024 can you try by updating model.config.vision_config.layer_norm_eps = 1e-6 after instantiating the model, but before generating?
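
A minimal sketch of the suggested placement (assuming the xxl checkpoint and loading code from the snippet above):

import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16, device_map="auto"
)
# align the vision encoder's LayerNorm epsilon with the value used for the LAVIS checkpoint
model.config.vision_config.layer_norm_eps = 1e-6
# ... then call model.generate(...) exactly as before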

NielsRogge commented 1 year ago

Update; I've also tested it using your image, and the following prompts:

  1. prompt = "" (unconditional generation):
    Original generation: ['a 3d model of a wicker chair on a black background']
    HF generation: ['a 3d model of a wicker chair on a black background']
  2. prompt = "Question: what object is in this image? Answer:"
    Original generation: ['chair']
    HF generation: ['chair']
  3. prompt = "Question: what is the structure and geometry of this chair?"
    Original generation: ['it is a wicker chair']
    HF generation: ['it is a wicker chair']

    However note that all of this is with the conversion script. I will now double check whether the same can be done with the HF models which are available on the hub.

NielsRogge commented 1 year ago

And by the way, you can't use torch.float16 for Flan-T5 checkpoints as those were pre-trained using bfloat16. You can cast to bfloat16, though:

from PIL import Image
import requests
from transformers import Blip2Processor, Blip2ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xl", torch_dtype=torch.bfloat16
)
model.to(device)
url = "https://user-images.githubusercontent.com/50018861/252267123-a49ec5be-d964-4760-9ef5-3f006a353720.png"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "Question: what is the structure and geometry of this chair?"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, torch.bfloat16)

generated_ids = model.generate(**inputs, 
                        do_sample=True,
                        num_beams=5,
                        max_length=30,
                        min_length=1,
                        top_p=0.9,
                        repetition_penalty=1.0,
                        length_penalty=1.0,
                        temperature=1,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
print(generated_text)

yuanze1024 commented 1 year ago

Hey @NielsRogge, thank you for your help, and forgive me for not replying sooner.

To be clear, I'm using the HF checkpoint here (https://huggingface.co/Salesforce/blip2-flan-t5-xxl/tree/main) and the LAVIS checkpoint https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xxl.pth, not converting one on my own. I will do that if necessary...

I've tried your suggestion model.config.vision_config.layer_norm_eps = 1e-6 and found that in some cases the captions generated by LAVIS and HF are the same when I set do_sample=False. For example, for the merlion, the woman with a dog, and the chair picture I uploaded, the captions generated by the two models match.

However, things are different with other pictures, such as render_0_Z_60, which is obviously the proto (original) image behind the screenshot above. The captions are different.

What's more, when I use a batch size of 2 and feed the screenshot image and the proto image together, the captions come out as follows. HuggingFace:

# screenshot image
a chair with a backrest and a seat with a backrest and a seat with a backrest
a chair with a backrest and a seat with a backrest and a seat with a backrest and armrests
a chair with a backrest and a seat with a backrest and a seat with a backrest and a seat
a chair with a backrest and a seat with a backrest and armrests and a seat with a backrest
a chair with a backrest and a seat with a backrest and armrests
# proto image
it has a backrest and a seat with a backrest and a seat with a backrest
the sofa is a two seater with a backrest and armrests and a seat with a backrest and armrests
it has a backrest and a seat with a backrest and a seat with a backrest and a seat with 
it has a backrest and a seat with a backrest and a seat with a backrest and a seat
it has a backrest and a seat with a backrest and a seat with a backrest and a footrest

lavis:

# screenshot image
a chair with a backrest and a seat with a backrest and a seat with a backrest
a chair with a backrest and a seat with a backrest and a seat with a backrest and a seat
a chair with a backrest and a seat with a backrest and a seat with a backrest and armrests
a chair with a backrest and a seat with a backrest and armrests
a chair with a backrest and a seat with a backrest and armrests and a seat with a backrest
# proto image
a wicker is a type of furniture made of woven cane or rattan
a wicker is a type of furniture that is made of woven rattan
a wicker is a type of furniture made from woven rattan
a wicker is a type of furniture made of woven rattan
a wicker is a type of furniture that is made of woven rattan or wicker

It can be seen that the screenshot image's captions are affected by the proto image being in the same batch, let alone the proto image's own captions.

All in all, I don't think the HF model behaves the same as the LAVIS one. This phenomenon confuses me a lot. If you have any ideas, please let me know. Thank you very much.


BTW, I care about this consistency issue because I want to save as much inference time as possible while keeping the results consistent or at least similar. You know, graphics cards are really expensive. So another question: why does the HF model seem to be about three times faster than the LAVIS model? And if the differences above are resolved, will HF still be faster?

NielsRogge commented 1 year ago

Hi,

I've also checked sampling (providing use_nucleus_sampling=True in LAVIS and do_sample=True in HF Transformers), and using the same seed I'm getting the same results, at least with my conversion script. I've pushed a new checkpoint here: https://huggingface.co/nielsr/blip2-flan-t5-xl. Could you try comparing this checkpoint with the original LAVIS one?

Also note that LAVIS uses different dtypes for the various building blocks of BLIP-2 (it autocasts the vision encoder to torch.float16 and runs T5 in torch.bfloat16). So when comparing the two checkpoints, I forked the LAVIS repo and removed all half-precision casting to make sure I could compare both in float32.
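
For reference, a rough apples-to-apples sketch (assuming the float32 LAVIS fork referenced further below, the xl variants, the same seed and generation settings on both sides, and enough memory for two float32 models; the image path is a placeholder):

import torch
from PIL import Image
from transformers import set_seed, Blip2Processor, Blip2ForConditionalGeneration
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("render_0_Z_60.png").convert("RGB")  # placeholder path
prompt = "Question: what is the structure and geometry of this chair?"

# LAVIS model, kept in float32 (requires the fork that removes half-precision casting)
lavis_model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_t5", model_type="pretrain_flant5xl", is_eval=True, device=device
)
lavis_image = vis_processors["eval"](image).unsqueeze(0).to(device)

# HF model, also float32 (no torch_dtype argument)
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xl")
hf_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl").to(device)
hf_inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

set_seed(42)
lavis_captions = lavis_model.generate(
    {"image": lavis_image, "prompt": [prompt]}, use_nucleus_sampling=True, num_captions=5
)

set_seed(42)
hf_ids = hf_model.generate(**hf_inputs, do_sample=True, top_p=0.9, num_beams=5,
                           max_new_tokens=30, num_return_sequences=5)
hf_captions = processor.batch_decode(hf_ids, skip_special_tokens=True)

print(lavis_captions)
print(hf_captions)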

yuanze1024 commented 1 year ago

@NielsRogge Can I ask you how to set the random seed?

import torch
torch.manual_seed(0)
import random
random.seed(0)
import numpy as np
np.random.seed(0)

Is that OK?

NielsRogge commented 1 year ago

I'm using:

from transformers import set_seed
set_seed(42)

The script I'm using is here: https://github.com/NielsRogge/transformers/blob/improve_blip2/src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py.

So to reproduce you can do:

pip install -U git+https://github.com/nielsrogge/LAVIS.git@blip2_float32
git clone -b improve_blip2 https://github.com/NielsRogge/transformers.git
cd transformers
python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-flan-t5-xl"
siddk commented 1 year ago

Just bumping this, since it seems to affect correctness of current HF BLIP-2 behavior.

@NielsRogge - to clarify: if we're trying to use BLIP-2 from HF (with mixed precision), is the current checkpoint/HF code incorrect (e.g., model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16))? This is the code snippet from the model page (https://huggingface.co/Salesforce/blip2-flan-t5-xxl#running-the-model-on-gpu).

Furthermore, if LAVIS uses bfloat16 only for T5, but loads the vision backbone in float16 precision, then isn't the call to BLIP2ForConditionalGeneration.from_pretrained("Salesforce/..., ..., torch_dtype=torch.bfloat16) by itself problematic (if it moves the vision backbone to BF16)?


In the meantime, if I want to play it the safest for replicating BLIP-2 results -- what's the recommended workflow for using BLIP-2 checkpoints? Should I just use LAVIS directly (@LiJunnan1992)?

NielsRogge commented 1 year ago

So to clarify: the first message in this thread does not compare equivalent setups, because the OP uses different generation settings (beam search for LAVIS and greedy decoding for HF) to obtain the "object" in the image. Moreover, he uses different dtypes for the two implementations, which further explains the differences.

TLDR: make sure to compare apples to apples (same generation settings + dtypes)

siddk commented 1 year ago

Thanks so much @NielsRogge -- this is incredibly helpful. Interesting that T5 supposedly works for both FP16 and BF16... do you know if it's safe to run the vision backbone in BF16 precision (assuming you load in FP32 first, then cast to BF16)?

Or is that just a model-specific unknown?

NielsRogge commented 1 year ago

@siddk per this answer: it looks like it's always fine to go from float32/float16 to bfloat16 (but not the other way around). And bfloat16 seems to be a lot more stable for training compared to float16
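
For example, a minimal sketch of that approach (load the default float32 weights, then cast the whole model, vision backbone included, to bfloat16):

import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")  # float32 by default
model = model.to(torch.bfloat16).to("cuda")  # cast everything, including the vision backbone, to bf16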