Unable to use BLIP2 with caption_coco_opt6.7b at HEAD via salesforce-lavis (also HEAD)

AstraliteHeart commented 1 year ago

System Info

working:

transformers version: 4.26.1
Platform: Linux-6.0.12-x86_64-with-glibc2.10
Python version: 3.8.16
Huggingface_hub version: 0.12.0
PyTorch version (GPU?): 1.13.1+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

broken:

transformers version: 4.27.0.dev0
Platform: Linux-6.0.12-x86_64-with-glibc2.10
Python version: 3.8.16
Huggingface_hub version: 0.12.0
PyTorch version (GPU?): 1.13.1+cu117 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: yes
Using distributed or parallel set-up in script?: no

Who can help?

@gante @NielsRogge

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

Start with clean env setup via https://github.com/salesforce/LAVIS/blob/main/requirements.txt (transformers-4.26.1)
Run python test_simple.py, model is correctly loaded and prints a caption
pip install --upgrade git+https://github.com/huggingface/transformers (I wanted the new shiny blip2 conversion script so I can convert my finetuned model into HF format)
Resolved https://github.com/huggingface/transformers to commit 8b3db33a763ccef828fca89bac7e6cbff314f131
Run python test_simple.py
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.

import requests
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

# Use the GPU when available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the COCO-captioning BLIP2 OPT-6.7B checkpoint and its image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(name="blip2_opt", model_type="caption_coco_opt6.7b", is_eval=True, device=device)

url = "..."
raw_image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
# Preprocess the image and add a batch dimension.
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# Generate a caption; this is the call that raises the RuntimeError above.
data = model.generate({"image": image})
print(data)

Expected behavior

Can use BLIP2 with latest HF

sgugger commented 1 year ago

cc @younesbelkada

gante commented 1 year ago

Hey @AstraliteHeart 👋 This issue seems to be a duplicate of https://github.com/huggingface/transformers/issues/21599, which is fixed.

Can I ask you to try to run your script using transformers main branch, i.e. after installing with pip install --upgrade git+https://github.com/huggingface/transformers.git?

AstraliteHeart commented 1 year ago

I don't think this is a duplicate; my env is past that fix (see step 4 in the original repro steps). I've updated from main to confirm, as follows:

pip install --upgrade git+https://github.com/huggingface/transformers.git
Resolved https://github.com/huggingface/transformers.git to commit bb5a2f2fc30985841289207b9f1f7765d8abc4e0
python test_simple.py
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.

gante commented 1 year ago

Thank you for confirming @AstraliteHeart 🤗 I will dig deeper and let you know what I find!

gante commented 1 year ago

After some digging, we can see that the exception is raised as follows:

│ /home/joao/hf/lib/python3.10/site-packages/lavis/models/blip2_models/modeling_opt.py:703 in      │
│ forward                                                                                          │
│                                                                                                  │
│    700 │   │   │   inputs_embeds = self.embed_tokens(input_ids)                                  │
│    701 │   │                                                                                     │
│    702 │   │   if query_embeds is not None:                                                      │
│ ❱  703 │   │   │   inputs_embeds = torch.cat([query_embeds, inputs_embeds], dim=1)               │
│    704 │   │   │   input_shape = inputs_embeds.size()[:-1]                                       │
│    705 │   │                                                                                     │
│    706 │   │   # embed positions                                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 25 but got size 5 for tensor number 1 in the list.

From the full stack trace, we can conclude that the error arises from an issue in lavis, and not in transformers :) Actually, the root cause for this issue is something that we have addressed on this PR -- lavis has a different implementation, where they have a modified OPT model to handle the image embeddings, whereas we decided to update .generate() to handle soft-prompting.
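To make the error message concrete, here is a minimal pure-Python sketch of the shape rule that torch.cat enforces: when concatenating along dim=1, every other dimension must match. The helper name check_cat_shapes and the example shapes are hypothetical; only the sizes 25 and 5 come from the traceback, and the assumption that they sit in the batch dimension is a guess based on the message.

```python
def check_cat_shapes(shapes, dim):
    """Return the shape produced by concatenating along `dim`.

    Mirrors the rule behind the RuntimeError above: all dimensions
    except `dim` must agree. This is an illustration, not PyTorch code.
    """
    first = shapes[0]
    for index, shape in enumerate(shapes[1:], start=1):
        if len(shape) != len(first):
            raise ValueError("all tensors must have the same number of dimensions")
        for axis, (expected, got) in enumerate(zip(first, shape)):
            if axis != dim and expected != got:
                raise ValueError(
                    f"Sizes of tensors must match except in dimension {dim}. "
                    f"Expected size {expected} but got size {got} "
                    f"for tensor number {index} in the list."
                )
    result = list(first)
    result[dim] = sum(shape[dim] for shape in shapes)
    return tuple(result)


# Hypothetical shapes: (batch, sequence, hidden). Only 25 and 5 are from the trace.
query_embeds_shape = (25, 32, 4096)
inputs_embeds_shape = (5, 4, 4096)

try:
    check_cat_shapes([query_embeds_shape, inputs_embeds_shape], dim=1)
except ValueError as error:
    print(error)
```

Under that reading, one of the two tensors was expanded to 25 rows while the other stayed at 5, so the concatenation along the sequence dimension cannot proceed.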

@AstraliteHeart This means you have two options:

Update your code to rely on transformers, as opposed to lavis. See here for examples.
Open an issue in lavis, so they can help you with this issue :)
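For the first option, a minimal sketch of captioning through transformers alone might look like this. The helper name caption_with_transformers, the checkpoint id Salesforce/blip2-opt-6.7b-coco, and max_new_tokens=30 are assumptions for illustration and are not taken from this thread; a finetuned model would need to be converted to HF format first.

```python
def caption_with_transformers(image, model_id="Salesforce/blip2-opt-6.7b-coco", max_new_tokens=30):
    """Caption a PIL image with the transformers BLIP-2 classes.

    Heavy imports are kept inside the function so that defining it
    does not require torch or transformers to be installed.
    """
    import torch
    from transformers import Blip2ForConditionalGeneration, Blip2Processor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Half precision on GPU to save memory; full precision on CPU.
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = Blip2Processor.from_pretrained(model_id)
    model = Blip2ForConditionalGeneration.from_pretrained(model_id, torch_dtype=dtype)
    model.to(device)
    model.eval()

    # The processor resizes and normalizes the image into pixel_values.
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
```

Calling caption_with_transformers(raw_image) with the RGB PIL image from the reproduction script would download the checkpoint on first use, so expect a large download and substantial memory use.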

AstraliteHeart commented 1 year ago

@gante thank you for debugging!

I can confirm that syncing before https://github.com/huggingface/transformers/pull/21405 (edc1e734bfc01109b8c66881d950ebbda032a6d2) works. I'll open an issue on the SF side to warn them about the breakage. Unfortunately, this brings me back to the original issue of trying to use convert_blip_2_original_to_pytorch.py; perhaps you can help me figure out how the BLIP2 models were converted? (I understand this is irrelevant to most users and matters only to the few brave souls who are finetuning BLIP2 via LAVIS but then want to load it in HF.)

I've tried both pip install git+https://github.com/nielsrogge/LAVIS.git@fix_lavis (mentioned in the script) and lavis from HEAD, but I am getting this trace:

$ python ./convert_blip_2_original_to_pytorch.py
Loading original model...
Position interpolate from 16x16 to 26x26
tokenizer facebook/opt-6.7b
Loading checkpoint shards: Done!
Traceback (most recent call last):
  File "./convert_blip_2_original_to_pytorch.py", line 304, in <module>
    convert_blip2_checkpoint(args.model_name, args.pytorch_dump_folder_path, args.push_to_hub)
  File "/.../envs/lavis/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "./convert_blip_2_original_to_pytorch.py", line 216, in convert_blip2_checkpoint
    original_logits = original_logits.logits
AttributeError: 'dict' object has no attribute 'logits' // indeed, this is a dictionary containing only 'loss'

What combination of versions of transformers and lavis was used during conversion?
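When debugging a failure like the AttributeError above, a small defensive helper can report what the model actually returned instead of failing on attribute access. This is only a sketch: extract_logits is a hypothetical name and is not part of the conversion script.

```python
def extract_logits(output):
    """Return logits from an output object or a plain dict.

    Raises a KeyError that lists the available keys when no logits
    are present, which makes a loss-only output easy to spot.
    """
    # Output objects that expose logits as an attribute.
    if hasattr(output, "logits"):
        return output.logits
    # Plain dictionaries that carry a "logits" entry.
    if isinstance(output, dict) and "logits" in output:
        return output["logits"]
    if isinstance(output, dict):
        available = sorted(output.keys())
    else:
        available = type(output).__name__
    raise KeyError(f"model output has no logits; got {available}")


# A loss-only dictionary, like the one described in the trace above.
try:
    extract_logits({"loss": 0.0})
except KeyError as error:
    print(error)
```

An output that contains only 'loss' suggests that the model class being called is not the one the script expects.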

NielsRogge commented 1 year ago

Hi,

Thanks for converting BLIP2 to HF :) I actually forked the LAVIS repo and made some tweaks to facilitate conversion (I removed a bunch of unnecessary requirements etc). See here.

AstraliteHeart commented 1 year ago

Hi Niels, thank you for checking this.

I did use your fork (or so I thought, sigh), but I redid everything from scratch while comparing traces with code and, well... it turned out I had moved my blip2 conversion script to the LAVIS git root folder, so it kept importing their model (as it's in the local lavis folder) even with your fixed one installed. I do apologize.
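A quick way to catch this kind of shadowing is to ask Python which file it would import for a given package name. The sketch below uses only the standard library; module_origin is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would import for `name`, or None if not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# If this prints a path inside the current checkout rather than site-packages,
# a local folder is shadowing the installed package.
print(module_origin("lavis"))
```

Running it from the same directory as the conversion script shows whether the local lavis folder or the installed fork wins.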

I can now confirm that with your fork I was able to convert my model with a snapshot before https://github.com/huggingface/transformers/pull/21405 and load it in 8 bits with the latest bitsandbytes, keeping VRAM usage at 11.1GB (vs around 18.5GB without).
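As a rough sanity check on those numbers, here is a weights-only estimate. It assumes the language model has about 6.7 billion parameters (inferred from the checkpoint name, not measured) and ignores the vision encoder, activations, and any layers left unquantized, so it is only a ballpark figure.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Estimate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


ASSUMED_LM_PARAMS = 6.7e9  # assumption based on the "opt6.7b" name

fp16_gb = weight_memory_gb(ASSUMED_LM_PARAMS, 2)  # 2 bytes per float16 weight
int8_gb = weight_memory_gb(ASSUMED_LM_PARAMS, 1)  # 1 byte per int8 weight
print(f"fp16 weights: {fp16_gb:.1f} GB, int8 weights: {int8_gb:.1f} GB")
print(f"estimated saving: {fp16_gb - int8_gb:.1f} GB")
```

The estimated saving of a bit under 7 GB is in the same range as the observed drop from 18.5GB to 11.1GB; the remaining difference would come from parts this estimate leaves out.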

Do you have any guidance on matching outputs between lavis and hf models? I ran about 50 samples through lavis/hf16/hf8, and while hf16 and hf8 are mostly consistent (good), lavis output is better in all cases (see anecdotal examples below).

Here is roughly how I load and run all models (https://gist.github.com/AstraliteHeart/4d7ebf834021b8e1c9bc439c1633002c). I tried to make sure all settings and random seeds match, but perhaps I am missing something?

https://derpicdn.net/img/view/2023/2/23/3051871.png

'caption_lavis': ['scootaloo, apple bloom, and applejack in a group hug scootaloo, apple bloom, and applejack are all smiling white background', 'scootaloo, applebloom, and applejack in a group hug scootaloo and applebloom are jumping applejack is smiling white background', 'scootaloo, apple bloom, and applejack in a group hug scootaloo, apple bloom, and applejack are jumping and smiling white background'],
'caption_hf_16': ['a series of images of sweetie belle, applejack, scootaloo, applebloom, rarity, pinkie pie, twilight sparkle, rarity, twilight sparkle, rarity, rarity, rarity, rarity, rarity, rarity', 'a series of images of sweetie belle, applejack, scootaloo, applebloom, rarity, pinkie pie, twilight sparkle, rarity, twilight sparkle, twilight sparkle, twilight sparkle, twilight sparkle', 'a series of images of sweetie belle, applejack, scootaloo, applebloom, rarity, pinkie pie, twilight sparkle, rarity, twilight sparkle, twilight sparkle, rarity, rarity, rarity, rarity'],
'caption_hf_8': ['a series of images of sweetie belle, applebloom, scootaloo, applejack, rarity, pinkie pie, twilight sparkle, fluttershy, rarity, pinkie pie, twilight sparkle, rarity, pink', 'a series of images of sweetie belle, applebloom, scootaloo, applejack, rarity, pinkie pie, twilight sparkle, fluttershy, rarity, pinkie pie, twilight sparkle, twilight sparkle', 'a series of images of sweetie belle, applebloom, scootaloo, applejack, rarity, pinkie pie, twilight sparkle, fluttershy, rarity, twilight sparkle, twilight sparkle, twilight sparkle']

https://derpicdn.net/img/2017/7/7/1480500/large.png

'caption_lavis': ['alicorn twilight sparkle is laying on her back with a book on her head and a book on her chest she is surrounded by books on the floor and on the walls she has a book on her head and a book on her chest she is', 'alicorn twilight sparkle is laying on her back with a book on her head and a book on her chest she is surrounded by books on the floor and on the walls she is also wearing a book on her head and a book on her chest', 'alicorn twilight sparkle is laying on her back with a book on her head and a book on her chest she is surrounded by books on the floor and on the walls she has a book on her head and a book on her chest she has'],
'caption_hf_16': ['posterior view of twilight sparkle lying on the floor surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by', 'posterior view of twilight sparkle lying on the floor surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books\n', 'posterior view of twilight sparkle lying on the floor surrounded by a pile of books, surrounded by a pile of books, surrounded by books, surrounded by books, surrounded by books, surrounded by books, surrounded by books, surrounded by books'],
'caption_hf_8': ['twilight sparkle is lying on the floor surrounded by a pile of books she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor she','twilight sparkle is lying on the floor surrounded by a pile of books she is surrounded by a pile of books on top of her she is surrounded by a pile of books on top of her she is surrounded by a pile of books on top', 'twilight sparkle is lying on the floor surrounded by a pile of books she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor her']

NielsRogge commented 1 year ago

Thanks for reporting, that should not be the case! I extensively tested the greedy/beam search outputs on original vs my implementation to make sure everything works as expected.

But the generate method has had some updates now so there might be a small issue. However isn't it weird that the first token is already different? cc'ing @gante here
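To pin down where two captions start to differ, a small word-level comparison can help. This sketch splits on whitespace, so it works on words rather than model tokens; first_divergence is a hypothetical helper and the example strings are shortened from the captions posted above.

```python
from itertools import zip_longest


def first_divergence(reference, candidate):
    """Return (index, reference_word, candidate_word) at the first mismatch.

    Returns None when both captions contain the same words. A missing
    word on either side is reported as None.
    """
    pairs = zip_longest(reference.split(), candidate.split())
    for index, (ref_word, cand_word) in enumerate(pairs):
        if ref_word != cand_word:
            return index, ref_word, cand_word
    return None


lavis_caption = "scootaloo, apple bloom, and applejack in a group hug"
hf_caption = "a series of images of sweetie belle, applejack, scootaloo"
print(first_divergence(lavis_caption, hf_caption))
```

A mismatch at index 0 points at the inputs or the first forward pass rather than at the decoding loop accumulating small differences.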

NielsRogge commented 1 year ago

Also, I'm not sure you can run both LAVIS and the Transformers main branch in the same environment to compare, because LAVIS relies on an older version of Transformers.
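Recording the installed versions next to each set of results makes such comparisons easier to reproduce. A minimal sketch using the standard library; installed_versions is a hypothetical helper, and the distribution names listed are assumptions about how the packages are registered with pip.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed pip distribution names for the packages discussed in this thread.
print(installed_versions(["transformers", "salesforce-lavis", "torch", "bitsandbytes"]))
```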

AstraliteHeart commented 1 year ago

Results on top are from transformers https://gist.github.com/AstraliteHeart/4d7ebf834021b8e1c9bc439c1633002c + your fork of lavis.

Some more tests (tl;dr: the latest transformers still does not produce the same output)

Official lavis repo:

['scootaloo, apple bloom, and applejack in a group hug scootaloo, apple bloom, and applejack are all smiling white background', 'scootaloo, applebloom, and applejack in a group hug scootaloo and applebloom are jumping applejack is smiling white background', 'scootaloo, apple bloom, and applejack in a group hug scootaloo, apple bloom, and applejack are jumping and smiling white background']

['alicorn twilight sparkle is laying on her back with a book on her head and a book on her chest she is surrounded by books on the floor and on the walls she has a book on her head and a book on her chest she is', 'alicorn twilight sparkle is laying on her back with a book on her head and a book on her chest she is surrounded by books on the floor and on the walls she is also wearing a book on her head and a book on her chest', 'alicorn twilight sparkle is laying on her back with a book on her head and a book on her chest she is surrounded by books on the floor and on the walls she has a book on her head and a book on her chest she has']

Latest transformers:

'caption_hf_16': [
            'a series of images of sweetie belle, applejack, scootaloo, applebloom, rarity, pinkie pie, twilight sparkle, rarity, twilight sparkle, rarity, rarity, rarity, rarity, rarity, rarity',
            'a series of images of sweetie belle, applejack, scootaloo, applebloom, rarity, pinkie pie, twilight sparkle, rarity, twilight sparkle, twilight sparkle, twilight sparkle, twilight sparkle',
            'a series of images of sweetie belle, applejack, scootaloo, applebloom, rarity, pinkie pie, twilight sparkle, rarity, twilight sparkle, twilight sparkle, rarity, rarity, rarity, rarity'
],
'caption_hf_8': [
            'a series of images of sweetie belle, applebloom, scootaloo, applejack, rarity, pinkie pie, twilight sparkle, fluttershy, rarity, pinkie pie, twilight sparkle, rarity, pink',
            'a series of images of sweetie belle, applebloom, scootaloo, applejack, rarity, pinkie pie, twilight sparkle, fluttershy, rarity, pinkie pie, twilight sparkle, twilight sparkle',
            'a series of images of sweetie belle, applebloom, scootaloo, applejack, rarity, pinkie pie, twilight sparkle, fluttershy, rarity, twilight sparkle, twilight sparkle, twilight sparkle'
]

'caption_hf_16': [
            'posterior view of twilight sparkle lying on the floor surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by',
            'posterior view of twilight sparkle lying on the floor surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books, surrounded by a pile of books\n',
            'posterior view of twilight sparkle lying on the floor surrounded by a pile of books, surrounded by a pile of books, surrounded by books, surrounded by books, surrounded by books, surrounded by books, surrounded by books, surrounded by books'
],
'caption_hf_8': [
            'twilight sparkle is lying on the floor surrounded by a pile of books she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor she',
            'twilight sparkle is lying on the floor surrounded by a pile of books she is surrounded by a pile of books on top of her she is surrounded by a pile of books on top of her she is surrounded by a pile of books on top',
            'twilight sparkle is lying on the floor surrounded by a pile of books she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor she is surrounded by a pile of books on the floor her'
]

gante commented 1 year ago

Hey @AstraliteHeart 👋 Differences in generation can be explained by many parts of the stack, from ninja numerical bugs to intentional implementation quirks. Debugging the exact cause takes time, so I want to ask for your help :D

Can you confirm that both lavis and transformers are recent versions? (latest release or newer)
Comparing results with sampling is impossible, as minor changes like the order of operations will produce different results. Have you confirmed that the results are different without sampling? (you can ensure that it is not sampling if you are not setting seeds and you're still getting the same outputs)
(If the answers to the questions above are positive) Can you please share a gist like the one you shared above, except without reliance on local data? It would help me get started 🤗
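The check in the second point can be automated: call the generation function several times without setting any seed and see whether the outputs agree. A minimal sketch, where is_deterministic is a hypothetical helper and generate_fn stands for any zero-argument callable that wraps the model's generate call.

```python
def is_deterministic(generate_fn, runs=3):
    """Return True if `generate_fn` gives identical output on every run.

    No seed is set here on purpose: matching outputs across unseeded
    runs indicate that sampling is not in use.
    """
    first = generate_fn()
    return all(generate_fn() == first for _ in range(runs - 1))


# Stand-in for a real call such as: lambda: model.generate({"image": image})
print(is_deterministic(lambda: ["a caption"]))
```

If this returns False for either implementation, the outputs cannot be compared directly until sampling is turned off.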

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

TalhaUusuf commented 1 year ago

u have any guidance on matching outputs between lav

Can you please explain how you managed to convert this? I am also stuck. Is there a specific transformers version required?

NielsRogge commented 1 year ago

I have a PR here which aims to further verify equivalence: https://github.com/huggingface/transformers/pull/24854.

The conversion script can be found here and can be run as follows:

pip install -U git+https://github.com/nielsrogge/LAVIS.git@blip2_float32
git clone -b improve_blip2 https://github.com/nielsrogge/transformers.git
cd transformers
python src/transformers/models/blip_2/convert_blip_2_original_to_pytorch.py --model_name "blip2-flan-t5-xl"

The reason I forked LAVIS is to make sure I can compare both implementations using float32.
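One reason a float32 comparison is useful: half precision can collapse nearby values into the same number, which hides or creates differences between implementations. The sketch below rounds values to IEEE half precision with the standard library's struct module; to_float16 is a hypothetical helper and the two logit values are made up for illustration.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Two made-up logits that are distinct in float32 or float64.
logit_a = 10.001
logit_b = 10.003

print(to_float16(logit_a), to_float16(logit_b))
print(to_float16(logit_a) == to_float16(logit_b))
```

Between 8 and 16, half precision can only represent multiples of 2**-7 (about 0.0078), so both values round to the same number and any ordering between them is lost.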
