Size mismatch loading Pixtral with LlavaForConditionalGeneration

RonanKMcGovern commented 2 weeks ago

System Info

transformers version: 4.45.0.dev0
Platform: Linux-5.4.0-167-generic-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.25.0
Safetensors version: 0.4.5
Accelerate version: 0.34.2
Accelerate config: not found
PyTorch version (GPU?): 2.4.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?: yes
GPU type: NVIDIA A40

Who can help?

@amyeroberts @ArthurZucker

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

I'm running the exact code shown on this page:

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "hf-internal-testing/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

IMG_URLS = [
    "https://picsum.photos/id/237/400/300",
    "https://picsum.photos/id/231/200/300",
    "https://picsum.photos/id/27/500/500",
    "https://picsum.photos/id/17/150/600",
]
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

inputs = processor(images=IMG_URLS, text=PROMPT, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

EXPECTED_GENERATION = """
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
   - **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
   - **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
   - **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
   - **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
   - **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
   - **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
   - **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
   - **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""

Error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[4], line 5
      2 from PIL import Image
      4 model_id = "hf-internal-testing/pixtral-12b"
----> 5 model = LlavaForConditionalGeneration.from_pretrained(model_id,cache_dir='').to("cuda")
      6 processor = AutoProcessor.from_pretrained(model_id)
      8 IMG_URLS = [
      9     "https://picsum.photos/id/237/400/300",
     10     "https://picsum.photos/id/231/200/300",
     11     "https://picsum.photos/id/27/500/500",
     12     "https://picsum.photos/id/17/150/600",
     13 ]

File /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:3984, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
   3974     if dtype_orig is not None:
   3975         torch.set_default_dtype(dtype_orig)
   3977     (
   3978         model,
   3979         missing_keys,
   3980         unexpected_keys,
   3981         mismatched_keys,
   3982         offload_index,
   3983         error_msgs,
-> 3984     ) = cls._load_pretrained_model(
   3985         model,
   3986         state_dict,
   3987         loaded_state_dict_keys,  # XXX: rename?
   3988         resolved_archive_file,
   3989         pretrained_model_name_or_path,
   3990         ignore_mismatched_sizes=ignore_mismatched_sizes,
   3991         sharded_metadata=sharded_metadata,
   3992         _fast_init=_fast_init,
   3993         low_cpu_mem_usage=low_cpu_mem_usage,
   3994         device_map=device_map,
   3995         offload_folder=offload_folder,
   3996         offload_state_dict=offload_state_dict,
   3997         dtype=torch_dtype,
   3998         hf_quantizer=hf_quantizer,
   3999         keep_in_fp32_modules=keep_in_fp32_modules,
   4000         gguf_path=gguf_path,
   4001     )
   4003 # make sure token embedding weights are still tied if needed
   4004 model.tie_weights()

File /usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:4529, in PreTrainedModel._load_pretrained_model(***failed resolving arguments***)
   4525     if "size mismatch" in error_msg:
   4526         error_msg += (
   4527             "\n\tYou may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method."
   4528         )
-> 4529     raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
   4531 if len(unexpected_keys) > 0:
   4532     archs = [] if model.config.architectures is None else model.config.architectures

RuntimeError: Error(s) in loading state_dict for LlavaForConditionalGeneration:
    size mismatch for language_model.model.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([4096, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).

Expected behavior

I would expect the model to load normally. Something is off in the dimensions. Is there perhaps another model version on HuggingFace Hub with the correct config? Many thanks.

P.S. I had to uninstall flash-attn; I assume it's just not supported, which would be worth adding to the docs.

molbap commented 2 weeks ago

Thanks for the issue @RonanKMcGovern, I can confirm. The thing to change is the Hub repository source: hf-internal-testing/pixtral-12b seems to trigger the mismatch. https://huggingface.co/mistral-community/pixtral-12b hosts a converted version of Pixtral that is transformers-compatible and should load without the mismatch. You are right that the code example should be updated. For example, this works with transformers:

from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Use the transformers-compatible conversion of Pixtral.
model_id = "mistral-community/pixtral-12b"
# Move the model to the GPU so it sits on the same device as the inputs below.
model = LlavaForConditionalGeneration.from_pretrained(model_id).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

IMG_URLS = [
    "https://picsum.photos/id/237/400/300",
    "https://picsum.photos/id/231/200/300",
    "https://picsum.photos/id/27/500/500",
    "https://picsum.photos/id/17/150/600",
]
# One [IMG] placeholder per image.
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
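
For anyone wondering where the 4096 vs 5120 in the original error might come from, here is a minimal sketch of the arithmetic. It is one plausible explanation, not something confirmed in this thread: it assumes the language model uses 32 attention heads with an explicit head dimension of 128 (both values are assumptions, neither is stated above), and that a config without an explicit head dimension falls back to hidden_size // num_attention_heads. The helper `q_proj_weight_shape` is hypothetical and written only for illustration; it is not a transformers API.

```python
from typing import Optional, Tuple


def q_proj_weight_shape(
    hidden_size: int,
    num_attention_heads: int,
    head_dim: Optional[int] = None,
) -> Tuple[int, int]:
    """Return the (out_features, in_features) shape of a q_proj weight.

    Hypothetical helper for illustration only, not part of transformers.
    """
    if head_dim is None:
        # Assumed fallback when the config carries no explicit head_dim.
        head_dim = hidden_size // num_attention_heads
    return (num_attention_heads * head_dim, hidden_size)


HIDDEN_SIZE = 5120  # taken from the shapes in the error message
NUM_HEADS = 32      # ASSUMPTION: not stated in this thread
HEAD_DIM = 128      # ASSUMPTION: not stated in this thread

# Shape the model would build if head_dim is derived from hidden_size.
shape_without_head_dim = q_proj_weight_shape(HIDDEN_SIZE, NUM_HEADS)
# Shape stored in a checkpoint trained with an explicit, smaller head_dim.
shape_with_head_dim = q_proj_weight_shape(HIDDEN_SIZE, NUM_HEADS, HEAD_DIM)

print("derived head_dim: ", shape_without_head_dim)
print("explicit head_dim:", shape_with_head_dim)
```

Under these assumptions the two shapes line up with the "shape in current model" and the "param from checkpoint" in the error, which would point to a config issue in the repository rather than corrupted weights.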

RonanKMcGovern commented 2 weeks ago

That’s excellent thanks. R

