huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

IP_Adapters shape mismatch when generating images on v0.25.0_dev using SDXL? #6162

Closed salahzoubi closed 9 months ago

salahzoubi commented 11 months ago

Describe the bug

When using IP-Adapters with ControlNets and SDXL (whether sdxl-turbo or SDXL 1.0), you get a shape mismatch when generating images. If you remove the IP-Adapter, things start working again. I'm not sure what the problem might be here.

Reproduction

Here's what I'm doing:


from diffusers import DiffusionPipeline, StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler, AutoencoderTiny, ControlNetModel
from diffusers.utils import load_image  # needed for load_image() below
from transformers import CLIPVisionModelWithProjection  # only needed if the commented-out image_encoder line below is enabled
import torch
from PIL import Image

net_id = "diffusers/controlnet-canny-sdxl-1.0"
controlnet = ControlNetModel.from_pretrained(net_id, torch_dtype=torch.float16)

#stabilityai/sdxl-turbo
vae = AutoencoderTiny.from_pretrained("madebyollin/taesdxl", torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained("stabilityai/sdxl-turbo", vae=vae, torch_dtype=torch.float16, controlnet=controlnet)
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
# pipe.image_encoder = CLIPVisionModelWithProjection.from_pretrained("image_encoder_xl/")

pipe = pipe.to("cuda")

control_image = load_image("1.png")
ip_image = load_image("2.png")
prompt = "person having fun"

images = pipe(
    prompt=prompt, 
    image=control_image,
    ip_adapter_image=ip_image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
    num_inference_steps=4,
).images[0]

Logs

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[25], line 1
----> 1 images = pipe(
      2     prompt='cute anime girl smiling, girl smile, laughing, cute', 
      3     image=control_image,
      4     ip_adapter_image=ip_image,
      5     negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality", 
      6     num_inference_steps=4,
      7 ).images[0]

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/diffusers/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py:1208, in StableDiffusionXLPipeline.__call__(self, prompt, prompt_2, height, width, num_inference_steps, timesteps, denoising_end, guidance_scale, negative_prompt, negative_prompt_2, num_images_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds, ip_adapter_image, output_type, return_dict, cross_attention_kwargs, guidance_rescale, original_size, crops_coords_top_left, target_size, negative_original_size, negative_crops_coords_top_left, negative_target_size, clip_skip, callback_on_step_end, callback_on_step_end_tensor_inputs, **kwargs)
   1206 if ip_adapter_image is not None:
   1207     added_cond_kwargs["image_embeds"] = image_embeds
-> 1208 noise_pred = self.unet(
   1209     latent_model_input,
   1210     t,
   1211     encoder_hidden_states=prompt_embeds,
   1212     timestep_cond=timestep_cond,
   1213     cross_attention_kwargs=self.cross_attention_kwargs,
   1214     added_cond_kwargs=added_cond_kwargs,
   1215     return_dict=False,
   1216 )[0]
   1218 # perform guidance
   1219 if self.do_classifier_free_guidance:

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/diffusers/src/diffusers/models/unet_2d_condition.py:1068, in UNet2DConditionModel.forward(self, sample, timestep, encoder_hidden_states, class_labels, timestep_cond, attention_mask, cross_attention_kwargs, added_cond_kwargs, down_block_additional_residuals, mid_block_additional_residual, down_intrablock_additional_residuals, encoder_attention_mask, return_dict)
   1064         raise ValueError(
   1065             f"{self.__class__} has the config param `encoder_hid_dim_type` set to 'ip_image_proj' which requires the keyword argument `image_embeds` to be passed in  `added_conditions`"
   1066         )
   1067     image_embeds = added_cond_kwargs.get("image_embeds")
-> 1068     image_embeds = self.encoder_hid_proj(image_embeds).to(encoder_hidden_states.dtype)
   1069     encoder_hidden_states = torch.cat([encoder_hidden_states, image_embeds], dim=1)
   1071 # 2. pre-process

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/diffusers/src/diffusers/models/embeddings.py:881, in Resampler.forward(self, x)
    869 """Forward pass.
    870 
    871 Args:
   (...)
    877     torch.Tensor: Output Tensor.
    878 """
    879 latents = self.latents.repeat(x.size(0), 1, 1)
--> 881 x = self.proj_in(x)
    883 for ln0, ln1, attn, ff in self.layers:
    884     residual = latents

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/miniconda3/envs/sd_diff/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (514x1664 and 1280x1280)

System Info

Who can help?

No response

asomoza commented 11 months ago

I was scratching my head over this issue a couple of days ago too. For it to work you have to use the normal image encoder, not the one in the sdxl_models folder. I guess no one noticed it since most people just copy and paste the example.

I don't know how the two image encoders compare in quality, but the expected shape comes directly from the model, so I couldn't use the image encoder in the sdxl_models subdirectory without digging deeper into the model itself.
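Concretely, the "normal" image encoder is the one under models/image_encoder in the h94/IP-Adapter repo. Loading it looks roughly like this (a sketch only; wiring it into the pipeline is covered further down the thread):

from transformers import CLIPVisionModelWithProjection
import torch

# ViT-H image encoder from the "models" subfolder (not "sdxl_models")
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
)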

yiyixuxu commented 10 months ago

hi:

Thanks for the issue! Can you provide the image inputs you used so I can try to reproduce this on my end?

thanks!

YiYi

cjt222 commented 10 months ago

@yiyixuxu you can try ip-adapter-plus_sdxl_vit-h.safetensors; it causes the same problem.

asomoza commented 10 months ago

I have been testing this more (I was just starting to add IP-Adapters to my code when I commented), and I think this is not a problem in the code; it just needs to be clarified in the documentation.

ip-adapter_sdxl.safetensors is the only one of the SDXL IP-Adapters that uses OpenCLIP-ViT-bigG-14, hence it needs the bigger image encoder. For the rest of them they switched to OpenCLIP-ViT-H-14, so those adapters can use the smaller image encoder.

If people are not aware of this and try to use all of them, they will eventually hit this error, even though most won't use the larger one anyway, since it requires another large download and there is no visual difference in the results.
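As a quick reference, this is the adapter-to-encoder mapping implied above (it follows the h94/IP-Adapter model card; the subfolder paths refer to that repo's layout):

# Which image-encoder subfolder of h94/IP-Adapter each SDXL adapter expects
SDXL_ADAPTER_TO_ENCODER = {
    "ip-adapter_sdxl.safetensors": "sdxl_models/image_encoder",             # OpenCLIP-ViT-bigG-14
    "ip-adapter_sdxl_vit-h.safetensors": "models/image_encoder",            # OpenCLIP-ViT-H-14
    "ip-adapter-plus_sdxl_vit-h.safetensors": "models/image_encoder",       # OpenCLIP-ViT-H-14
    "ip-adapter-plus-face_sdxl_vit-h.safetensors": "models/image_encoder",  # OpenCLIP-ViT-H-14
}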

cjt222 commented 10 months ago

> I have been testing this more (I was just starting to add IP-Adapters to my code when I commented), and I think this is not a problem in the code; it just needs to be clarified in the documentation.
>
> ip-adapter_sdxl.safetensors is the only one of the SDXL IP-Adapters that uses OpenCLIP-ViT-bigG-14, hence it needs the bigger image encoder. For the rest of them they switched to OpenCLIP-ViT-H-14, so those adapters can use the smaller image encoder.
>
> If people are not aware of this and try to use all of them, they will eventually hit this error, even though most won't use the larger one anyway, since it requires another large download and there is no visual difference in the results.

I tested the SDXL IP-Adapter models following your method, but I found that the performance was very poor compared to SD 1.5.

asomoza commented 10 months ago

Just to clarify, this is not my method; it's how the models were trained and how they were implemented by the awesome contributors to the diffusers library. It's also the same in all the UIs and APIs that use them.

I can't really compare the quality against SD 1.5 since I don't use it anymore, but I get good results with the demo images from diffusers and the official ones:

[Diffusers demo image: source vs. ip-adapter_sdxl_vit-h result]
[Official demo image: source vs. ip-adapter_sdxl_vit-h and ip-adapter-plus_sdxl_vit-h results]

I don't use the base model though; I always select the model that works best with the source image.

BEpresent commented 10 months ago

> Just to clarify, this is not my method; it's how the models were trained and how they were implemented by the awesome contributors to the diffusers library. It's also the same in all the UIs and APIs that use them.
>
> I can't really compare the quality against SD 1.5 since I don't use it anymore, but I get good results with the demo images from diffusers and the official ones. [...] I don't use the base model though; I always select the model that works best with the source image.

How do you load ip-adapter_sdxl_vit-h and ip-adapter-plus_sdxl_vit-h with SDXL?

The only one that works for me is ip-adapter_sdxl (the one from the docs):

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.safetensors")

This one gives me the shape error:

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")

asomoza commented 10 months ago

@BEpresent that's exactly the problem discussed here and the misunderstanding that isn't explained in the docs: to use any of the vit-h models you have to use the image encoder in the "models" subfolder, not the one in "sdxl_models", so you need to load it and pass it to the pipeline.

image_encoder = CLIPVisionModelWithProjection.from_pretrained("h94/IP-Adapter",  subfolder="models/image_encoder", torch_dtype=torch.float16,).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")

edit: I don't use the diffusers pipeline myself, so I forgot to add the image_encoder loading part.
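For completeness, "pass it to the pipeline" would look something like this (a sketch; passing image_encoder to from_pretrained is the approach that ends up working later in the thread):

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,  # attach the ViT-H encoder to the pipeline itself
    torch_dtype=torch.float16,
).to("cuda")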

BEpresent commented 10 months ago

> @BEpresent that's exactly the problem discussed here and the misunderstanding that isn't explained in the docs: to use any of the vit-h models you have to use the image encoder in the "models" subfolder, not the one in "sdxl_models", so you need to load it and pass it to the pipeline.
>
> image_encoder = CLIPVisionModelWithProjection.from_pretrained("h94/IP-Adapter",  subfolder="models/image_encoder", torch_dtype=torch.float16,).to("cuda")
> pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
>
> edit: I don't use the diffusers pipeline myself, so I forgot to add the image_encoder loading part.

Thanks, is it recommended to use CLIPVisionModelWithProjection from transformers or from diffusers (the latter does not work for me)?

# import works
from transformers import CLIPVisionModelWithProjection
# import does not work
from diffusers import  CLIPVisionModelWithProjection
ImportError                               Traceback (most recent call last)
Cell In[10], line 1
----> 1 from diffusers import  CLIPVisionModelWithProjection

ImportError: cannot import name 'CLIPVisionModelWithProjection' from 'diffusers' (/opt/conda/lib/python3.10/site-packages/diffusers/__init__.py)

Now, continuing with the transformers variant for the moment, I assume one has to pass the image encoder to the diffusers pipeline:

image_encoder = CLIPVisionModelWithProjection.from_pretrained("h94/IP-Adapter",  subfolder="models/image_encoder", torch_dtype=torch.float16,).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors", image_encoder=image_encoder)

However, this also leads to a shape mismatch:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (514x1664 and 1280x1280)

asomoza commented 10 months ago

diffusers doesn't have a CLIPVisionModelWithProjection; the import should be from the transformers library.

Your problem is that you're passing the image encoder to the IP-Adapter loader, but you have to pass it to the pipeline. Also, I made a repository with just the vit-h models and the matching encoder to make it simpler; you can use it directly if that makes things easier for you:

pipeline.load_ip_adapter("ozzygt/sdxl-ip-adapter", "", weight_name="ip-adapter_sdxl_vit-h.safetensors")
BEpresent commented 10 months ago

> diffusers doesn't have a CLIPVisionModelWithProjection; the import should be from the transformers library.

Then the docs indeed need to be updated (https://huggingface.co/docs/diffusers/using-diffusers/loading_adapters), as they import it from diffusers there.

> Your problem is that you're passing the image encoder to the IP-Adapter loader, but you have to pass it to the pipeline.

Thanks, you're right, it works when the image encoder is passed to the initial pipeline, not to pipeline.load_ip_adapter.

This solves it for me - I guess the docs could highlight the need to switch the image encoder between OpenCLIP-ViT-bigG-14 and OpenCLIP-ViT-H-14 when using SDXL with different IP-Adapters. While it is mentioned in the model card (https://huggingface.co/h94/IP-Adapter), it was not intuitively clear to me just from the diffusers docs.
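For anyone landing here later, here is a minimal end-to-end sketch of the setup this thread converges on (untested as written; the model IDs are the ones used above and the image path is a placeholder): load the ViT-H encoder from the models subfolder, pass it to the pipeline constructor, then load a vit-h adapter.

import torch
from transformers import CLIPVisionModelWithProjection
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

# ViT-H image encoder that matches the *_vit-h adapters
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)

# Pass the encoder to the pipeline itself, not to load_ip_adapter
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,
    torch_dtype=torch.float16,
).to("cuda")

pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors"
)

ip_image = load_image("2.png")  # placeholder reference image
image = pipe(
    prompt="person having fun",
    ip_adapter_image=ip_image,
    num_inference_steps=30,
).images[0]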

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.