AttributeError: 'tuple' object has no attribute 'shape' while using IP-Adapter with StableDiffusionControlNetInpaintPipeline

satvik-pyxer commented 1 month ago

Describe the bug

    image = pipe(
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py", line 1421, in __call__
    noise_pred = self.unet(
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1216, in forward
    sample, res_samples = downsample_block(
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 1288, in forward
    hidden_states = attn(
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/diffusers/models/transformers/transformer_2d.py", line 442, in forward
    hidden_states = block(
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/diffusers/models/attention.py", line 504, in forward
    attn_output = self.attn2(
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/env/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 490, in forward
    return self.processor(
  File "/home/ubuntu/env/lib/python3.10/site-packages/diffusers/models/attention_processor.py", line 2125, in __call__
    hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
AttributeError: 'tuple' object has no attribute 'shape'

StableDiffusionControlNetInpaintPipeline works without IP-Adapter, but fails with the error above as soon as I add it.

Reproduction

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetInpaintPipeline, UniPCMultistepScheduler

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "stablediffusionapi/realistic-vision-v6.0-b1-inpaint",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_xformers_memory_efficient_attention()
pipe = pipe.to("cuda")

# inp_image, mask_image, canny_image and ip_adapter_image are all PIL images.
# prompt, negative_prompt and generator are not shown here.

image = pipe(
    generator=generator,
    prompt=prompt,
    negative_prompt=negative_prompt,
    controlnet_conditioning_scale=0.4,
    image=inp_image,
    guidance_scale=10,
    mask_image=mask_image,
    control_image=canny_image,
    num_inference_steps=30,
    ip_adapter_image=ip_adapter_image,
).images[0]

Logs

No response

System Info

diffusers: 0.30.2 (also tried installing from source, but that doesn't work either)
torch: 2.4.1

Who can help?

@sayakpaul @yiyixuxu @asomoza @DN

sayakpaul commented 1 month ago

I think it fails because we're using IP-Adapter here, and for that we have dedicated attention processor classes:

https://github.com/huggingface/diffusers/blob/8fcfb2a456e5c35d6d532faccf4859d303c22501/src/diffusers/models/attention_processor.py#L3807 (has SDPA, equivalent to using xformers for inference and is used by default in PT 2.0)
https://github.com/huggingface/diffusers/blob/8fcfb2a456e5c35d6d532faccf4859d303c22501/src/diffusers/models/attention_processor.py#L3609

Long story short, we should not call enable_xformers_memory_efficient_attention() here.

But @yiyixuxu @asomoza do feel free to correct me if my understanding is wrong.
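
For readers who want to see why the traceback ends in a tuple, here is a minimal pure-Python sketch of the suspected mechanism. It assumes that, once an IP-Adapter is loaded, the cross-attention conditioning reaches the processor as a (text embeddings, image embeddings) tuple, and that only the IP-Adapter processors unpack it. FakeTensor, plain_processor, ip_adapter_aware_processor and all shapes are hypothetical stand-ins for illustration, not diffusers code.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor: it only carries a shape."""

    def __init__(self, *shape):
        self.shape = shape


def plain_processor(hidden_states, encoder_hidden_states=None):
    # Same pattern as the failing line in the traceback: the conditioning
    # is assumed to be a single tensor-like object with a .shape attribute.
    source = hidden_states if encoder_hidden_states is None else encoder_hidden_states
    batch_size, sequence_length, _ = source.shape
    return batch_size, sequence_length


def ip_adapter_aware_processor(hidden_states, encoder_hidden_states=None):
    # Assumed behaviour of an IP-Adapter-aware processor: split the tuple
    # into text and image conditioning before anything reads .shape.
    if isinstance(encoder_hidden_states, tuple):
        encoder_hidden_states, _ip_hidden_states = encoder_hidden_states
    return plain_processor(hidden_states, encoder_hidden_states)


# Made-up shapes, chosen only to make the example concrete.
latents = FakeTensor(2, 4096, 320)
text_embeds = FakeTensor(2, 77, 768)
image_embeds = [FakeTensor(2, 1, 4, 768)]
conditioning = (text_embeds, image_embeds)

# The tuple-aware processor handles the combined conditioning.
print(ip_adapter_aware_processor(latents, conditioning))

# The plain processor reads .shape on the tuple itself and fails.
try:
    plain_processor(latents, conditioning)
except AttributeError as err:
    print(err)
```

If this reading is right, calling enable_xformers_memory_efficient_attention() after load_ip_adapter() swaps the tuple-aware processors for ones that behave like plain_processor, which would explain the error.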

yiyixuxu commented 1 month ago

Yeah, agreed. We do not support xformers + IP-Adapter yet.

satvik-pyxer commented 1 month ago

Thanks @yiyixuxu @sayakpaul ! Confirming it works now.
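
For anyone hitting the same error, one way to avoid it is to skip the xformers call whenever IP-Adapter processors are installed. The helper below is only a sketch: uses_ip_adapter and enable_xformers_if_safe are hypothetical names, and detecting IP-Adapter processors by an "IPAdapter" class-name prefix is an assumption about how the processor classes are named, so check it against your installed diffusers version.

```python
def uses_ip_adapter(unet):
    """Return True if any attention processor on the UNet looks like an IP-Adapter one."""
    # Assumption: IP-Adapter processor class names start with "IPAdapter".
    return any(
        type(processor).__name__.startswith("IPAdapter")
        for processor in unet.attn_processors.values()
    )


def enable_xformers_if_safe(pipe):
    """Enable xformers attention only when no IP-Adapter processors are installed.

    Returns True if xformers was enabled, False if it was skipped.
    """
    if uses_ip_adapter(pipe.unet):
        # Leave the IP-Adapter processors in place; on PyTorch 2.x they
        # already use SDPA, as noted earlier in the thread.
        return False
    pipe.enable_xformers_memory_efficient_attention()
    return True
```

In the reproduction above, this would replace the direct pipe.enable_xformers_memory_efficient_attention() call, placed after pipe.load_ip_adapter(...), so the check sees the processors that are actually installed.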
