Mismatched latent sizes in StableDiffusionXLInstructPix2PixPipeline with RGBA images

noskill commented 3 months ago

Describe the bug

There is a size mismatch between the latents and the latents reconstructed from the input image.

Traceback (most recent call last):
  File "/home/imgen/projects/metafusion/examples/ip2p.py", line 20, in <module>
    images = pipe(prompt, image=image,
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/imgen/miniconda3/envs/py31/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_instruct_pix2pix.py", line 901, in __call__
    scaled_latent_model_input = torch.cat([scaled_latent_model_input, image_latents], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 68 but got size 544 for tensor number 1 in the list.
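
The two sizes in the error look consistent with the image latents not having been downsampled at all. Assuming the usual VAE scale factor of 8 (an assumption, the traceback does not state it), a 550-pixel height maps to 68 latent rows, and 68 * 8 = 544 is the pixel height rounded down to a multiple of 8. A quick sketch of that arithmetic, using a hypothetical helper named `latent_and_pixel_size`:

```python
# Assumption: the VAE downsamples by a factor of 8 (the usual value for SD/SDXL).
VAE_SCALE_FACTOR = 8


def latent_and_pixel_size(pixels, scale=VAE_SCALE_FACTOR):
    """Return (latent size, pixel size rounded down to a multiple of `scale`)."""
    latent = pixels // scale
    return latent, latent * scale


# Height and width of the 634 x 550 input image.
print(latent_and_pixel_size(550))  # (68, 544)
print(latent_and_pixel_size(634))  # (79, 632)
```

So "expected size 68 but got size 544" would match latents at 68 rows being concatenated with an image tensor still at 544 pixel rows.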

Reproduction

input image is 634 x 550

import PIL.Image
import torch
from diffusers import EulerAncestralDiscreteScheduler
from diffusers import StableDiffusionXLInstructPix2PixPipeline

pipe = StableDiffusionXLInstructPix2PixPipeline.from_pretrained(
        "./sdxl-instructpix2pix-768", torch_dtype=torch.float16).to("cuda")
offload_device = 0
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
pipe.enable_sequential_cpu_offload(offload_device)
pipe.enable_xformers_memory_efficient_attention()
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
pipe.enable_attention_slicing()

image = PIL.Image.open('cr.png')

prompt = "make blue hair"
images = pipe(prompt, image=image,
        height=image.height, width=image.width, num_inference_steps=25, image_guidance_scale=1.5).images
images[0].save('cr_blue.png')

Logs

No response

System Info

diffusers 0.31.0.dev0

Who can help?

@yiyixuxu

noskill commented 3 months ago

The error only happens with RGBA images.

yiyixuxu commented 3 months ago

ohh can you share the image input you used?

noskill commented 3 months ago

for example this image: PNG image data, 634 x 550, 8-bit/color RGBA, non-interlaced
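
Since the error is reported only for RGBA input, one possible workaround on the caller's side is to drop the alpha channel before calling the pipeline. This is only a sketch, not the fix from the merged PR. `ensure_rgb` is a hypothetical helper name, and it relies only on the `mode` attribute and `convert` method that PIL images provide:

```python
def ensure_rgb(image):
    """Return `image` unchanged if it is already RGB, otherwise an RGB copy.

    Works with any object exposing PIL's `mode` attribute and `convert` method.
    Note: convert("RGB") discards the alpha channel; it does not composite
    the image onto a background.
    """
    if image.mode == "RGB":
        return image
    return image.convert("RGB")
```

In the reproduction script this would be applied as `image = ensure_rgb(PIL.Image.open('cr.png'))` before calling `pipe`. If transparency matters, compositing onto a chosen background color first may be preferable to discarding alpha.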

ighoshsubho commented 3 months ago

Hi, I want to work on resolving this issue

asomoza commented 2 months ago

Closing this since it was resolved with the merged PR.
