huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

VAE Tiling not supported with SD3 for non power of 2 images? #8788

Open Teriks opened 4 days ago

Teriks commented 4 days ago

Describe the bug

VAE tiling works for SD3 when the image dimensions are powers of 2, but fails for every other alignment.

The tiling failures stem from the SD3 vae/config.json containing:

"use_post_quant_conv": false,
"use_quant_conv": false

which causes the `quant_conv` module used here:

https://github.com/huggingface/diffusers/blob/589931ca791deb8f896ee291ee481070755faa26/src/diffusers/models/autoencoders/autoencoder_kl.py#L363

and the `post_quant_conv` module used here:

https://github.com/huggingface/diffusers/blob/589931ca791deb8f896ee291ee481070755faa26/src/diffusers/models/autoencoders/autoencoder_kl.py#L412

to be `None`.

Perhaps, at the moment, the model is simply not fully compatible with the tiling implementation in AutoencoderKL, since its state dict lacks the keys `quant_conv.weight`, `quant_conv.bias`, `post_quant_conv.weight`, and `post_quant_conv.bias`.
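For illustration only, here is a minimal stand-in showing the kind of None-guard that would avoid the TypeError when the convs are disabled by config. This is not the actual diffusers code; `TinyTiledEncoder` and `process_tile` are invented names, and the lambda merely stands in for a real conv:

```python
class TinyTiledEncoder:
    """Hypothetical stand-in for AutoencoderKL's tiled encode path."""

    def __init__(self, use_quant_conv: bool):
        # Mirrors `"use_quant_conv": false` in SD3's vae/config.json:
        # when the flag is off, the module is simply absent (None).
        self.quant_conv = (lambda tile: [x * 2 for x in tile]) if use_quant_conv else None

    def process_tile(self, tile):
        # Guarding against None avoids
        # `TypeError: 'NoneType' object is not callable`.
        if self.quant_conv is not None:
            tile = self.quant_conv(tile)
        return tile
```

With `use_quant_conv=False`, the tile simply passes through unchanged instead of hitting a call on `None`.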

Is this intended?

Reproduction

import os

import PIL.Image

import diffusers

os.environ['HF_TOKEN'] = 'your token'

cn = diffusers.SD3ControlNetModel.from_pretrained('InstantX/SD3-Controlnet-Canny')

pipe = diffusers.StableDiffusion3ControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    controlnet=cn)

pipe.enable_sequential_cpu_offload()

pipe.vae.enable_tiling()

width = 1376
height = 920

# aligned by 16, but alignment by 64 also fails
output_size = (width-(width % 16), height-(height % 16))

not_pow_2 = PIL.Image.new('RGB', output_size)

args = {
    'guidance_scale': 8.0,
    'num_inference_steps': 30,
    'width': output_size[0],
    'height': output_size[1],
    'control_image': not_pow_2,
    'prompt': 'test prompt'
}

pipe(**args)
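As a sanity check on the reproduction itself, the size rounding can be verified in isolation with plain Python (no diffusers needed; `align_down` is a helper invented here to match the expression in the script):

```python
def align_down(value: int, multiple: int) -> int:
    """Round `value` down to the nearest multiple of `multiple`."""
    return value - (value % multiple)

width, height = 1376, 920
output_size = (align_down(width, 16), align_down(height, 16))
print(output_size)  # (1376, 912): aligned to 16, but neither dimension is a power of 2
```

So both requested dimensions are multiples of 16 (and of 8), confirming the failure is not an alignment mistake in the script.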

Logs

REDACT\venv\Lib\site-packages\diffusers\models\attention_processor.py:1584: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at ..\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  hidden_states = F.scaled_dot_product_attention(
Traceback (most recent call last):
  File "REDACT\test.py", line 35, in <module>
    pipe(**args)
  File "REDACT\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\pipelines\controlnet_sd3\pipeline_stable_diffusion_3_controlnet.py", line 912, in __call__
    control_image = self.vae.encode(control_image).latent_dist.sample()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\utils\accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 258, in encode
    return self.tiled_encode(x, return_dict=return_dict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "REDACT\venv\Lib\site-packages\diffusers\models\autoencoders\autoencoder_kl.py", line 363, in tiled_encode
    tile = self.quant_conv(tile)
           ^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

System Info

Windows

diffusers 0.29.2

Who can help?

@yiyixuxu @sayakpaul @DN6 @asomoza

DN6 commented 4 days ago

@Teriks Thanks for flagging. Opened a PR to add tiling: https://github.com/huggingface/diffusers/pull/8791