huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

encoder_hidden_states=None not working on unet_2d_condition #1802

Closed ghost closed 1 year ago

ghost commented 1 year ago

Describe the bug

Not sure if this is a bug or not.

In the Stable Diffusion pipeline, I want to pass encoder_hidden_states=None to get the true unconditional noise_pred, rather than using the text embedding of an empty string. However, this gives me the error: RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x320 and 768x320)

How should I fix this? How do I make the UNet skip cross-attention?
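
As far as I can tell from the traceback, the failing layer is a cross-attention key/value projection: those layers are built for the 768-dimensional text-encoder output, and when encoder_hidden_states is None they fall back to attending over the 320-channel latent features, hence the 4096x320 vs 768x320 mismatch. A rough illustration of what I mean (pipe is my loaded pipeline; the attribute path is from diffusers 0.11 and the sizes assume a SD 1.x checkpoint, so treat this as a sketch):

import torch

# First cross-attention layer of the first down block (path approximate for diffusers 0.11)
attn = pipe.unet.down_blocks[0].attentions[0].transformer_blocks[0].attn2
print(attn.to_k.weight.shape)        # torch.Size([320, 768]) -> expects a 768-dim context
hidden = torch.randn(1, 4096, 320)   # what attn2 falls back to when encoder_hidden_states is None
# attn.to_k(hidden)                  # raises: mat1 and mat2 shapes cannot be multiplied (4096x320 and 768x320)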

Reproduction

_, text_embeddings = text_embeddings.chunk(2)
# expand the latents if we are doing classifier free guidance
latent_model_input = latents
latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

# predict the noise residual
noise_pred_text = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

# perform guidance
if do_classifier_free_guidance:
    for downsample_block in self.unet.down_blocks:
        downsample_block.has_cross_attention = False
    for upsample_block in self.unet.up_blocks:
        upsample_block.has_cross_attention = False
    noise_pred_uncond = self.unet(latent_model_input, t, encoder_hidden_states=None).sample

    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample

Logs

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 9
      7 result = []
      8 pipe.text_encoder.config.use_attention_mask = False
----> 9 for image in pipe.txt2img(prompt=prompt, negative_prompt=negative_prompt, seed=seed, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale):
     10     result += image
     11 pipe.text_encoder.config.use_attention_mask = True

File c:\CODE\svelte-diffusion\venv\lib\site-packages\torch\autograd\grad_mode.py:43, in _DecoratorContextManager._wrap_generator.<locals>.generator_context(*args, **kwargs)
     40 try:
     41     # Issuing `None` to a generator fires it up
     42     with self.clone():
---> 43         response = gen.send(None)
     45     while True:
     46         try:
     47             # Forward the response to our caller and get its next request

File c:\CODE\svelte-diffusion\custom_pipe.py:192, in StableDiffusionGigaPipeline.txt2img(self, prompt, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, seed, latents, output_type, return_dict, callback, callback_steps, batch_size, unet_cross_attention)
    190     if batch != 0:
    191         generator = None
--> 192     yield txt2img(self,
    193         prompt=prompt,
    194         height=height,
    195         width=width,
...
File c:\CODE\svelte-diffusion\venv\lib\site-packages\torch\nn\modules\linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x320 and 768x320)


System Info

- `diffusers` version: 0.11.0
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.10.8
- PyTorch version (GPU?): 1.13.0+cu117 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 4.25.1
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
patrickvonplaten commented 1 year ago

Hey @FlameLaw,

The architecture of Stable Diffusion unfortunately requires you to pass encoder_hidden_states; otherwise the forward computation graph doesn't work.
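
For reference, a minimal sketch of the usual workaround, which mirrors what StableDiffusionPipeline itself does for classifier-free guidance: encode an empty prompt and pass that embedding instead of None (variable names here are illustrative):

# Build the "unconditional" context from an empty prompt and pass it to the UNet.
uncond_input = pipe.tokenizer(
    [""] * batch_size,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
uncond_embeddings = pipe.text_encoder(uncond_input.input_ids.to(pipe.device))[0]
noise_pred_uncond = pipe.unet(latent_model_input, t, encoder_hidden_states=uncond_embeddings).sample

Passing an all-zeros tensor of the same shape also avoids the shape error, but it is not equivalent to the empty-prompt embedding and will change the guidance result.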

ghost commented 1 year ago

Thanks for the reply!