In the Stable Diffusion pipeline, I want to pass None as encoder_hidden_states to get a truly unconditional noise_pred, rather than conditioning on the text embedding of an empty string. However, this raises the following error:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x320 and 768x320)
How should I fix this? How do I make the UNet skip cross-attention?
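For context, the shapes in the error are consistent with a key/value projection built for 768-wide text embeddings being fed 320-wide latent features instead. The following is only a minimal sketch of that mismatch under assumed shapes (768 for the text encoder width, 320 for the UNet channels, 4096 = 64 * 64 latent positions); `to_k` here is a stand-in layer, not the actual diffusers module.

```python
import torch
from torch import nn

# Stand-in for a cross-attention key projection (assumed shapes, see above):
# it maps 768-wide text tokens to the 320-wide attention space.
to_k = nn.Linear(768, 320)

text_tokens = torch.randn(77, 768)     # the input width the projection expects
image_tokens = torch.randn(4096, 320)  # latent features, which have the wrong width

# Works: (77, 768) @ (768, 320)
print(to_k(text_tokens).shape)

# Fails: (4096, 320) cannot be multiplied with (768, 320)
error = None
try:
    to_k(image_tokens)
except RuntimeError as exc:
    error = exc
print(error)
```

If the attention layer substitutes the latent features when encoder_hidden_states is None (an assumption based on the shapes in the error, not something I have confirmed in the 0.11.0 source), this would be the failing call.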
### Reproduction
_, text_embeddings = text_embeddings.chunk(2)
# no batch expansion here: the conditional and unconditional passes run separately
latent_model_input = latents
latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
# predict the noise residual
noise_pred_text = self.unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
# perform guidance
if do_classifier_free_guidance:
    for downsample_block in self.unet.down_blocks:
        downsample_block.has_cross_attention = False
    for upsample_block in self.unet.up_blocks:
        upsample_block.has_cross_attention = False
    noise_pred_uncond = self.unet(latent_model_input, t, encoder_hidden_states=None).sample
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
# compute the previous noisy sample x_t -> x_t-1
latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
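To make the question concrete, this is the behaviour I am after, sketched on a toy block rather than on the real UNet. `ToyBlock` is a hypothetical module written only for illustration; it is not how diffusers implements its attention blocks. It runs self-attention always and applies cross-attention only when a context is given.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Hypothetical block: self-attention, then optional cross-attention."""

    def __init__(self, dim=320, context_dim=768, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # kdim/vdim allow keys and values to come from a different width
        # (here: the assumed 768-wide text embeddings).
        self.cross_attn = nn.MultiheadAttention(
            dim, heads, kdim=context_dim, vdim=context_dim, batch_first=True
        )

    def forward(self, x, context=None):
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        if context is not None:
            # Cross-attention runs only when conditioning is provided.
            x = x + self.cross_attn(x, context, context, need_weights=False)[0]
        return x


block = ToyBlock()
latent_tokens = torch.randn(1, 64, 320)  # small token count to keep the example fast
text_tokens = torch.randn(1, 77, 768)

out_cond = block(latent_tokens, text_tokens)  # conditional pass
out_uncond = block(latent_tokens, None)       # cross-attention skipped
print(out_cond.shape, out_uncond.shape)
```

Even if the real blocks could be made to skip the layer like this, the pretrained weights may never have been run without their cross-attention layers, so the resulting prediction might not be meaningful.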
### Logs
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[3], line 9
7 result = []
8 pipe.text_encoder.config.use_attention_mask = False
----> 9 for image in pipe.txt2img(prompt=prompt, negative_prompt=negative_prompt, seed=seed, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale):
10 result += image
11 pipe.text_encoder.config.use_attention_mask = True
File c:\CODE\svelte-diffusion\venv\lib\site-packages\torch\autograd\grad_mode.py:43, in _DecoratorContextManager._wrap_generator.<locals>.generator_context(*args, **kwargs)
40 try:
41 # Issuing `None` to a generator fires it up
42 with self.clone():
---> 43 response = gen.send(None)
45 while True:
46 try:
47 # Forward the response to our caller and get its next request
File c:\CODE\svelte-diffusion\custom_pipe.py:192, in StableDiffusionGigaPipeline.txt2img(self, prompt, height, width, num_inference_steps, guidance_scale, negative_prompt, num_images_per_prompt, eta, seed, latents, output_type, return_dict, callback, callback_steps, batch_size, unet_cross_attention)
190 if batch != 0:
191 generator = None
--> 192 yield txt2img(self,
193 prompt=prompt,
194 height=height,
195 width=width,
...
File c:\CODE\svelte-diffusion\venv\lib\site-packages\torch\nn\modules\linear.py:114, in Linear.forward(self, input)
113 def forward(self, input: Tensor) -> Tensor:
--> 114 return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4096x320 and 768x320)
### System Info
- `diffusers` version: 0.11.0
- Platform: Windows-10-10.0.19045-SP0
- Python version: 3.10.8
- PyTorch version (GPU?): 1.13.0+cu117 (True)
- Huggingface_hub version: 0.11.1
- Transformers version: 4.25.1
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
### Describe the bug
Not sure if this is a bug or not.