Kandinsky 3.0 fails when passing in embeds instead of prompts

Describe the bug

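Kandinsky 3.0 fails when you pass precomputed embeds (prompt_embeds and negative_prompt_embeds) to the pipeline instead of a text prompt. The call raises an UnboundLocalError inside encode_prompt.

Because the text model for K3.0 is so heavy, passing embeds is probably needed more than usual here, both to reduce memory usage and to save time: you really don't want to encode the text prompt multiple times if you can avoid it.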
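
To illustrate the "encode once, reuse many times" pattern this report is about, here is a minimal sketch of a prompt-embedding cache. The class name `PromptEmbedCache` and the `encode_fn` parameter are hypothetical names chosen for this example and are not part of diffusers; `encode_fn` can be any callable that turns a prompt string into whatever the pipeline needs.

```python
from typing import Any, Callable, Dict


class PromptEmbedCache:
    """Hypothetical helper: run an expensive text encoder at most once per prompt."""

    def __init__(self, encode_fn: Callable[[str], Any]) -> None:
        # encode_fn is an assumption: any callable mapping a prompt to its embeds.
        self._encode_fn = encode_fn
        self._cache: Dict[str, Any] = {}

    def get(self, prompt: str) -> Any:
        # Encode only on the first request; later requests reuse the stored result.
        if prompt not in self._cache:
            self._cache[prompt] = self._encode_fn(prompt)
        return self._cache[prompt]


# Possible use with the pipeline from the reproduction (not executed here):
# cache = PromptEmbedCache(lambda p: pipe.encode_prompt(p, True, device=pipe.device))
# prompt_embeds, negative_prompt_embeds, attention_mask, negative_attention_mask = cache.get(prompt)
```

With a cache like this, the text encoder only has to stay in memory until the prompts of interest have been encoded, which is the scenario the commented-out cleanup lines in the reproduction are aiming at.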

Reproduction

from diffusers import AutoPipelineForText2Image
import torch
import gc

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe = pipe.to('mps')

prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

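# Encode the prompt once up front. The second positional argument (True) is kept
# exactly as in the original call; it requests the negative embeds as well.
prompt_embeds, negative_prompt_embeds, attention_mask, negative_attention_mask = pipe.encode_prompt(
    prompt,
    True,
    device=pipe.device,
)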

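# Optional cleanup once the embeds exist: drop the text encoder and free memory.
# Left commented out; the failure occurs with or without it.
#pipe.text_encoder = None
#pipe.tokenizer = None
#gc.collect()
#torch.mps.empty_cache()
#gc.collect()
#torch.mps.empty_cache()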

generator = torch.Generator(device="cpu").manual_seed(42)
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, num_inference_steps=25, generator=generator).images[0]

# `image` is already the first image from `.images[0]`, so save it directly.
image.save('k3.png')

Logs

Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/Diffusers/k3.py", line 26, in <module>
    image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, num_inference_steps=25, generator=generator).images[0]
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/diffusers/pipelines/kandinsky3/kandinsky3_pipeline.py", line 366, in __call__
    prompt_embeds, negative_prompt_embeds, attention_mask, negative_attention_mask = self.encode_prompt(
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/diffusers/pipelines/kandinsky3/kandinsky3_pipeline.py", line 153, in encode_prompt
    attention_mask = attention_mask.repeat(num_images_per_prompt, 1)
UnboundLocalError: local variable 'attention_mask' referenced before assignment
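
The traceback suggests that `attention_mask` is only assigned on the code path that encodes a text prompt, so it is never bound when embeds are supplied. The following is a simplified, self-contained illustration of that pattern using plain Python lists; it is not the actual diffusers source. The function names and the toy "encoding" are invented for this example, and the fixed variant is only one possible way to handle the case, not necessarily the one the library adopts.

```python
from typing import List, Optional, Tuple


def encode_prompt_buggy(
    prompt: Optional[str] = None,
    prompt_embeds: Optional[List[float]] = None,
    num_images_per_prompt: int = 1,
) -> Tuple[List[float], List[int]]:
    if prompt_embeds is None:
        # Toy stand-in for tokenizing and encoding the prompt.
        tokens = prompt.split()
        prompt_embeds = [float(len(t)) for t in tokens]
        attention_mask = [1] * len(tokens)
    # If prompt_embeds was supplied, attention_mask was never assigned,
    # so this line raises UnboundLocalError.
    attention_mask = attention_mask * num_images_per_prompt
    return prompt_embeds, attention_mask


def encode_prompt_fixed(
    prompt: Optional[str] = None,
    prompt_embeds: Optional[List[float]] = None,
    attention_mask: Optional[List[int]] = None,
    num_images_per_prompt: int = 1,
) -> Tuple[List[float], List[int]]:
    if prompt_embeds is None:
        tokens = prompt.split()
        prompt_embeds = [float(len(t)) for t in tokens]
        attention_mask = [1] * len(tokens)
    elif attention_mask is None:
        # Assumed policy for this sketch: embeds must come with their mask.
        raise ValueError("attention_mask must be passed together with prompt_embeds")
    # List repetition stands in for tensor.repeat in this toy version.
    attention_mask = attention_mask * num_images_per_prompt
    return prompt_embeds, attention_mask
```

Calling `encode_prompt_buggy(prompt_embeds=[1.0, 2.0])` fails with the same exception type as in the log above, while `encode_prompt_fixed` either uses the mask passed alongside the embeds or reports clearly that it is missing.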

System Info

diffusers version: 0.24.0.dev0
Platform: macOS-14.1.1-arm64-arm-64bit
Python version: 3.10.13
PyTorch version (GPU?): 2.1.1 (False)
Huggingface_hub version: 0.19.4
Transformers version: 4.35.2
Accelerate version: 0.24.1
xFormers version: not installed
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help?

No response

huggingface / diffusers

Kandinsky 3.0 fails when passing in embeds instead of prompts #5963