huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Kandinsky 3.0 fails when passing in embeds instead of prompts #5963

Closed Vargol closed 11 months ago

Vargol commented 11 months ago

Describe the bug

Kandinsky 3.0 fails when you attempt to pass embeds rather than prompts.

Since the text model for K3.0 is so heavy, this is probably needed more than usual to reduce memory usage and improve speed; you really don't want to encode the text prompt multiple times if you can avoid it.

Traceback (most recent call last):
  File "/Volumes/SSD2TB/AI/Diffusers/k3.py", line 26, in <module>
    image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, num_inference_steps=25, generator=generator).images[0]
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/diffusers/pipelines/kandinsky3/kandinsky3_pipeline.py", line 366, in __call__
    prompt_embeds, negative_prompt_embeds, attention_mask, negative_attention_mask = self.encode_prompt(
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Volumes/SSD2TB/AI/Kandinsky3/lib/python3.10/site-packages/diffusers/pipelines/kandinsky3/kandinsky3_pipeline.py", line 153, in encode_prompt
    attention_mask = attention_mask.repeat(num_images_per_prompt, 1)
UnboundLocalError: local variable 'attention_mask' referenced before assignment
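
The traceback suggests a familiar pattern: a local variable that is only assigned inside the branch that encodes a prompt, and is then used unconditionally. The sketch below is a simplified stand-in and not the actual diffusers code; `encode_prompt_buggy`, `encode_prompt_guarded`, and the list-based "embeddings" and "mask" are hypothetical names made up for illustration, and the guard shown is only one possible way to handle it.

```python
def encode_prompt_buggy(prompt=None, prompt_embeds=None, num_images_per_prompt=1):
    """Hypothetical stand-in: the mask is only assigned when a prompt is encoded."""
    if prompt_embeds is None:
        tokens = prompt.split()
        prompt_embeds = [float(len(t)) for t in tokens]  # fake "embedding"
        attention_mask = [1] * len(tokens)
    # When prompt_embeds was passed in, attention_mask was never assigned,
    # so this line raises UnboundLocalError.
    attention_mask = attention_mask * num_images_per_prompt
    return prompt_embeds, attention_mask


def encode_prompt_guarded(prompt=None, prompt_embeds=None, attention_mask=None,
                          num_images_per_prompt=1):
    """One possible guard (an assumption): take the mask as an argument and validate it."""
    if prompt_embeds is None:
        tokens = prompt.split()
        prompt_embeds = [float(len(t)) for t in tokens]
        attention_mask = [1] * len(tokens)
    elif attention_mask is None:
        raise ValueError("attention_mask must be provided together with prompt_embeds")
    return prompt_embeds, attention_mask * num_images_per_prompt


# Encode once, keeping both the embeds and the mask.
embeds, mask = encode_prompt_guarded(prompt="raccoons on a subway")

# Passing only the embeds to the buggy variant reproduces the error type.
try:
    encode_prompt_buggy(prompt_embeds=embeds)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

# The guarded variant accepts precomputed embeds as long as the mask comes with them.
reused_embeds, reused_mask = encode_prompt_guarded(prompt_embeds=embeds, attention_mask=mask)
```

In the guarded variant the caller has to keep the mask returned by the first encode and pass it back in alongside the embeds.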

Reproduction

from diffusers import AutoPipelineForText2Image
import torch
import gc

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe = pipe.to('mps')

prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

prompt_embeds, negative_prompt_embeds, attention_mask, negative_attention_mask = pipe.encode_prompt(
             prompt,
             True,
             device=pipe.device
         )

#pipe.text_encoder = None
#pipe.tokenizer = None
#gc.collect()
#torch.mps.empty_cache()
#gc.collect()
#torch.mps.empty_cache()

generator = torch.Generator(device="cpu").manual_seed(42)
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, num_inference_steps=25, generator=generator).images[0]

# `image` is already a single image (`.images[0]` above), so save it directly
image.save('k3.png')

System Info

Who can help?

No response

Vargol commented 11 months ago

Hi, tested the new merge and it solves the issue as expected. I can now chuck the text encoder out of memory and get a big speed up even when not looping the calls to the pipe.
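
The encode-once workflow described here can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyPipeline`, its `encoder_calls` counter, and the numeric "image" are hypothetical and do not mirror the Kandinsky 3 pipeline's real signatures; the point is just that once the embeds exist, the encoder can be released and the embeds reused for several calls.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline with an expensive text encoder."""

    def __init__(self):
        self.encoder_calls = 0
        self.text_encoder = self._encode  # pretend this holds a large model

    def _encode(self, prompt):
        self.encoder_calls += 1
        return [float(ord(c)) for c in prompt]  # fake "embedding"

    def encode_prompt(self, prompt):
        if self.text_encoder is None:
            raise RuntimeError("text encoder has been released")
        return self.text_encoder(prompt)

    def __call__(self, prompt=None, prompt_embeds=None, seed=0):
        # Only touch the encoder when no precomputed embeds are supplied.
        if prompt_embeds is None:
            prompt_embeds = self.encode_prompt(prompt)
        return sum(prompt_embeds) + seed  # fake "image"


pipe = ToyPipeline()
prompt_embeds = pipe.encode_prompt("raccoons")  # the only encoder call
pipe.text_encoder = None                        # release the heavy component

# Reuse the same embeds for several generations without the encoder.
images = [pipe(prompt_embeds=prompt_embeds, seed=s) for s in range(3)]
```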