max_sequence_length=512 doesn't have any effect on SD3 # of tokens

tin2tin commented 2 months ago

Describe the bug

Reproduction

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. This imaginative creature features the distinctive, bulky body of a hippo, but with a texture and appearance resembling a golden-brown, crispy waffle. The creature might have elements like waffle squares across its skin and a syrup-like sheen. It’s set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. The image should evoke a sense of playful absurdity and culinary fantasy."

image = pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save('sd3-single-file-t5-fp8.png')

Logs

Token indices sequence length is longer than the specified maximum sequence length for this model (124 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. the image should evoke a sense of playful absurdity and culinary fantasy.']
100%|██████████████████████████████████████████████████████████████████████████████████| 28/28 [00:14<00:00,  1.94it/s]

System Info

Diffusers 0.29.2 and 0.30.0 dev, Windows 11

Who can help?

@yiyixuxu @sayakpaul @DN6 @asomoza

asomoza commented 2 months ago

Hi, max_sequence_length only applies to the T5 Text Encoder. The warning you're getting refers to the CLIP Text Encoders, which still have the 77-token limit.

This is in the documentation where it says:

The prompt with the CLIP Text Encoders is still truncated to the 77 token limit.

If you want to use long prompts and even weighting, you can use sd-embed which is a very good and small library.

tin2tin commented 2 months ago

So, doing this is a no go?

image = pipe(
    prompt="",
    prompt_3=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save('sd3-single-file-t5-fp8.png')

asomoza commented 2 months ago

That works. The warning, and what gets truncated, concerns the prompt for the CLIP Text Encoders, which are the same ones the SDXL models use. The T5 is the big model that gives you the prompt adherence.

So in short:

prompt, prompt_2: truncated at 77 tokens
prompt_3: uses max_sequence_length

You can use a separate prompt for each if you want, like in this example.

In my tests this is what works best. I didn't see any improvement from also sending a big prompt to the CLIP Text Encoders.
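As a minimal sketch of the mapping above, the helper below builds the keyword arguments for a call that sends a short prompt to the two CLIP encoders and the long prompt to T5. The function name `build_sd3_prompts` and the example strings are hypothetical and only for illustration; `prompt`, `prompt_2`, `prompt_3` and `max_sequence_length` are the pipeline arguments discussed in this thread.

```python
def build_sd3_prompts(clip_prompt, t5_prompt, max_sequence_length=512):
    """Route a short prompt to both CLIP encoders and a long prompt to T5.

    Hypothetical helper: it only assembles keyword arguments and does not
    load or call any model.
    """
    return {
        "prompt": clip_prompt,       # first CLIP text encoder (77-token limit)
        "prompt_2": clip_prompt,     # second CLIP text encoder (77-token limit)
        "prompt_3": t5_prompt,       # T5 text encoder (uses max_sequence_length)
        "max_sequence_length": max_sequence_length,
    }


# Illustrative prompts, not taken from any official example.
short_prompt = "a hybrid creature, half waffle and half hippopotamus"
long_prompt = (
    "A whimsical and creative image depicting a hybrid creature that is a mix "
    "of a waffle and a hippopotamus, set in a surreal breakfast-table habitat."
)
kwargs = build_sd3_prompts(short_prompt, long_prompt)

# With a loaded pipeline, the call would then look like:
# image = pipe(negative_prompt="", num_inference_steps=28,
#              guidance_scale=4.5, **kwargs).images[0]
```

Keeping the CLIP prompt short avoids the truncation warning entirely, while the T5 prompt carries the detailed description.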

asomoza commented 2 months ago

Also, to clear up the misunderstanding: your first example also works. Even if the prompt gets truncated for the CLIP Text Encoders, the image will still pick up the additional details of the long prompt, because the T5 prompt doesn't get truncated.

tin2tin commented 2 months ago

Oh, so the full prompt will be used even if the truncation message appears? (If so, that IS confusing)

I think what also confused me is that the PixArt Sigma models come with a much higher token limit for the prompt option, without any further ado.

asomoza commented 2 months ago

I know it seems complicated. We're thinking of writing a glossary so people can understand this better.

PixArt uses a single Text Encoder, so it doesn't have this problem. It uses only the T5, which is an LLM, so it only handles text.

The CLIP Text Encoders are smaller models which, in simpler terms, mix images with text. They don't have the capacity of an LLM.

SD3 uses 3 text encoders: the T5, which allows long prompts, and 2 CLIP Text Encoders, which have a low token limit (same as SD 1.5 and SDXL).

What is happening is that the prompt you're sending goes to all three of them at the same time and gets truncated at the 77-token limit for the CLIP Text Encoders only. The T5 still receives the full prompt, same as PixArt, so the image is still generated with the full prompt.

We didn't hide the warning because the prompt really is getting truncated for those models, so it's better that people are aware of this than to hide it and make users think the full prompt is also being used by those models.
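The routing described in this comment can be sketched in plain Python. This is a simplified illustration, not the pipeline's actual code: it splits on whitespace as a stand-in for the real tokenizers (which produce different counts and reserve positions for special tokens), and the names `route_prompt`, `clip_1`, `clip_2` and `t5` are made up for the example.

```python
CLIP_LIMIT = 77  # token limit of the CLIP text encoders mentioned above


def route_prompt(prompt, max_sequence_length=512):
    """Show which part of one prompt each of the three encoders would see.

    Assumption: whitespace-separated words stand in for real tokens.
    """
    tokens = prompt.split()
    return {
        "clip_1": tokens[:CLIP_LIMIT],       # truncated at the CLIP limit
        "clip_2": tokens[:CLIP_LIMIT],       # truncated at the CLIP limit
        "t5": tokens[:max_sequence_length],  # limited only by max_sequence_length
    }


def dropped_for_clip(prompt):
    """Return the tail of the prompt that the CLIP encoders never see."""
    return " ".join(prompt.split()[CLIP_LIMIT:])


# A stand-in prompt of 124 "tokens", matching the count in the log above.
long_prompt = " ".join(f"w{i}" for i in range(124))
routed = route_prompt(long_prompt)

print(len(routed["clip_1"]), len(routed["clip_2"]), len(routed["t5"]))
print(dropped_for_clip(long_prompt)[:20])
```

The warning in the logs corresponds to the dropped tail: it is reported for the CLIP encoders, while the T5 entry still holds the whole prompt.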
