huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0
25.42k stars 5.27k forks

SD3 Token Limit? #8500

Closed neuron-party closed 2 months ago

neuron-party commented 3 months ago

Describe the bug

From the SD3 research paper, it seems the new context window (token limit) should be 77 + 77 = 154 tokens. However, submitting a prompt longer than 77 tokens produces a truncation warning.

Reproduction

any prompt with > 77 tokens
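The truncation budget behind the warning can be sketched in plain Python (the helper name is hypothetical; real CLIP tokenizers reserve two of the 77 slots for the begin/end-of-sequence markers):

```python
def truncate_clip_tokens(token_ids, max_len=77):
    """Split token ids into the part a CLIP text encoder keeps and the
    overflow it drops. Two of the 77 positions are reserved for the
    BOS/EOS markers, so only 75 content tokens survive."""
    content_budget = max_len - 2  # BOS + EOS take two slots
    kept = token_ids[:content_budget]
    dropped = token_ids[content_budget:]
    return kept, dropped

# A 100-token prompt loses its last 25 tokens:
kept, dropped = truncate_clip_tokens(list(range(100)))
print(len(kept), len(dropped))  # 75 25
```

Everything in `dropped` is what the warning quotes back as "the following part of your input was truncated".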

Logs

No response

System Info

diffusers source

Who can help?

No response

rolux commented 3 months ago

Encoding a long example prompt from the SD3 paper:

prompt = (
    "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. "
    "This imaginative creature features the distinctive, bulky body of a hippo, but with a texture and appearance resembling a golden-brown, crispy waffle. "
    "The creature might have elements like waffle squares across its skin and a syrup-like sheen. "
    "It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. "
    "The image should evoke a sense of playful absurdity and culinary fantasy."
)

...results in the following three warnings:

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. the image should evoke a sense of playful absurdity and culinary fantasy.']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. the image should evoke a sense of playful absurdity and culinary fantasy.']
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ["like waffle squares across its skin and a syrup-like sheen. It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. The image should evoke a sense of playful absurdity and culinary fantasy."]

The actual prompt_embeds (154 rows of 4096 values) look like this, using:

torch.clamp(torch.sign(embeds) * torch.sqrt(torch.abs(embeds)), -10, 10)

[image: visualization of the prompt_embeds ("paper_prompt")]
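The torch expression above has an equivalent NumPy rendering (same math, only the library swapped) that makes the visualization transform easy to inspect: a signed square root compresses large magnitudes, then values are clamped to [-10, 10].

```python
import numpy as np

def signed_sqrt_clamp(embeds, limit=10.0):
    """Compress embedding magnitudes for visualization: take the square
    root of the absolute value, restore the sign, then clamp to
    [-limit, limit]. NumPy version of the torch one-liner quoted above."""
    return np.clip(np.sign(embeds) * np.sqrt(np.abs(embeds)), -limit, limit)

x = np.array([-144.0, -4.0, 0.0, 0.25, 400.0])
print(signed_sqrt_clamp(x))  # values: -10, -2, 0, 0.5, 10
```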

Typical results show that the later parts of the prompt have in fact been disregarded:

[image: typical result, seed 4101555727; the visible caption stops at "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hip"]

This is the result presented in the SD3 paper:

[image: waffle hippo result from the SD3 paper]

yiyixuxu commented 3 months ago

we are working on supporting longer T5 prompts here: https://github.com/huggingface/diffusers/pull/8506 cc @asomoza

xhinker commented 3 months ago

I'm working on overcoming all three text models' token limitations with prompt weighting. It seems to be working well; one preview sample: [image]

asomoza commented 3 months ago

To share, since you seem to be working on the same thing: I found that instead of making the CLIP models accept more tokens, which might degrade the generation, it's better to make their prompt more robust, including only the main subjects and important details, and to use a separate prompt for the T5 to add more intricate details and spatial composition.

[images: example generations, seeds 1620204366 and 606373744]

yiyixuxu commented 3 months ago

This should be fixed by https://github.com/huggingface/diffusers/pull/8506. Do we want to close this issue or keep it open for more discussion?

asomoza commented 3 months ago

I think it's ok to close it, I'll probably open a discussion about it at a later time with some insights I found about this. @xhinker are you going to code a SD3 lpw pipeline?

xhinker commented 3 months ago

> I think it's ok to close it, I'll probably open a discussion about it at a later time with some insights I found about this. @xhinker are you going to code a SD3 lpw pipeline?

Actually, I have almost finished an independent module that can process long weighted prompts for SD3, as well as SD15 and SDXL.

Instead of building a custom lpw pipeline, I'm wondering if there is a way to provide it as a tool, so that Diffusers users can decide whether to use it to generate embeddings of any length or to use the default text encoders from the Diffusers StableDiffusion3Pipeline.
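At its core, the long-prompt approach being described here is chunking: split the token sequence into fixed-size windows, encode each window separately, and concatenate the per-window embeddings. A toy sketch under that assumption (all names hypothetical; a stand-in callable replaces the real CLIP/T5 encoder forward pass, and real implementations such as sd_embed also handle per-token weights):

```python
def encode_long_prompt(token_ids, encode_window, window=77):
    """Encode a token sequence of arbitrary length by splitting it into
    fixed-size windows and concatenating the per-window embeddings.
    `encode_window` stands in for a text-encoder forward pass that maps
    a chunk of token ids to a list of embedding rows."""
    embeddings = []
    for start in range(0, len(token_ids), window):
        chunk = token_ids[start:start + window]
        embeddings.extend(encode_window(chunk))
    return embeddings

# Stand-in encoder: one "embedding" row (just the id) per token.
fake_encoder = lambda chunk: [[float(t)] for t in chunk]
embeds = encode_long_prompt(list(range(200)), fake_encoder)
print(len(embeds))  # 200 rows; nothing is truncated
```

The resulting embedding tensor is then passed to the pipeline via `prompt_embeds` instead of a string prompt, which is why no 77-token limit applies.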

@asomoza Thoughts?

asomoza commented 3 months ago

I like the idea; there is a PR about this. Adding a StableDiffusionLongPromptProcessor or something similar would be nice, but it could also be an external tool with its own repo.

@yiyixuxu will something like this conflict with that PR?

yiyixuxu commented 3 months ago

@asomoza @xhinker that PR is stalled and was nowhere near a state where it could be merged, so feel free to open a PR!

neuron-party commented 3 months ago

is this how they did it in the SD3 paper? i.e., assigning prompts longer than 77 tokens to the T5 encoder only and truncating them for the CLIP encoders?

asomoza commented 3 months ago

The paper doesn't say anything about the T5 training. We know that the base model has a 512-token limit, but we don't know (officially) how many tokens they used to train it, or whether they really trained it with more than 77.

For example, PixArt-Alpha uses 120 tokens and PixArt-Sigma 300, and each of those papers says so.

There probably isn't a single base model whose CLIP encoders were trained with more than 77 tokens. This is just my own opinion and I could be wrong, but doing so would make the training worse, or at least make it take a lot longer to learn things, depending on the technique used to exceed the limit. Note that I mean training from scratch, not fine-tuning.

P.S.: If we go by the formulas, then we can say they trained the T5 with 77 tokens.

neuron-party commented 3 months ago

@asomoza gotcha. could you provide a quick code snippet for using the shorter/robust prompt for clip encoders and the longer prompt for just the T5 for sd3 inference? would be really helpful. thanks!

asomoza commented 3 months ago

yeah, sure, with the same example with the hippo:


# Setup added for completeness; the checkpoint id is the SD3 medium repo on
# the Hub, and 606373744 is the seed from the example images below.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")
generator = torch.Generator("cuda").manual_seed(606373744)

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. A river of warm, melted butter, pancake-like foliage in the background, a towering pepper mill standing in for a tree."
prompt_t5 = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipe(
    prompt=prompt,
    prompt_3=prompt_t5,  # prompt_3 goes to the T5 encoder
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    generator=generator,
    max_sequence_length=512,
    width=1280,
    height=768,
).images[0]
[images: results with prompt only, prompt_t5 only, and prompt + prompt_t5, all with seed 606373744]

Still, I think this is just too new. Maybe there are other parameters that make more of a difference; also, depending on where the prompt gets truncated and how you prompt, there could be little difference between using a separate prompt for the T5 or not.

xhinker commented 3 months ago

@asomoza @neuron-party I've uploaded the module that supports long weighted prompts for SD3 (unlimited for the two CLIP encoders, 512 max for the T5), as well as SDXL and SD15: https://github.com/xhinker/sd_embed

asomoza commented 3 months ago

Thanks for your work, it looks really nice. I'll do some tests with it tomorrow.

asomoza commented 2 months ago

This was resolved with #8506 and with https://github.com/xhinker/sd_embed. I'm closing it for now; feel free to open a new issue if there are any more questions.