huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

error in using stable cascade with long prompt #7672

Closed saeedkhanehgir closed 5 months ago

saeedkhanehgir commented 6 months ago

Hi,

When I use the Stable Cascade model with a long prompt, I get the error below.

Token indices sequence length is longer than the specified maximum sequence length for this model (165 > 77). Running this sequence through the model will result in indexing errors

I tried to use the compel library to fix this problem, but it didn't work.
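
For reference, the way Compel is normally wired against a pipeline's tokenizer and text encoder looks roughly like the sketch below. It is not the exact code I ran, and whether Cascade accepts these embeddings at all is an assumption on my side, since Cascade also expects pooled embeddings that Compel does not produce here.

import torch
from compel import Compel
from diffusers import StableCascadePriorPipeline

# Sketch only: wire Compel against the prior pipeline's CLIP tokenizer/text encoder.
prior = StableCascadePriorPipeline.from_pretrained(
    "stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16
).to("cuda")

compel = Compel(
    tokenizer=prior.tokenizer,
    text_encoder=prior.text_encoder,
    truncate_long_prompts=False,  # keep the whole prompt instead of cutting it at 77 tokens
)
prompt_embeds = compel("a very long prompt ...")
negative_embeds = compel("")
# Pad both conditionings to the same sequence length before use.
[prompt_embeds, negative_embeds] = compel.pad_conditioning_tensors_to_same_length(
    [prompt_embeds, negative_embeds]
)
# Assumption: the prior accepts pre-computed embeddings via `prompt_embeds` /
# `negative_prompt_embeds`; Cascade additionally uses pooled embeddings,
# which this sketch does not provide.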

Thanks

DN6 commented 6 months ago

@saeedkhanehgir Can you share a code example that produces this error, along with the full traceback? Currently the maximum supported prompt length for Stable Cascade is 77 tokens, but the prompt should be getting truncated with a warning.
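
If you want to check up front how many tokens a prompt uses, you can run it through the pipeline's CLIP tokenizer; a minimal sketch, assuming prior is a loaded StableCascadePriorPipeline and prompt is your prompt string:

# Count the CLIP tokens in a prompt to see whether it will be truncated.
ids = prior.tokenizer(prompt).input_ids
print(f"{len(ids)} tokens (limit: {prior.tokenizer.model_max_length})")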

saeedkhanehgir commented 6 months ago

@DN6 Thanks for your answer.

Here is my inference code:

import torch
from diffusers import StableCascadeDecoderPipeline, StableCascadePriorPipeline

prompt = "Portrait of an Asian woman, facing the audience, looking at the viewers, long black hair, facing the camera, wearing a t-shirt with the inscription 'SmiLe editing', denim jacket and short curvy fat body, standing at the edge of the river, with waterfalls and mountains in the forest as background, bright blue cloudy sky, close-up, realistic, 32k, HDR"
negative_prompt = ""

prior = StableCascadePriorPipeline.from_pretrained("stabilityai/stable-cascade-prior", variant="bf16", torch_dtype=torch.bfloat16).to('cuda')
decoder = StableCascadeDecoderPipeline.from_pretrained("stabilityai/stable-cascade", variant="bf16", torch_dtype=torch.float16).to('cuda')

prior_output = prior(
    prompt=prompt,
    height=1024,
    width=1024,
    negative_prompt=negative_prompt,
    guidance_scale=4.0,
    num_images_per_prompt=1,
    num_inference_steps=20,
)

decoder_output = decoder(
    image_embeddings=prior_output.image_embeddings.to(torch.float16),
    prompt=prompt,
    negative_prompt=negative_prompt,
    guidance_scale=0.0,
    output_type="pil",
    num_inference_steps=10,
).images[0]
decoder_output.save("cascade.png")

and this is the message:

Token indices sequence length is longer than the specified maximum sequence length for this model (79 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: [', hdr']

bghira commented 6 months ago

That's actually not an error, it's a warning, and the "part of your input was truncated" message indicates it's working as expected.

The message still shows up with Compel, but not the part about truncating the prompt.

The way long-prompt handling is implemented isn't great, but there aren't many other options: it lobotomises the positional embed, and it's especially an issue with models that use pooled embeds, where things get hairy.
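
Roughly what that chunk-and-concatenate workaround does, as an illustrative sketch (not the actual Compel or diffusers code; pipe stands for any pipeline exposing a CLIP tokenizer and text encoder):

import torch

def encode_long_prompt(pipe, prompt, window=77):
    # Tokenize without truncation, split the ids into `window`-sized pieces,
    # encode each piece on its own (positions restart at 0 for every piece,
    # which is the "lobotomised" positional embed), and concatenate the
    # hidden states along the sequence dimension. The pooled embedding still
    # only reflects a single piece, which is where pooled-embed models get hairy.
    tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder
    ids = tokenizer(prompt, truncation=False, return_tensors="pt").input_ids
    ids = ids.to(text_encoder.device)
    pieces = ids.split(window, dim=1)
    with torch.no_grad():
        hidden = [text_encoder(piece).last_hidden_state for piece in pieces]
    return torch.cat(hidden, dim=1)  # (1, total_tokens, hidden_dim)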

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

DN6 commented 5 months ago

@saeedkhanehgir Closing this issue for now since the pipeline isn't throwing an error. For help with dealing with long prompts, it might be better to open a thread in the Discussions section.

duonglegiang commented 2 months ago

Hi @saeedkhanehgir, can you share the source code you used with compel for Stable Cascade? Thank you