huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

prompt endoftext bug - results with prompt > max-length are way better than prompts with length <= max-length #1165

Closed petekay closed 1 year ago

petekay commented 1 year ago

Describe the bug

Hi, I trained a personal model (key: smnb, class: person) with DreamBooth, and I found that I can't replicate the results between automatic1111 on the one side and huggingface/diffusers + a notebook on the other side.

The fascinating thing is that if I use a prompt of exactly max length + 1, the prompt is shortened by that extra token but also by the endoftext token, and this creates fantastic results.
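
For reference, you can count a prompt's tokens directly (a minimal sketch, assuming the standard openai/clip-vit-large-patch14 tokenizer that Stable Diffusion v1.x uses):

    from transformers import CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

    prompt = "your prompt here"
    ids = tokenizer(prompt).input_ids  # includes <|startoftext|> and <|endoftext|>

    # model_max_length is 77: up to 75 content tokens plus BOS and EOS;
    # if len(ids) > 77, the pipeline will truncate the prompt
    print(len(ids), "of", tokenizer.model_max_length)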

import torch
from torch import autocast
from diffusers import DDIMScheduler, StableDiffusionPipeline

scheduler = DDIMScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", clip_sample=False, set_alpha_to_one=False)
safety_checker = None

# model_path points to the DreamBooth-trained model
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, revision="fp16", scheduler=scheduler, safety_checker=safety_checker).to("cuda")
g_cuda = None

# %%
#@title Run for generating images.
g_cuda = torch.Generator(device='cuda')
seed = 1117437330 #@param {type:"number"}
g_cuda.manual_seed(seed)

prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera" #@param {type:"string"}
negative_prompt = "" #@param {type:"string"}
num_samples = 1 #@param {type:"number"}
guidance_scale = 10 #@param {type:"number"}
num_inference_steps = 50 #@param {type:"number"}
height = 512 #@param {type:"number"}
width = 512 #@param {type:"number"}

with autocast("cuda"), torch.inference_mode():
    images = pipe(
        prompt,
        height=height,
        width=width,
        negative_prompt=negative_prompt,
        num_images_per_prompt=num_samples,
        num_inference_steps=num_inference_steps,
        guidance_scale=guidance_scale,
        generator=g_cuda
    ).images

for img in images:
    display(img)

This runs fine and produces this image: [image]

But if I add a comma to the prompt and rerun the code (of course with the same seed), I get this warning but a very nice picture:

So the exact prompt is: prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera,"

--> ("," is the last character)

The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['<|endoftext|>']

[image]

(It's even so similar to the real person that I hide the eyes here :) )

If you think this is a coincidence, let's compare another seed: seed = 1117437320, with comma in the prompt (and endoftext cut-off?): [image]

Without comma (no cut-off, prompt length ok): [image]

seed = 1117437334, with comma in the prompt: [image]

Without comma (no cut-off, prompt length ok): [image]

This goes on and on. Of course it is not the "," itself that matters; I think the forward step works better when the endoftext token is cut off. Can someone confirm these results with other trained models?

I compared the prompts with 20 different seeds. Of course this is not a very big "study", but in my case 6-7 of the 20 results in the with-comma edition were really good and showed high similarity to the original picture/person.

In the no-comma edition, I would say that none of the 20 pictures looked like the original.

Maybe related issue: https://github.com/facebookresearch/SLIP/issues/18

Related code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L278
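
In diffusers 0.6.0 the linked part looks roughly like this (a paraphrased sketch, not the verbatim source): the prompt is tokenized without truncation, the overflow is logged, and the ids are then manually sliced to model_max_length, which slices off the trailing <|endoftext|> as well:

    text_inputs = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        return_tensors="pt",
    )
    text_input_ids = text_inputs.input_ids

    if text_input_ids.shape[-1] > tokenizer.model_max_length:
        # this produces the "The following part of your input was truncated..." warning
        removed_text = tokenizer.batch_decode(
            text_input_ids[:, tokenizer.model_max_length:]
        )
        # the slice keeps the first 77 ids -- the EOS token at the real end is dropped
        text_input_ids = text_input_ids[:, :tokenizer.model_max_length]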

Reproduction

  1. Train a keyword person model with DreamBooth.
  2. Create images with a prompt of maximal length + 1 character (",").
  3. Compare the results to a prompt of max length or less.

You can also add additional tokens, which also get cut off:

I added",a,b,c" prompt = "[identical from above] carne griffiths,a,b,c" which resulted in this message (and again the great output/pictures): The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['a, b, c <|endoftext|>']

In any case, this is not the same as setting the prompt to prompt = "[identical from above] carne griffiths", which is what most people would assume; the prompt cut-off also cuts the "endoftext" flag token off (I assume).

I replicated this locally with the packages below on Windows + conda, and I can also replicate it exactly on Google Colab. Exact replication is only possible for me because I have my trained model, but if this is a bug, you should be able to replicate it with some real-person pictures like I did.

Logs

No response

System Info

Important package list:

diffusers==0.6.0
torch==1.12.1
torch-fidelity==0.3.0
torchaudio==0.12.1
torchmetrics==0.6.0
torchvision==0.13.1
transformers==4.18.0

python==3.8.10

averad commented 1 year ago

I noticed this today as well and can confirm that prompts with <|endoftext|> removed give better results than prompts that include <|endoftext|>, in the 20 or so tests I completed using stable_diffusion_v1-5-vae_mse_onnx.

Tested using:

I will complete some further testing and try to support my initial findings.

averad commented 1 year ago

Txt2Img - Stable Diffusion v1.5 - Test Prompt Max Length Impact on Image Results

Testing environment was setup following: Stable Diffusion for AMD GPUs on Windows using DirectML

Findings:

Testing Parameters

Test Results

Test_1: cat

11062022-160057 - Model: ./models/sd-onnx-v1-5
11062022-160057 - Prompt: cat
11062022-160057 - Neg_Prompt: 
11062022-160057 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-160057 - Seed: 168895681900500

Test_2: cat<|endoftext|>

11062022-160345 - Model: ./models/sd-onnx-v1-5
11062022-160345 - Prompt: cat<|endoftext|>
11062022-160345 - Neg_Prompt: 
11062022-160345 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-160345 - Seed: 168895681900500

Test_3: cat<|endoftext|><|endoftext|>...repeat to max CLIP allowed amount

Test_4: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, [ , x 74 ]

11062022-161757 - Model: ./models/sd-onnx-v1-5
11062022-161757 - Prompt: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11062022-161757 - Neg_Prompt: 
11062022-161757 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-161757 - Seed: 168895681900500

Test_5: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, [ , x 75 ]

Test_6: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, [ , x 76 ]

Test_7: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, [ , x 73 ]

11062022-180605 - Model: ./models/sd-onnx-v1-5
11062022-180605 - Prompt: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11062022-180605 - Neg_Prompt: 
11062022-180605 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-180605 - Seed: 168895681900500
petekay commented 1 year ago

You can also use the default pipeline, so this is a bug in the default text cut-off as well:

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        use_auth_token=True
    ).to("cuda")

Using prompt1 "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths,"

should produce exactly the same result as prompt2: "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths"

because the "," is too much, the code cuts the "," AND the EOT-flag, so the results are not the same/deterministic as they should. [“token73, token74, token75”] : ok -> + EOT flag [“token73, token74, token75,”] : to much tokens -> without EOT flag

In summary, we found two properties:

  1. the cut-off does not behave as expected, because it cuts the EOT flag as well
  2. results based on prompts shortened by the CLIP tokenizer look better (subjective opinion)
patrickvonplaten commented 1 year ago

1.) Very good find, the cut-off indeed does not behave as it should! We should always add an EOS token to the end. Will open a PR to fix this!
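
For reference, the fix presumably boils down to letting the tokenizer truncate itself, e.g. with truncation=True, which reserves the final slot for the EOS token instead of slicing it away (a sketch, not the actual PR):

    from transformers import CLIPTokenizer

    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    too_long = " ".join(["cat"] * 100)

    ids = tok(too_long, truncation=True, max_length=tok.model_max_length).input_ids
    print(len(ids))                            # 77
    print(tok.convert_ids_to_tokens(ids)[-1])  # '<|endoftext|>' is kept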

2.) Interesting, I'm not sure if this means we should change anything.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.