Closed: petekay closed this issue 1 year ago
I noticed this today as well and can confirm that prompts with <|endoftext|> removed give better results than prompts that include <|endoftext|>, in the 20 or so tests I completed using stable_diffusion_v1-5-vae_mse_onnx.
Tested using:
diffusers version 0.7.2
I will complete some further testing and try to support my initial findings.
The testing environment was set up following: Stable Diffusion for AMD GPUs on Windows using DirectML
11062022-160057 - Model: ./models/sd-onnx-v1-5
11062022-160057 - Prompt: cat
11062022-160057 - Neg_Prompt:
11062022-160057 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-160057 - Seed: 168895681900500
11062022-160345 - Model: ./models/sd-onnx-v1-5
11062022-160345 - Prompt: cat<|endoftext|>
11062022-160345 - Neg_Prompt:
11062022-160345 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-160345 - Seed: 168895681900500
11062022-160632 - Model: ./models/sd-onnx-v1-5
11062022-160632 - Prompt: cat<|endoftext|><|endoftext|><|endoftext|>...<|endoftext|>
11062022-160632 - Neg_Prompt:
11062022-160632 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-160632 - Seed: 168895681900500
11062022-161757 - Model: ./models/sd-onnx-v1-5
11062022-161757 - Prompt: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11062022-161757 - Neg_Prompt:
11062022-161757 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-161757 - Seed: 168895681900500
11062022-161602 - Model: ./models/sd-onnx-v1-5
11062022-161602 - Prompt: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11062022-161602 - Neg_Prompt:
11062022-161602 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-161602 - Seed: 168895681900500
11062022-173708 - Model: ./models/sd-onnx-v1-5
11062022-173708 - Prompt: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11062022-173708 - Neg_Prompt:
11062022-173708 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-173708 - Seed: 168895681900500
11062022-180605 - Model: ./models/sd-onnx-v1-5
11062022-180605 - Prompt: cat,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
11062022-180605 - Neg_Prompt:
11062022-180605 - Inference Steps: 50 Guidance Scale: 7.5 Width: 512 Height: 512
11062022-180605 - Seed: 168895681900500
You can also reproduce this with the default pipeline, so this is also a bug regarding the text cut-off:
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    use_auth_token=True
).to("cuda")
Using prompt 1:
"Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths,"
should produce exactly the same result as prompt 2:
"Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths"
because the trailing "," is one token too many, the code cuts both the "," and the EOT flag, so the results are not the same/deterministic as they should be.
["token73, token74, token75"]: OK -> EOT flag appended
["token73, token74, token75,"]: too many tokens -> truncated, without EOT flag
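The effect can be sketched with a minimal, hypothetical model of CLIP-style tokenization. The ids 49406/49407 are CLIP's actual <|startoftext|>/<|endoftext|> ids, but the encode helper below is an illustrative assumption, not the real diffusers/transformers code:

```python
# Minimal sketch (not the actual diffusers/transformers code) of how naive
# truncation to 77 tokens can drop the <|endoftext|> (EOT/EOS) token.
BOS, EOS = 49406, 49407  # CLIP's <|startoftext|> / <|endoftext|> ids
MAX_LEN = 77

def encode(content_ids):
    """Mimic tokenizer(..., truncation=True, max_length=77)."""
    ids = [BOS] + content_ids + [EOS]
    return ids[:MAX_LEN]  # blind truncation: EOS may be cut off

fits = encode(list(range(75)))      # 75 content tokens + BOS + EOS = 77
overflow = encode(list(range(76)))  # one content token too many

print(fits[-1] == EOS)      # True: EOT flag survives
print(overflow[-1] == EOS)  # False: EOT flag was truncated away
```

So the two prompts above end up with different final tokens (EOS vs. the last content token), which is enough to change the text embedding.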
In summary, we found two properties:
1.) Very good find, the cut-off indeed does not behave as it should! We should always add an EOS token at the end. Will open a PR to fix this!
2.) Interesting, not sure if this means we should change anything.
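The fix described in point 1 (always terminate the truncated sequence with EOS) can be sketched as follows; this is a hypothetical illustration under the same toy tokenizer assumptions, not the actual PR:

```python
# Hypothetical sketch of the proposed fix: after truncating to the model's
# maximum length, force the last token to be <|endoftext|> (EOS).
BOS, EOS = 49406, 49407  # CLIP's <|startoftext|> / <|endoftext|> ids
MAX_LEN = 77

def encode_with_eos(content_ids):
    ids = [BOS] + content_ids + [EOS]
    ids = ids[:MAX_LEN]
    ids[-1] = EOS  # re-append EOS even when truncation removed it
    return ids

# Overflowing prompts now still end in EOS, matching the non-overflowing case.
print(encode_with_eos(list(range(76)))[-1] == EOS)  # True
```

With this change, a prompt that is exactly one token too long and the same prompt with the overflow removed encode to the same terminal token, restoring the expected determinism.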
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
Hi, I trained a personal model (key: smnb, class: person) with DreamBooth, and I found that I can't replicate the results between automatic1111 on the one side and huggingface/transformers + notebook on the other side. The fascinating thing is that if I use exactly max_words + 1, my prompt is shortened by that word but also by the endoftext token, and this creates fantastic results.
this runs fine, and produces this image:
but if I add a comma to the prompt and rerun the code (of course with the same seed), I get this error but a very nice picture:
so the exact prompt is: prompt = "Ultrawide realistic photo of a smnb person viking men, leading a battle, battle-scarred mind-blowing details, ethereal, ominous, scarred, highly detailed, viking attire, cinematic, 16k, 1080s, smooth, sharp focus, by stanley artgermm, tom bagshaw, greg rutkowski, vincent di fate, carne griffiths, ayami kojima, trending on deviantart, hyper detailed, full of color, digital art, vibrant colors, smooth gradients, high contrast, depth of field, shot on canon camera,"
--> ("," is the last character)
(it's even so similar to the real person, that I hide the eyes here :) )
if you think this is by accident, let's compare another seed: seed = 1117437320, prompt with comma (and endoftext cut-off?):
without comma (no cut-off, prompt length = ok):
seed 1117437334: with comma prompt:
without comma (no cut-off, prompt length = ok):
this goes on and on. Of course this is not because of the "," itself, but I think the forward step behaves better if the endoftext token is cut off. Can someone confirm these results with other trained models?
I compared the prompts with 20 different seeds. Of course this is not a very big "study", but in my case 6-7 of the 20 results in the with-comma edition were really good, with high similarity to the original picture/person.
In the non-comma edition, I would say that none of the 20 pictures looked like the original.
maybe related issues: https://github.com/facebookresearch/SLIP/issues/18
related code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L278
Reproduction
keyword: person
model trained with DreamBooth
You can also append additional tokens, which also get cut off:
I added
",a,b,c"
prompt = "[identical from above] carne griffiths,a,b,c"
which resulted in this message (and again the great output/pictures):
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['a, b, c <|endoftext|>']
In any case, this is not the same as setting the prompt to:
prompt = "[identical from above] carne griffiths"
which most people would assume, because the prompt cut-off also cuts the flag word "endoftext" off (I assume).
I replicated this locally with these packages on Windows + conda, and I can also replicate it exactly on Google Colab. Perfect replication is only possible for me because I have my trained model, but if this is a bug, you could replicate it with some real-person pictures like I did.
Logs
No response
System Info
important package list:
diffusers==0.6.0
torch==1.12.1
torch-fidelity==0.3.0
torchaudio==0.12.1
torchmetrics==0.6.0
torchvision==0.13.1
transformers==4.18.0
python==3.8.10