@woctezuma thanks!!! Is the base the only checkpoint available for the base diffusion model? I cannot reproduce the results shown in Figure 1 of the paper with the same text prompt. In the references I can also see CLIP-guided diffusion models for both 256x256 and 512x512:
Crowson, K. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021a.
Crowson, K. CLIP guided diffusion 512x512, secondary model method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021b.
I cannot reproduce the results shown in Figure 1 of the paper with the same text prompt.
Unfortunately, this is normal, because the publicly available model is smaller than the one used for the paper's figures and was trained on a filtered dataset.
You should get outputs similar to the third row of Figure 9.
From a user perspective, the main benefit of GLIDE is that it is much faster than the CLIP-guided methods which I have tried so far.
Is the base the only checkpoint available for the base diffusion model?
I think so. From what I can see in the download code of the repository, there are 6 checkpoints: base, upsample, base-inpaint, upsample-inpaint, clip/image-enc, and clip/text-enc.
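Only the base checkpoint targets the base diffusion model. For reference, here is how it is loaded in the official text2im.ipynb (a condensed excerpt; it assumes the glide-text2im package is installed):

import torch as th

from glide_text2im.download import load_checkpoint
from glide_text2im.model_creation import (
    create_model_and_diffusion,
    model_and_diffusion_defaults,
)

has_cuda = th.cuda.is_available()
device = th.device('cuda' if has_cuda else 'cpu')

# Create the base model with the default hyperparameters.
options = model_and_diffusion_defaults()
options['use_fp16'] = has_cuda
options['timestep_respacing'] = '100'  # fast sampling with 100 diffusion steps
model, diffusion = create_model_and_diffusion(**options)
model.eval()
if has_cuda:
    model.convert_to_fp16()
model.to(device)

# Load the 'base' checkpoint; the other five names are fetched the same way.
model.load_state_dict(load_checkpoint('base', device))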
I see the following nice commits:
- add install command to notebooks -> git and pip at the start of the notebook,
- add colab links -> Colab badges for the links in the README,
- colab GPU backend -> GPU support toggled ON.

@woctezuma thanks! I can see that the sampling part is slightly different from yours, adding the model_fn function to the sample loop. Is this related to the fact that they just do classifier-free guidance (cond_fn=None) rather than CLIP guidance like in your colab? Also, I have tried to combine the last two, and the results seem to be better, as if CLIP guidance introduces too much randomness for the small model. Any idea why?
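For context, the snippet below relies on a few variables defined earlier in the official notebook; the values in text2im.ipynb are:

# Sampling hyperparameters, as set near the top of the official text2im.ipynb.
prompt = "an oil painting of a corgi"
batch_size = 1
guidance_scale = 3.0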
# Create the text tokens to feed to the model.
tokens = model.tokenizer.encode(prompt)
tokens, mask = model.tokenizer.padded_tokens_and_mask(
    tokens, options['text_ctx']
)

# Create the classifier-free guidance tokens (empty)
full_batch_size = batch_size * 2
uncond_tokens, uncond_mask = model.tokenizer.padded_tokens_and_mask(
    [], options['text_ctx']
)

# Pack the tokens together into model kwargs.
model_kwargs = dict(
    tokens=th.tensor(
        [tokens] * batch_size + [uncond_tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size + [uncond_mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),
)

# Create a classifier-free guidance sampling function
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)
    model_out = model(combined, ts, **kwargs)
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)

# Sample from the base model.
model.del_cache()
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]
model.del_cache()

# Show the output
show_images(samples)
I can see that the sampling part is slightly different from yours, adding the model_fn function to the sample loop. Is this related to the fact that they just do classifier-free guidance (cond_fn=None) rather than CLIP guidance like in your colab?
To clarify any confusion:
- when cond_fn is not None, I assume you are looking at the CLIP-guided approach: https://github.com/openai/glide-text2im/blob/main/notebooks/clip_guided.ipynb
- the notebook linked in my first post is the classifier-free guidance, with cond_fn=None, copied from: https://github.com/openai/glide-text2im/blob/main/notebooks/text2im.ipynb

Unless I am missing something, the model_fn function is added to the sample loop in both notebooks called text2im.ipynb.
# Sample from the base model.
model.del_cache()
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]
model.del_cache()
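To make the difference concrete: model_fn mixes the conditional and unconditional noise predictions (classifier-free guidance), whereas a CLIP-based cond_fn returns a gradient that nudges every denoising step toward images whose CLIP embedding matches the text. A rough sketch of the latter idea (illustration only, not the library's exact code; the image_embeddings helper is an assumption):

def make_clip_cond_fn(clip_model, text_embeddings, grad_scale):
    # Illustration of CLIP guidance, not the library's exact implementation.
    def cond_fn(x_t, t, **kwargs):
        with th.enable_grad():
            x_in = x_t.detach().requires_grad_(True)
            # Hypothetical helper: embed the noisy image at timestep t.
            image_embeddings = clip_model.image_embeddings(x_in, t)
            similarity = (image_embeddings * text_embeddings).sum()
            # The sampler adds this gradient to the predicted mean.
            return th.autograd.grad(similarity * grad_scale, x_in)[0]
    return cond_fn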
Also, I have tried to combine the last two, and the results seem to be better, as if CLIP guidance introduces too much randomness for the small model. Any idea why?
I need to see the diff of what you did to understand better.
I would be glad to test this and see the results, if they are better. :) The black cat with white paws looks nice. 👍
Thanks! I have two versions. This one:
samples = diffusion.p_sample_loop(
    model,
    (batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=cond_fn,
)
where
cond_fn = clip_model.cond_fn([prompt] * batch_size, guidance_scale)
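(For completeness, clip_model is created as in the official clip_guided.ipynb notebook, roughly:)

from glide_text2im.clip.model_creation import create_clip_model
from glide_text2im.download import load_checkpoint

# Create the noise-aware CLIP model and load its two checkpoints.
clip_model = create_clip_model(device=device)
clip_model.image_encoder.load_state_dict(load_checkpoint('clip/image-enc', device))
clip_model.text_encoder.load_state_dict(load_checkpoint('clip/text-enc', device))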
and this one, in the latest colab from the repo:
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]
with cond_fn=None, and as model_fn:
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)
    model_out = model(combined, ts, **kwargs)
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)
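Combining the two would presumably mean passing both the classifier-free model_fn and a CLIP cond_fn to the same sample loop. A hypothetical sketch of that combination (a guess at the experiment described above, not a verified recipe):

# Hypothetical combination of classifier-free and CLIP guidance.
# Note: cond_fn must cover the doubled batch used by model_fn.
cond_fn = clip_model.cond_fn([prompt] * full_batch_size, guidance_scale)
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=cond_fn,
)[:batch_size]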
I have slightly changed the structure of your text2im notebook. To try it, run text2im.ipynb.

Reference: https://github.com/woctezuma/glide-text2im-colab