openai / glide-text2im

GLIDE: a diffusion-based text-conditional image synthesis model

Colab notebook #2

Closed woctezuma closed 1 year ago

woctezuma commented 2 years ago

I have slightly changed the structure of your text2im notebook so that:

Run text2im.ipynb Open In Colab

Reference: https://github.com/woctezuma/glide-text2im-colab

loretoparisi commented 2 years ago

@woctezuma thanks!!! Is the base the only checkpoint available for the base diffusion model? I cannot reproduce the results shown in Figure 1 of the paper with the same text prompt. In the references I can also see CLIP-guided diffusion models for both 256x256 and 512x512.

Crowson, K. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021a.
Crowson, K. CLIP guided diffusion 512x512, secondary model method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021b.

woctezuma commented 2 years ago

I cannot reproduce the results shown in Figure 1 of the paper with the same text prompt.

Unfortunately, this is normal: the publicly available checkpoint is the small GLIDE (filtered) model, which was trained on a filtered dataset and is weaker than the model used for Figure 1 of the paper.

You should get outputs similar to the third row of Figure 9.

(screenshot of the Figure 9 caption from the GLIDE paper)

From a user perspective, the main benefit of GLIDE is that it is much faster than the CLIP-guided methods I have tried so far.

Is the base the only checkpoint available for the base diffusion model?

I think so. From what I can see in the code below, there are 6 checkpoints:

https://github.com/openai/glide-text2im/blob/742510effd841c94d2130480f2c74d3b32dc2eb0/glide_text2im/download.py#L10-L17
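
If I remember the file correctly, the six checkpoint keys are roughly the following (URLs shortened here, so check download.py for the exact values):

# Checkpoint names defined in glide_text2im/download.py (paths abbreviated):
MODEL_PATHS = {
    "base": "...",              # base 64x64 text-conditional model
    "upsample": "...",          # 64x64 -> 256x256 upsampler
    "base-inpaint": "...",      # inpainting variant of the base model
    "upsample-inpaint": "...",  # inpainting variant of the upsampler
    "clip/image-enc": "...",    # noised CLIP image encoder (for CLIP guidance)
    "clip/text-enc": "...",     # noised CLIP text encoder (for CLIP guidance)
}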

woctezuma commented 2 years ago

I see the following nice commits:

loretoparisi commented 2 years ago

@woctezuma thanks! I can see that the sampling part is slightly different from yours, adding the model_fn function to the sample loop. Is this related to the fact that they just do classifier-free guidance (cond_fn=None) rather than CLIP guidance like in your Colab? Also, I have tried to combine the last two, and the results seem to be better, as if CLIP guidance introduces too much randomness for the small model. Any idea why?


# Create the text tokens to feed to the model.
tokens = model.tokenizer.encode(prompt)
tokens, mask = model.tokenizer.padded_tokens_and_mask(
    tokens, options['text_ctx']
)

# Create the classifier-free guidance tokens (empty)
full_batch_size = batch_size * 2
uncond_tokens, uncond_mask = model.tokenizer.padded_tokens_and_mask(
    [], options['text_ctx']
)

# Pack the tokens together into model kwargs.
model_kwargs = dict(
    tokens=th.tensor(
        [tokens] * batch_size + [uncond_tokens] * batch_size, device=device
    ),
    mask=th.tensor(
        [mask] * batch_size + [uncond_mask] * batch_size,
        dtype=th.bool,
        device=device,
    ),
)

# Create a classifier-free guidance sampling function
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)
    model_out = model(combined, ts, **kwargs)
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)

# Sample from the base model.
model.del_cache()
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]
model.del_cache()

# Show the output
show_images(samples)

(output image: a black cat with white paws)

woctezuma commented 2 years ago

I can see that the sampling part is slightly different from yours, adding the model_fn function to the sample loop. Is this related to the fact that they just do classifier-free guidance (cond_fn=None) rather than CLIP guidance like in your Colab?

To clarify any confusion:

Unless I am missing something, the model_fn function is added to the sample loop in both notebooks called text2im.ipynb.

# Sample from the base model.
model.del_cache()
samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]
model.del_cache()

Also, I have tried to combine the last two, and the results seem to be better, as if CLIP guidance introduces too much randomness for the small model. Any idea why?

I need to see the diff of what you did to understand better.

I would be glad to test this and see the results, if they are better. :) The black cat with white paws looks nice. 👍

loretoparisi commented 2 years ago

Thanks! I have two versions. This one:

# CLIP-guided sampling: cond_fn nudges each denoising step using CLIP gradients.
samples = diffusion.p_sample_loop(
    model,
    (batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=cond_fn,
)

where

cond_fn = clip_model.cond_fn([prompt] * batch_size, guidance_scale)
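
and clip_model is the noised CLIP model from the clip_guided notebook, created earlier on, roughly like this (from memory, so the exact lines may differ slightly):

from glide_text2im.clip.model_creation import create_clip_model
from glide_text2im.download import load_checkpoint

# Create the noised CLIP model and load its two checkpoints.
clip_model = create_clip_model(device=device)
clip_model.image_encoder.load_state_dict(load_checkpoint('clip/image-enc', device))
clip_model.text_encoder.load_state_dict(load_checkpoint('clip/text-enc', device))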

and this one, from the latest Colab in the repo:

samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=None,
)[:batch_size]

with cond_fn=None and the following model_fn:

# Classifier-free guidance: mix the conditional and unconditional eps predictions.
def model_fn(x_t, ts, **kwargs):
    half = x_t[: len(x_t) // 2]
    combined = th.cat([half, half], dim=0)
    model_out = model(combined, ts, **kwargs)
    eps, rest = model_out[:, :3], model_out[:, 3:]
    cond_eps, uncond_eps = th.split(eps, len(eps) // 2, dim=0)
    half_eps = uncond_eps + guidance_scale * (cond_eps - uncond_eps)
    eps = th.cat([half_eps, half_eps], dim=0)
    return th.cat([eps, rest], dim=1)
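
For reference, combining the two basically means keeping model_fn and passing a CLIP cond_fn as well, roughly like this (a sketch rather than an exact diff; note that cond_fn then also sees the unconditional half of the batch, which may matter):

# Classifier-free guidance via model_fn + CLIP guidance via cond_fn.
cond_fn = clip_model.cond_fn([prompt] * full_batch_size, guidance_scale)

samples = diffusion.p_sample_loop(
    model_fn,
    (full_batch_size, 3, options["image_size"], options["image_size"]),
    device=device,
    clip_denoised=True,
    progress=True,
    model_kwargs=model_kwargs,
    cond_fn=cond_fn,
)[:batch_size]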