lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License
11.03k stars 1.07k forks

[QUESTION] [BEGINNER] How to save image from 4d tensor? generating plain noise. #80

Closed dani3lh00ps closed 2 years ago

dani3lh00ps commented 2 years ago

Hi, I am running the following code:

import torch
from dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder, OpenAIClipAdapter

# openai pretrained clip - defaults to ViT-B/32

clip = OpenAIClipAdapter()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# prior networks (with transformer)

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2
).cuda()

loss = diffusion_prior(text, images)
loss.backward()

# do above for many steps ...
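(Note that "do above for many steps" also elides the optimizer: calling loss.backward() alone only accumulates gradients and never updates any weights. A minimal sketch of what the loop would look like, with a toy torch.nn.Linear standing in for diffusion_prior so it runs on CPU; in the real code you would call diffusion_prior(text, images) and train on a large dataset:)

```python
import torch

# Toy stand-in for diffusion_prior: any nn.Module whose forward yields a scalar loss.
model = torch.nn.Linear(8, 1)
data = torch.randn(64, 8)
target = torch.zeros(64, 1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    loss = torch.nn.functional.mse_loss(model(data), target)  # stand-in for diffusion_prior(text, images)
    optimizer.zero_grad()  # clear gradients from the previous step
    loss.backward()        # accumulate gradients
    optimizer.step()       # actually update the parameters
```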

# decoder (with unet)

unet1 = Unet(
    dim = 128,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults=(1, 2, 4, 8)
).cuda()

unet2 = Unet(
    dim = 16,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8, 16)
).cuda()

decoder = Decoder(
    unet = (unet1, unet2),
    image_sizes = (128, 256),
    clip = clip,
    timesteps = 100,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5,
    condition_on_text_encodings = False  # set this to True if you wish to condition on text during training and sampling
).cuda()

for unet_number in (1, 2):
    loss = decoder(images, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
    loss.backward()

# do above for many steps
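(In practice "many steps" means hours of training, so you would also checkpoint the weights along the way. A hedged sketch of saving and restoring a module's state dict, using a toy Linear layer in place of the prior and unets; the filename is just illustrative:)

```python
import os
import tempfile
import torch

# Stand-in for one of the trained modules (diffusion_prior, unet1, unet2, ...).
model = torch.nn.Linear(4, 2)

# state_dict() captures all learnable parameters and buffers.
path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')
torch.save(model.state_dict(), path)

# To restore: rebuild the module with the same hyperparameters, then load.
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
```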

dalle2 = DALLE2(
    prior = diffusion_prior,
    decoder = decoder
)

generating images:

images = dalle2(
    ['a butterfly trying to escape a tornado'],
    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
)

and trying to save:

from torchvision.utils import save_image
save_image(images[0], 'img.png')

but the img.png is just plain noise... What am I missing here? Can someone please tell me? I just want to try out the code; I am new to ML.

rom1504 commented 2 years ago

You're missing the part "use a large dataset and a GPU for many hours to do a lot of forward, backward and optimization passes"

There is no pretrained model yet


dani3lh00ps commented 2 years ago

Okay, so even though we use the pretrained CLIP model from OpenAI, we still need to train the prior and the decoder ourselves...