lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License

Typo for text_encodings? #74

Closed CiaoHe closed 2 years ago

CiaoHe commented 2 years ago

Hi, me again (lol)

Just curious: why is the initialized text_encodings' sequence length set to 0? https://github.com/lucidrains/DALLE2-pytorch/blob/8b054686530c90ecd8e8db62eb9c648d189accf9/dalle2_pytorch/dalle2_pytorch.py#L747-L748
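
For context, a rough sketch of what that initialization does (batch, dim, device and dtype stand in for the surrounding code's variables; this is my reading, not the library's exact code). A zero-length tensor is a valid no-op under torch.cat, which is presumably the intent:

import torch

batch, dim = 4, 512
device, dtype = torch.device('cpu'), torch.float32

# zero-length placeholder along the sequence dimension
text_encodings = torch.empty((batch, 0, dim), device = device, dtype = dtype)

# concatenating it contributes no tokens
text_embed = torch.randn(batch, 1, dim, device = device, dtype = dtype)
tokens = torch.cat((text_encodings, text_embed), dim = -2)
print(tokens.shape)  # torch.Size([4, 1, 512])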

I tried the test code in the README, but it throws an error:

The concatenated tokens' shape is (b, 4, d), from [text_encodings (b, 0, d), text_embed (b, 1, d), time_embed (b, 1, d), image_embed (b, 1, d), learned_queries (b, 1, d)], but the mask's shape is (b, 5), so the two don't line up.
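
Here's a minimal repro of the arithmetic (a sketch with made-up shapes, not the library's actual code): with zero-length text_encodings the token sequence sums to 4, while a mask built for 5 entries no longer matches:

import torch

b, d = 4, 512
tokens = torch.cat((
    torch.empty(b, 0, d),   # text_encodings (zero-length)
    torch.randn(b, 1, d),   # text_embed
    torch.randn(b, 1, d),   # time_embed
    torch.randn(b, 1, d),   # image_embed
    torch.randn(b, 1, d),   # learned_queries
), dim = -2)

mask = torch.ones(b, 5, dtype = torch.bool)
print(tokens.shape[-2], mask.shape[-1])  # 4 vs 5 -> shapes disagree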

Here's the DiffusionPrior config I used:

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
).cuda()

But when I write text_encodings = torch.empty((batch, 1, dim), device = device, dtype = dtype) instead, the error goes away.

Please take a look. Enjoy!

lucidrains commented 2 years ago

@CiaoHe hey again :)

it seems to all work for me, could you paste the whole script you are running? (i think the text encoding conditioning should be present, but it's good you are testing it without)

CiaoHe commented 2 years ago

oh, I just copied this part:

import torch
from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, CLIP

# get trained CLIP from step one

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
).cuda()

# setup prior network, which contains an autoregressive transformer

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

# diffusion prior network, which contains the CLIP and network (with transformer) above

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
).cuda()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# precompute the text and image embeddings
# here using the diffusion prior class, but could be done with CLIP alone

clip_image_embeds = diffusion_prior.clip.embed_image(images).image_embed
clip_text_embeds = diffusion_prior.clip.embed_text(text).text_embed

# feed text and images into diffusion prior network

loss = diffusion_prior(
    text_embed = clip_text_embeds,
    image_embed = clip_image_embeds
)

loss.backward()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings

Yeah, I think text_encodings should be provided, but anyway I'm just curious about all the possibilities
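
For reference, turning the conditioning on would look roughly like this (a sketch following the README's first prior example; assuming the forward pass accepts raw text when condition_on_text_encodings = True):

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = True
).cuda()

# pass raw text so the prior can derive both the text embeddings and encodings from CLIP
loss = diffusion_prior(text, images)
loss.backward()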

lucidrains commented 2 years ago

@CiaoHe oh weird, it runs without error for me

lucidrains commented 2 years ago

@CiaoHe are you on the latest version? git stash && git pull

CiaoHe commented 2 years ago

> @CiaoHe are you on the latest version? git stash && git pull

Aha, I just read each line of your code and typed it out (so let me check your latest version first)

lucidrains commented 2 years ago

@CiaoHe wow, you retyped all ~2k lines of code? 🤯

CiaoHe commented 2 years ago

> @CiaoHe wow, you retyped all ~2k lines of code? 🤯

Sounds really funny haha. But I think it's better to check each line than to just fork the whole thing down

CiaoHe commented 2 years ago

Yeah, weird, the latest version works fine. My bad haha (let me check it)

I set the mask wrong: mask = torch.ones((batch, text_encodings.shape[-2]), device=device, dtype=torch.bool). In place of text_encodings.shape[-2] I had written text_embeds.shape[-2].
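
For anyone who hits the same error, the two variants side by side (a sketch reusing the names from the line above):

# wrong: sized off text_embeds, so the mask length disagrees with the concatenated tokens
# mask = torch.ones((batch, text_embeds.shape[-2]), device = device, dtype = torch.bool)

# right: sized off text_encodings, matching the token sequence the prior actually attends over
mask = torch.ones((batch, text_encodings.shape[-2]), device = device, dtype = torch.bool)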