openai / DALL-E

PyTorch package for the discrete VAE used for DALL·E.

How to sample or generate a new image? #17

Closed JohnDreamer closed 3 years ago

JohnDreamer commented 3 years ago

Hi, great work! But I'm a little confused about how to generate a new image. Should I provide the sentence tokens and then use them to predict the image tokens? And where is the noise injected? It would be much appreciated if you could answer these questions, thank you!

amish-logicwind commented 3 years ago

I have the same issue. I want to generate images from text. How do I give a string/sentence as input and get images as output, as mentioned here? Please let me know. Thanks!

EmilEOGG commented 3 years ago

I am not sure whether this is possible using this demo, since it contains only the encoder and decoder, and we have no knowledge whatsoever of the language-embedding part (apart from the vocab size). I agree it would be cool to have it, but I doubt it will be released.

fractaldna22 commented 3 years ago

https://colab.research.google.com/drive/1oA1fZP7N1uPBxwbGIvOEXbTsq2ORa9vb?usp=sharing#scrollTo=O78RfTZfh7ji

Here -- use THIS! This is how to do it

amish-logicwind commented 3 years ago

Thank you @fractaldna22 for sharing! I have looked at it. Is there a way to train it on my own dataset along with the text/captions? Please let me know.

TPreece101 commented 3 years ago

@amish-logicwind I think this is probably what you need: https://github.com/lucidrains/DALLE-pytorch

@fractaldna22 Can you explain the notebook a little bit? Am I right in thinking that you are fitting some kind of latent space and evaluating its quality using CLIP?

fractaldna22 commented 3 years ago

@TPreece101 It's not my notebook; it belongs to @advadnoun, who also created several variations. But I believe that's roughly what it does.

CLIP isn't evaluating the quality but rather how closely the generated image matches the tokenized text input. The DALL·E encoder and decoder convert the image and text into tokens and map them onto the latent space; then CLIP, using a Vision Transformer (ViT-B/32), evaluates whether the image matches the categories denoted in the text input. It returns a loss, then the pixels are unmapped again, adjusted, mapped, sent back to the ViT, and so on, until it converges or collapses depending on various factors.

IMO it's CLIP's ViT-B/32 model that's the real gem here. It's CLIP that is so good at matching images with text and going multi-modal with it. It's only limited by the size of the ViT and its training dataset.

There are multiple models you can use for CLIP: the regular ViT-B/32, RN50, RN101, RN50x4, and there are several new ViTs out, including ViT-L/32, which is 1.4 GB vs the default B/32 at only about 300 MB. There are also hybrid models.

I don't know how to set up the new ViT models so that they load in CLIP. CLIP's model loader specifies that if a checkpoint is not in its list, or downloaded via one of its URLs, it will refuse to load it. And inside the ViT checkpoint it has to have certain keys like "apply" and "version". I have no idea how to set that up, even though the ViTs are pre-trained on a large dataset.
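(For reference, the checkpoints CLIP will accept can be listed from the package itself; anything outside that list has to be converted to CLIP's own checkpoint format first. A minimal sketch, assuming the openai/CLIP pip package is installed:)

import torch
import clip

# list the checkpoints the package knows how to download and load
print(clip.available_models())   # e.g. ['RN50', 'RN101', 'RN50x4', 'ViT-B/32', ...]

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
perceptor, preprocess = clip.load('ViT-B/32', device=device, jit=True)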

"I think this is probably what you need: https://github.com/lucidrains/DALLE-pytorch"

@amish-logicwind and @TPreece101

Eh. I haven't seen any images from that. And that notebook focuses on training the VAE, which is just an autoencoder and not a vision transformer, so training it on a dataset does nothing by itself.

Try these: https://www.kaggle.com/abhinand05/vision-transformer-vit-tutorial-baseline/notebook , https://github.com/rwightman/pytorch-image-models/

Just make sure it's one of the compatible models: ViT-B/32, RN50, RN101, RN50x4. If anyone can figure out how to load a different pretrained model into CLIP with this notebook, I would love to know how.

If you're going to train something, train the vision transformer model using the JAX, CLIP, or timm recipes.
But if you have a pretrained model that's already trained on the entire ImageNet, CIFAR, etc., why do you need to train it? It can literally imagine almost anything; it can combine concepts based on what it already knows, as long as you give it the right prompt. You can say "JFK as an anime character" and it will make that, without training on any anime dataset. It already has the knowledge and can infer what you meant.

The only thing the ViT-B/32 model really struggles with, at least in this implementation, is focusing its vision onto a central point; it sees kind of cross-eyed. Especially when it's trying to make a human face, it tends to overlap two faces slightly, as if its literal eyes were bad. Maybe someone will spot the problem; it could be something as simple as changing a 48 to a 32 in one of the parameters.

You can just set

perceptor, preprocess = clip.load('ViT-B/32', torch.device('cuda:0'), jit=True)
perceptor.train()

instead of perceptor.eval(), and I think that will cause it to improve itself over time the more you use it. When you're done, !cp the model out of '/root/.cache/clip/ViT-B-32.pt' to your Google Drive so you can load it back into .cache/clip next time and keep giving it practice.
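(A sketch of that Colab save/restore workflow, assuming Drive is mounted at its default path; the Drive filename is just an example:)

from google.colab import drive
drive.mount('/content/drive')

# end of session: save the cached CLIP weights to Drive
!cp /root/.cache/clip/ViT-B-32.pt /content/drive/MyDrive/ViT-B-32.pt

# next session: restore the cache before calling clip.load()
!mkdir -p /root/.cache/clip
!cp /content/drive/MyDrive/ViT-B-32.pt /root/.cache/clip/ViT-B-32.pt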

TPreece101 commented 3 years ago

@fractaldna22 Thanks for the explanation - makes sense, although I will dig in a bit more later to get a better understanding.

I've been playing around with the notebook and for some reason all of my pictures come out with a lot of white in the middle - see the pic for the prompt 'A portrait of Abe Lincoln': https://ibb.co/nBtdC4T

I'm guessing something is exploding somewhere pushing the RGB values to the max but I'm going to investigate in more detail - just wondering if anyone has any ideas?

fractaldna22 commented 3 years ago

@TPreece101 it has to do with the default temperature, or tau, in the latent coordinates.

In this version of the notebook it's defined by "hadies" (temperature, lol). The best measure against the image collapsing is turning it up to 1.4, but you can turn it up to 2, 2.5 or any positive number; 1.4-1.7 generally works best. The higher you go, the thicker the image, but it also introduces some weirdness or double vision if you go too high.
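(A tiny standalone illustration of what that scale factor does: multiplying the logits by a larger hadies before the softmax makes the distribution over the 8192 codebook entries peakier, which is what fights the washed-out / collapsed look. A sketch:)

import torch

logits = torch.randn(8192)                    # stand-in for one latent position's logits
for hadies in (1.0, 1.4, 2.5):
    probs = torch.softmax(hadies * logits, dim=0)
    print(hadies, probs.max().item())         # larger hadies -> sharper, more committed codes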

I would use this instead; I'll edit the things that I usually change when using the default notebook.
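(For context, the cells below assume the setup cell from the original notebook has already run. Roughly something like this - a sketch with assumed names; the decoder URL is the one from this repo's README, and displ is the notebook's image-display helper:)

import numpy as np
import torch
import torchvision
import imageio
import clip
from dall_e import load_model, unmap_pixels
from IPython import display

device = torch.device('cuda:0')
perceptor, preprocess = clip.load('ViT-B/32', device, jit=True)          # CLIP scorer
model = load_model('https://cdn.openai.com/dall-e/decoder.pkl', device)  # dVAE decoder

batch_size = 1
sideX, sideY = 512, 512          # the 64x64 latent grid decodes to 512x512 pixels
text_input = 'a hedgehog made of violins'

def displ(img):
    # save a CHW float array in [0, 1] as a PNG and show it inline
    img = np.transpose(np.array(img), (1, 2, 0))
    imageio.imwrite(str(3) + '.png', np.array(255 * np.clip(img, 0, 1), dtype=np.uint8))
    return display.display(display.Image(str(3) + '.png'))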

Latent coordinates

class Pars(torch.nn.Module):
    # holds the learnable one-hot-ish latent grid that the dVAE decoder turns into pixels
    def __init__(self):
        super(Pars, self).__init__()

        hots = torch.nn.functional.one_hot((torch.arange(0, 8192).to(torch.int64)), num_classes=8192)
        rng = torch.zeros(batch_size, 64*64, 8192).uniform_()**torch.zeros(batch_size, 64*64, 8192).uniform_(.1,1)
        for b in range(batch_size):
          for i in range(64**2):
            rng[b,i] = hots[[np.random.randint(8191)]]

        rng = rng.permute(0, 2, 1)

        self.normu = torch.nn.Parameter(rng.cuda().view(batch_size, 8192, 64, 64))

    def forward(self):

      normu = torch.softmax(hadies*self.normu.reshape(batch_size, 8192//2, -1), dim=1).view(batch_size, 8192, 64, 64)
      return normu

lats = Pars().cuda()
mapper = [lats.normu]
optimizer = torch.optim.Adam([{'params': mapper, 'lr': .10}])
eps = 0

tx = clip.tokenize(text_input)
t = perceptor.encode_text(tx.cuda()).detach().clone()

nom = torchvision.transforms.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))

will_it = False
hadies = 1.4
with torch.no_grad():
  al = unmap_pixels(torch.sigmoid(model(lats()).cpu().float())).numpy()
  for allls in al:
    displ(allls[:3])
    print('\n')
  # print(lats())
  # print(lats().sum())

For the next cell (train / generate):

def checkin(loss):
  global hadies
  print("\n##########################################################\n",
        loss, ' (loss)\n', itt)

  with torch.no_grad():

    al = unmap_pixels(torch.sigmoid(model(lats())[:, :3]).cpu().float()).numpy()
    for allls in al:
      displ(allls)
      display.display(display.Image(str(3)+'.png'))
      print('\n')

def ascend_txt():
  out = unmap_pixels(torch.sigmoid(model(lats())[:, :3].float()))

  cutn = 32 # improves quality
  p_s = []
  for ch in range(cutn):
    size = int(sideX*torch.zeros(1,).uniform_(.5, .99))#.normal_(mean=.4, std=.80).clip(.5, .98))
    offsetx = torch.randint(0, sideX - size, ())
    offsety = torch.randint(0, sideX - size, ())
    apper = out[:, :, offsetx:offsetx + size, offsety:offsety + size]
    apper = torch.nn.functional.interpolate(apper, (224,224), mode='nearest')
    p_s.append(apper)
  into = torch.cat(p_s, 0)
  # into = torch.nn.functional.interpolate(out, (224,224), mode='bilinear')

  into = nom((into))

  iii = perceptor.encode_image(into)

  lat_l = 0

  return [lat_l, 100*-torch.cosine_similarity(t, iii).view(-1, batch_size).T.mean(1)]

def train(i):
  global hadies
  loss1 = ascend_txt()
  loss = loss1[0] + loss1[1]
  loss = loss.mean()
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()

  # hadies /= 1.01
  # hadies = max(hadies, 1.5)

  for g in optimizer.param_groups:
    g['lr'] = g['lr']*1.0

  if itt % 10 == 0:
    # print('temp', hadies)
    # print(g['lr'], 'lr')
    checkin(loss1)

itt = 0
for asatreat in range(10000):
  train(itt)
  itt+=1

Hopefully those work, because I'm just eyeballing it here, lol. It's definitely the tau / hadies though - that's what gets you more layers.

Also, I lowered the learning rate from 1.5 to 1.0. When it's too fast it's more likely to collapse. 1 is usually good; sometimes even lower is better for quality.

TPreece101 commented 3 years ago

Great, thanks a lot @fractaldna22, it has stopped collapsing now. Have you managed to get any good results out of it so far?

fractaldna22 commented 3 years ago

image

"a hedgehog made of violins. A hedgehog with the texture of violin"

amish-logicwind commented 3 years ago

@fractaldna22 The results seem to be really good. It would be great if you can share your colab link. I tried changing as per your suggestions, but the results are not satisfactory. Have a look at this: https://colab.research.google.com/drive/1oA1fZP7N1uPBxwbGIvOEXbTsq2ORa9vb?usp=sharing Actually, I want it to generate textile design patterns. Is there a way I can train it on my data?

TPreece101 commented 3 years ago

@fractaldna22 Have you managed to find a url to download ViT-L-32? I can't seem to find any references to it anywhere?

fractaldna22 commented 3 years ago

I have, but I can't figure out how to configure it for use by CLIP. It's obviously possible, but it's confusing and I'm still new.

Look for "ViT PyTorch" - those repos usually have pretrained models. The most accurate model currently is ViT-H/14, the "huge" model; it achieves around 99% accuracy in some tests.

If it can classify it, it can do the reverse and guide the generation of images. The more accurate CLIP is, the more accurate the VAE output is.
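(For finding pretrained ViT checkpoints, the timm library linked earlier in the thread exposes them by name. A sketch - the model name here is just a common example, not one that plugs into CLIP directly:)

import timm
import torch

vit = timm.create_model('vit_base_patch16_224', pretrained=True)
vit.eval()
with torch.no_grad():
    logits = vit(torch.randn(1, 3, 224, 224))   # ImageNet class logits
print(logits.shape)                             # torch.Size([1, 1000])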

fractaldna22 commented 3 years ago

Custom Notebook

fractaldna22 commented 3 years ago

well? no dopamine reward? lol

JohnDreamer commented 3 years ago

@fractaldna22 Thanks for your reply! But I'm still a little confused. In the paper, the image generation steps are: 1. use the transformer to generate image tokens, with the text tokens as input; 2. put the generated image tokens into the dVAE decoder to produce RGB images; 3. use CLIP to select the best one. However, in your script, the image generation goes: 1. randomly initialize the index matrix; 2. put the index matrix into the dVAE decoder to produce an RGB image; 3. use CLIP to calculate the similarity between the generated image and the text tokens as a loss to optimize the index matrix, while keeping the weights of the network unchanged. In your pipeline there isn't the transformer predictor mentioned in the paper. Could you please explain this? Thank you!

TPreece101 commented 3 years ago

Thanks @fractaldna22 that notebook is much easier to use for tweaking and such.

fractaldna22 commented 3 years ago

@JohnDreamer It's not my code - but the reason is that DALL-E didn't release its transformer; it released a VERY small, simple autoencoder that is hardly necessary. We did fine without it, using it only as a decoder to remap the pixels.

CLIP is the real mastermind behind the image generation because Attention is all you need.

JohnDreamer commented 3 years ago

@fractaldna22 Got it! Thank you very much!

fractaldna22 commented 3 years ago

I just made this with another CLIP + (different VAE) notebook which I'm not at liberty to share, but I can share the result: https://www.youtube.com/watch?v=5HcBxeS7jkQ

JohnDreamer commented 3 years ago

@fractaldna22 In fact, if we follow the paper's pipeline, we input the text tokens into GPT-3 and predict the image tokens one by one. If the input text tokens are fixed, then the generated image tokens are determined as well. How does DALL-E generate diverse images? Does it sample the next image token by probability? Do you have any idea about this?

fractaldna22 commented 3 years ago

It just knows what objects are labeled; it can label images going in and do the same thing in reverse, finding images based on labels like a search. It chooses pixels using cosine similarity - how strongly the CLIP neurons associated with a label react to an image. It's not like an image has one token; it has billions. Each pixel is a token. There's no deterministic mapping from a text token to a pixel - that would literally be magic. It takes the whole label, compares it to a dataset of images with that label, and then the VAE maps it onto the image.

Loss functions, weight decay and things like softmax, normalization, augmentations etc. are the only way to make it not seem like a Google image search. They add noise, and noise is the key to making it continuously choose new ways to match a label with a stack of image shapes.
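(A minimal sketch of the similarity score being described, using the openai/CLIP package; the image tensor here is just random noise standing in for a decoded, preprocessed image:)

import torch
import clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

image = torch.randn(1, 3, 224, 224, device=device)   # stand-in for a preprocessed image
text = clip.tokenize(['a red ball']).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

similarity = torch.cosine_similarity(image_features, text_features)
print(similarity.item())   # higher = the image matches the label more closely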


watashiwa-toki commented 3 years ago

Custom Notebook

@fractaldna22 Great! Is it possible to change output image resolution? (pic - result from "A samurai fighting with ninja" text input)

samurai_ninja

Mimocro commented 3 years ago

I'll just put this here. For my own purposes I changed the code from https://github.com/openai/DALL-E/issues/17#issuecomment-797507142, and I've now translated it back into English: https://colab.research.google.com/drive/1fS8D61V-CTlup7nsSK-KsXGwLVFio20o?usp=sharing

For now this notebook has:
- no errors when setting up the environment
- the significant parameters exposed
- almost all the code (except the counter, which must be initialized outside the main cell) in one cell
- the option to download all generated pictures in ZIP or GIF format
- a rough English translation
- an easy interface
- a checkbox to enable or disable normalization ("dry" colors)

"a red ball", 1.2k iterations: a red ball

"A samurai fighting with ninja", 1.3k iterations (I don't know why one samurai is fighting a horse and the ninja is fighting a tree): A samurai fighting with ninja

2.2k iterations: A samurai fighting with ninja 2

amish-logicwind commented 3 years ago

@Mimocro @fractaldna22 @JohnDreamer @TPreece101 Is there a way to generate a group of images (say 10 or 12)? What I could see is that it generates a single image and iteratively improves it. It would be great if multiple images could be generated, similar to this notebook.

Mimocro commented 3 years ago

@amish-logicwind All of the colabs here (including mine, if you disable the cleaner) can generate a group of images, but not 10 or 12 - only a few at a time. Also, to see the differences between iterations in my version of the colab, you can run the cells under the main cell to save them in GIF or ZIP format.
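(In the notebook code posted earlier in this thread, the number of images optimized in parallel is set by batch_size; a sketch reusing the names from that snippet:)

batch_size = 4                 # optimize four latent grids side by side (watch GPU memory)
lats = Pars().cuda()           # lats.normu now has shape (4, 8192, 64, 64)
optimizer = torch.optim.Adam([{'params': [lats.normu], 'lr': .10}])

# checkin() then decodes and displays every image in the batch,
# since it loops over the batch dimension of the decoded output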

amish-logicwind commented 3 years ago

@Mimocro @fractaldna22 @JohnDreamer @TPreece101 I am working on a project in which I need to generate images based on some training images. I have done that here, but the output images are very small. I am a beginner with GANs and don't know which parameters must be changed in the discriminator and generator to produce images of size 200 x 200. Please guide me a little. It would be great if any of you could have a look at the above colab and suggest the necessary changes. Thanks!

Mimocro commented 3 years ago

@amish-logicwind Maybe change, in the 4th cell, HEIGHT = 32 and WIDTH = 54 to HEIGHT = 200 and WIDTH = 200? Also, I'm a beginner too, and this colab doesn't work for me because I don't have the required directory in my Drive, so I can't do much with it. But try changing the code that shows the image via matplotlib so that it just displays the image directly.

amish-logicwind commented 3 years ago

@Mimocro Bro, this is not a one-line change. We need to change the parameters in the layers also.

Mimocro commented 3 years ago

@amish-logicwind Why? Those lines only change the size of the output drawn with plt. The original sizes are more than 200x200 and I don't think they need to change. The problem is that plt is drawing a big picture at a small size. Just remove the plt calls and replace the display function that runs every n epochs with something like display(generated_images) or an equivalent.
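(A sketch of that idea - show the generated batch at a readable size instead of matplotlib's small default grid; generated_images is assumed to be an (N, H, W, C) array with values in [0, 1]:)

import numpy as np
import matplotlib.pyplot as plt

def show_full_size(generated_images):
    # one reasonably sized panel per generated image
    imgs = np.asarray(generated_images)
    fig, axes = plt.subplots(1, len(imgs), figsize=(4 * len(imgs), 4))
    for ax, img in zip(np.atleast_1d(axes), imgs):
        ax.imshow(img)
        ax.axis('off')
    plt.show()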

amish-logicwind commented 3 years ago

@Mimocro Don't you think the parameters in the first layers of the generator model must change? As we have increased the image size, we need more layers to deconvolve it, right?

Mimocro commented 3 years ago

@amish-logicwind I think not. As far as I can see, the 3rd cell sets the real sizes, while HEIGHT and WIDTH just set the displayed output size. I mean, in the output of the training cell you see a downscaled image, while the network generates images at the sizes set by the categories in the first cell. I'm not a pro at this and don't know the best way to view the full-size images, but I don't think you really need the full size - it's more than 200x200.

adityaramesh commented 3 years ago

The transformer used to generate the images from the text is not part of this code release. I've since modified the README to state this explicitly.

There are Colab notebooks available that can be used to generate images by steering a generative model with CLIP, but these are unrelated to this release.

Asthestarsfalll commented 2 years ago

@fractaldna22 Thanks for your reply! But I'm still a little confused. In the paper, the image generation steps are: 1. use the transformer to generate image tokens, with the text tokens as input; 2. put the generated image tokens into the dVAE decoder to produce RGB images; 3. use CLIP to select the best one. However, in your script, the image generation goes: 1. randomly initialize the index matrix; 2. put the index matrix into the dVAE decoder to produce an RGB image; 3. use CLIP to calculate the similarity between the generated image and the text tokens as a loss to optimize the index matrix, while keeping the weights of the network unchanged. In your pipeline there isn't the transformer predictor mentioned in the paper. Could you please explain this? Thank you!

Hi, I'm wondering: is the index matrix learnable? My understanding is that during the optimization stage, the input index matrix is the only parameter, and training ends once the loss from CLIP is low enough. Thanks for your help in advance!

fractaldna22 commented 2 years ago

During optimization, which is continuous and is the main and only way to generate images from text with this method, the loss is essentially negative and has no lower bound. It is always some multiple of -1 or -0.1 (to taste), and each time step is only slightly higher or lower: -1.53, -1.3, -0.5, -1.6, -0.1, etc. There is no end state or "finished" point except wherever you decide to stop based on subjective preference. The loss is less important than the actual image itself; the loss really tells you nothing. And we actually want continuous improvement - for it to constantly be striving to improve the image - especially if you're making an animation: you just keep adding noise, encode the image, replace the tensor in the optimizer with the newly encoded tensor, take an optimizer step, decode the tensor and add noise, encode, optimizer step, and so on. If you slightly zoom in or alter the image with an affine transform every time step, it forces the model to recognize new features and the image evolves as well as any SOTA text-to-video model.
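(A rough sketch of that decode → nudge → re-encode loop, translated to this repo's dVAE encoder/decoder API; the CLIP loss step is elided and the zoom factor is arbitrary:)

import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from dall_e import load_model, map_pixels, unmap_pixels

dev = torch.device('cuda:0')
enc = load_model('https://cdn.openai.com/dall-e/encoder.pkl', dev)
dec = load_model('https://cdn.openai.com/dall-e/decoder.pkl', dev)

z_logits = torch.randn(1, 8192, 64, 64, device=dev, requires_grad=True)
optimizer = torch.optim.Adam([z_logits], lr=0.1)

for step in range(1000):
    # ... compute the CLIP loss on the decoded image and call optimizer.step(),
    #     as in the cells posted earlier in this thread ...

    with torch.no_grad():
        # decode the current latents to pixels
        z = F.one_hot(torch.argmax(z_logits, dim=1), num_classes=8192).permute(0, 3, 1, 2).float()
        x = unmap_pixels(torch.sigmoid(dec(z)[:, :3]))
        # nudge the frame (a tiny zoom) so new features have to be found
        x = TF.affine(x, angle=0.0, translate=[0, 0], scale=1.01, shear=[0.0])
        # re-encode and swap the result into the tensor the optimizer is updating
        z_logits.data.copy_(enc(map_pixels(x)))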


fractaldna22 commented 2 years ago

I was thinking of VQGAN, sorry. The dVAE is similar but much less fluid and less robust against collapsing. There is typically only one image from dVAE generation, and it only refines the image it initially settles on; it doesn't tend to evolve the way VQGAN does.