lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Official DALL-E discrete VAE release #53

Open robvanvolt opened 3 years ago

robvanvolt commented 3 years ago

There has been an official PyTorch release of the discrete VAE used for DALL-E: https://github.com/openai/DALL-E Also, the paper finally got released with more details on the methods: https://arxiv.org/pdf/2102.12092.pdf

Does your pytorch package already incorporate the parameters from the above sources, or is there more fine-tuning to do? If so, in what ways do the models differ?
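For anyone who wants to poke at the released dVAE directly, the openai/DALL-E repo exposes it roughly like this (a minimal sketch following that repo's usage notebook; the model URLs and helper names come from that repo, not this thread):

```python
import torch
import torch.nn.functional as F
from dall_e import load_model, map_pixels, unmap_pixels  # pip install DALL-E

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# download and load the released encoder / decoder weights
enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", device)
dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", device)

# x: an image batch in [0, 1], shape (B, 3, 256, 256), mapped into the dVAE's pixel range
x = map_pixels(torch.rand(1, 3, 256, 256, device=device))

# encode to discrete codes, then reconstruct
z_logits = enc(x)
z = torch.argmax(z_logits, dim=1)                                  # (B, 32, 32) token ids
z_one_hot = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()
x_stats = dec(z_one_hot).float()
x_rec = unmap_pixels(torch.sigmoid(x_stats[:, :3]))                # reconstruction in [0, 1]
```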

sorrge commented 3 years ago

Great! I wonder if they plan to release the generator part as well. Edit: apparently not: https://github.com/openai/DALL-E/issues/4

lucidrains commented 3 years ago

@robvanvolt indeed! there are a couple of things that are different, which I will reconcile by the end of the week. Today I'm adding the exponential moving average piece

By the end of today I'll also include an option so one can jump right into DALL-E training with the released version of the VAE, instead of having to train your own

lucidrains commented 3 years ago

@robvanvolt ok done here https://github.com/lucidrains/dalle-pytorch#openais-pretrained-vae
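For context, the linked README section wires the pretrained VAE in roughly like this (a sketch based on that section; the hyperparameters are illustrative and the exact constructor arguments may differ across dalle-pytorch versions):

```python
import torch
from dalle_pytorch import OpenAIDiscreteVAE, DALLE

vae = OpenAIDiscreteVAE()    # downloads and loads OpenAI's pretrained dVAE

dalle = DALLE(
    dim = 1024,
    vae = vae,               # image sequence length / number of image tokens are inferred from the VAE
    num_text_tokens = 10000, # text vocab size
    text_seq_len = 256,
    depth = 16,
    heads = 16,
    dim_head = 64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)

loss = dalle(text, images, return_loss = True)
loss.backward()
```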

molo32 commented 3 years ago

What advantage does training my own VAE have over the one provided by OpenAI?

lucidrains commented 3 years ago

@molo32 if you are a researcher who thinks you could do better. Otherwise I'd just stick with the pretrained model and skip to training DALLE

Mut1nyJD commented 3 years ago

@molo32

Also, it is not clear what data the VAE has been trained on (at least I could not find any information about that in the paper). It is probably very general, so if you have a specific domain you might be better off training your own VAE.
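As a point of comparison, training a domain-specific VAE with dalle-pytorch's own DiscreteVAE looks roughly like this (a sketch following the repo's README; the hyperparameters are illustrative, not recommendations from this thread):

```python
import torch
from dalle_pytorch import DiscreteVAE

vae = DiscreteVAE(
    image_size = 256,
    num_layers = 3,           # image is downsampled by 2 ** num_layers
    num_tokens = 8192,        # codebook size
    codebook_dim = 512,
    hidden_dim = 64,
    num_resnet_blocks = 1,
    temperature = 0.9,        # gumbel-softmax temperature
    straight_through = False
)

images = torch.randn(4, 3, 256, 256)  # replace with batches from your own domain

loss = vae(images, return_loss = True)
loss.backward()
# repeat over your dataset, then pass the trained vae into DALLE(...)
```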

robvanvolt commented 3 years ago

@molo32 they mention the dataset used in the supplementary files:

In order to train the 12-billion parameter transformer, we created a dataset of a similar scale to JFT-300M by collecting 250 million text-image pairs from the internet. As described in Section 2.3, this dataset incorporates Conceptual Captions, the text-image pairs from Wikipedia, and a filtered subset of YFCC100M. We use a subset of the text, image, and joint text and image filters described in Sharma et al. (2018) to construct this dataset. These filters include discarding instances whose captions are too short, are classified as non-English by the Python package cld3, or that consist primarily of boilerplate phrases such as "photographed on ⟨date⟩", where ⟨date⟩ matches various formats for dates that we found in the data. We also discard instances whose images have aspect ratios not in [1/2, 2]. If we were to use very tall or wide images, then the square crops used during training would likely exclude objects mentioned in the caption.

and

The dVAE is trained on the same dataset as the transformer, using the data augmentation code given in Listing 1.
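The filtering rules quoted above translate to something like the following (a hypothetical sketch, not OpenAI's actual code; the caption-length threshold is made up since the paper gives no exact value, and it assumes the pycld3 package for language detection):

```python
import cld3  # pycld3, one common binding for the cld3 language detector

MIN_CAPTION_WORDS = 3  # hypothetical threshold; the paper only says "too short"

def keep_pair(caption: str, width: int, height: int) -> bool:
    # discard captions that are too short
    if len(caption.split()) < MIN_CAPTION_WORDS:
        return False
    # discard captions classified as non-English
    pred = cld3.get_language(caption)
    if pred is None or pred.language != "en" or not pred.is_reliable:
        return False
    # discard images whose aspect ratio is outside [1/2, 2]
    aspect = width / height
    if not (0.5 <= aspect <= 2.0):
        return False
    return True
```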