lucidrains / DALLE-pytorch

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch
MIT License
5.54k stars · 644 forks

More "OpenAI Blog Post" Training | Depth 32 | Heads 8 | LR 5e-4 #86

Closed · afiaka87 closed this 3 years ago

afiaka87 commented 3 years ago

Edit: Moved to discussions: https://github.com/lucidrains/DALLE-pytorch/discussions/106

Hey, all. Some of you might know I'm practicing and learning about machine learning with dalle-pytorch and a dataset consisting of the images OpenAI presented in the DALL-E blog post. I honestly don't have the money to train on this whole dataset.

Edit: this is no longer true. Using the 1024 VQGAN from the "Taming Transformers" research, it's now quite possible to train on the full dataset of 1,000,000 image-text pairs, and I'm doing just that. I hope to have it finished in about a week. I assume someone else will release a dalle-pytorch model trained properly on COCO and other image sets before then, but if they don't, check here for updates.

Anyway, it ran for ~36,000 steps. As you can see, it...still really likes mannequins. I'm considering removing them from the dataset. You'll also notice, though, that the network has actually developed a decent idea of the general colors that belong with each type of prompt.

Some Samples from Near the End of Training

[image: results]

Every Text-Image Reconstruction

https://wandb.ai/afiaka87/dalle_pytorch_live_training/reports/dalle-pytorch-Test-Run-2--Vmlldzo1MzM5MjQ

Deliverables (my train_dalle.py)

https://gist.github.com/afiaka87/850fb3cc48edde8a7ed4cb1ce53b6bd2

This has some code in it that actually manages to deal with truncated images via try/except. Apparently detecting a corrupted PNG is harder than P vs NP. PIL's verify() function doesn't catch all of them, and Python's built-in imghdr library doesn't catch all of them either. So you just catch OSError and return an item further along. Works well enough.
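
For reference, the idea is roughly this (a minimal sketch, not the exact code from the gist; the class name and glob pattern are just for illustration):

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class FaultTolerantImageDataset(Dataset):
    def __init__(self, folder):
        self.paths = sorted(Path(folder).glob('*.png'))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        try:
            # .convert() forces a full decode, so truncated files fail here
            # rather than halfway through the training loop
            return Image.open(self.paths[index]).convert('RGB')
        except OSError:
            # corrupted/truncated image: hand back an item further along instead
            return self.__getitem__((index + 1) % len(self))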

Parameters

SHUFFLE = True
EPOCHS = 28 # This wound up being less than a single epoch, of course. 
BATCH_SIZE = 16
LEARNING_RATE = 0.0005 # I found this learning rate to be more suitable than 0.0003 in my hyperparameter sweep post
GRAD_CLIP_NORM = 0.5
DEPTH = 32
HEADS = 8
MODEL_DIM = 512
TEXT_SEQ_LEN = 256
DIM_HEAD = 64
REVERSIBLE = True
ATTN_TYPES = ('full',)
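
For anyone wondering how these map onto dalle-pytorch, the constructor call looks roughly like this (a sketch only; the VAE choice and num_text_tokens below are placeholders, so check the README for your version):

from dalle_pytorch import OpenAIDiscreteVAE, DALLE

vae = OpenAIDiscreteVAE()        # placeholder - any supported VAE works here

dalle = DALLE(
    dim = MODEL_DIM,             # 512
    vae = vae,
    num_text_tokens = 10000,     # depends on your tokenizer / vocab
    text_seq_len = TEXT_SEQ_LEN, # 256
    depth = DEPTH,               # 32
    heads = HEADS,               # 8
    dim_head = DIM_HEAD,         # 64
    reversible = REVERSIBLE,
    attn_types = ATTN_TYPES
)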

Dataset Description

https://github.com/lucidrains/DALLE-pytorch/issues/61#issuecomment-796663342

Just for more info on the dataset itself: it is roughly 1,100,000 256x256 image-text pairs that were generated by OpenAI's DALL-E. They presented roughly 30k unique text prompts, for each of which they posted the top 32 of 512 generations on https://openai.com/blog/dall-e/. Many images were corrupt, and not every prompt has a full 32 examples, but the total number of images winds up being about 1.1 million. If you look at many of the examples on that page, you'll see that DALL-E (in that form at least) can and will make mistakes. Those mistakes are also in this dataset. Anyway, I'm just messing around, having fun training and whatnot. This is definitely not going to produce a good model or anything.

There are also a large number of images in the dataset which are intended to be used with the "mask" feature. I don't know if that's possible yet in DALLE-pytorch though. Anyway, that can't be helping much.

afiaka87 commented 3 years ago

@lucidrains By the way, I've been going through the wandb.ai docs and found some nice extras you can add to train_dalle.py that will give you live updates on the transformer itself:

config = wandb.config
config.depth = DEPTH
config.heads = HEADS
config.dim_head = DIM_HEAD
config.learning_rate = LEARNING_RATE
config.shuffle = SHUFFLE
config.resume = RESUME
config.batch_size = BATCH_SIZE
config.grad_clip_norm = GRAD_CLIP_NORM
config.reversible = REVERSIBLE
config.model_dim = MODEL_DIM
config.attn_types = ATTN_TYPES

wandb.init(project = PROJECT_NAME, resume = RESUME)

wandb.watch(dalle) # hooks into the model and logs gradient graphs to wandb as it trains

In particular, that very last line is all you really need to add. But attaching all the parameters the way I did lets wandb track them properly, and it makes it much easier to create hyperparameter sweeps from existing projects later on.
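
If the gradient logging ever gets heavy, wandb.watch also takes a couple of knobs (per the wandb docs; worth double-checking against whatever version you have installed):

wandb.watch(dalle, log = 'all', log_freq = 100) # log weights as well as gradients, but only every 100 steps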

lucidrains commented 3 years ago

@afiaka87 ohh got it! i'm circling back to DALL-E this week for some final instrumentations :) i'll be sure to add that! 🙏

afiaka87 commented 3 years ago

@lucidrains Awesome, looking forward to it! Thanks for patching up big-sleep/deep-daze btw. I tried but I'm so distracted with this project now lol.

lucidrains commented 3 years ago

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

lucidrains commented 3 years ago

thanks for doing this! it demonstrates that reversibility does work :)

afiaka87 commented 3 years ago

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open as I can about my training, code, results, etc., but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most, though, and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to the 1024 VQGAN model from Germany! I know it's in the repo currently, but I was still having some trouble with it last I checked. All in due time.

afiaka87 commented 3 years ago

thanks for doing this! it demonstrates that reversibility does work :)

There should be system usage info in the graphs on wandb.ai, but yeah, it does what it says on the label, lol. You definitely trade time for space. But that whole training session never went above 16 GiB of VRAM, so at least people can use Colab!

lucidrains commented 3 years ago

@afiaka87 great to know! also, do log the issue with the VQ-GAN VAE and i'll be sure to fix it this week. It seems to be working on my end, but I haven't tried testing it from a fresh install

afiaka87 commented 3 years ago

@lucidrains One last thing: the "image masking" feature is used pretty heavily in this dataset, and they even include the image used for the mask and everything. Let me know as soon as that feature is implemented, as I would love to use those pairs as a baseline for it.

lucidrains commented 3 years ago

@afiaka87 is that the feature where they have half an image and have it complete the other half?

afiaka87 commented 3 years ago

@lucidrains Yes. The "The exact same cat on the top {insert style qualifier here} on the bottom." style ones. They're passing the top half in, along with a prompt that references both pictures, and presumably forcing the top half to stay the same while it trains.

lucidrains commented 3 years ago

@afiaka87 yup, i can build that :)

afiaka87 commented 3 years ago

Great, let me know ASAP. The zero-shot style transfer stuff is so cool to me.

robvanvolt commented 3 years ago

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open about my training, code, results, etc. but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most though and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to 1024 token model from Germany! I know it's in there currently but I'm still having some trouble with it last I checked. All in due time.

Agreed! Really awesome work by lucidrains, trying to replicate such an awesome tool as DALL-E! If only we could collaborate more efficiently - somewhat like a blockchain, where a few people each improve DALL-E, the best version gets chosen after two days and distributed again, and a new round of optimization begins... I think your hyperparameter session is a great step forward, @afiaka87! I will have my big system running in a week, so I hope to contribute in a more significant way then!

By the way, the Open Images V6 dataset (https://storage.googleapis.com/openimages/web/download.html) has "localized narratives", which might be a perfect fit for training DALL-E! Maybe I will generate a downsampled version (256x256 px) with captions in the format DALL-E requires; that would speed up the search for a training dataset and could improve collaboration.

afiaka87 commented 3 years ago

@robvanvolt Yep, that's a perfect dataset. I found this dataloader for it:

https://github.com/google/localized-narratives/blob/master/localized_narratives.py

And the downloader for the images: https://raw.githubusercontent.com/openimages/dataset/master/downloader.py

You should be able to modify the DataLoader to load the correct image for a given localized narrative fairly easily. This would also lend itself well to Weights & Biases artifacts (you just map URLs to things, and it downloads and caches them for you, pinning versions if things change).
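
What I mean by artifacts is roughly this (a sketch; the project name, artifact name, and URL are placeholders):

import wandb

run = wandb.init(project = 'openimages-narratives')

dataset = wandb.Artifact('localized-narratives', type = 'dataset')
dataset.add_reference('https://storage.googleapis.com/openimages/...') # point this at the hosted files
run.log_artifact(dataset)

# then any other run can pull a pinned version and get it cached locally
data_dir = run.use_artifact('localized-narratives:latest').download()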

Let me know if you get started on this and need any help. I think this would produce a great result!

afiaka87 commented 3 years ago

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open about my training, code, results, etc. but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most though and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to 1024 token model from Germany! I know it's in there currently but I'm still having some trouble with it last I checked. All in due time.

Agree! Really awesome work of lucidrains for trying to replicate such an awesome tool like DALL-E! If we only could collaborate in a more efficient way - somehow like in the blockchain, where a few people improve the DALL-E and the best get chosen after 2 days, gets distributed again and a new search for better optimization begins... I think your hyperparameter session is a great step forward @afiaka87 ! I will have my big system running in a week, so i hope to contribute then in a more significant way!

By the way, the open images V6 dataset (https://storage.googleapis.com/openimages/web/download.html) has "localized" narratives, which might fit perfectly for the Dall-E for training! Maybe I will generate a downsampled version (256x256px) with captions like in the DALL-E format required, that would speed up search for training dataset and could improve collaborations.

I went ahead and downloaded all 500,000 of their images with "localized annotations". I'm training currently! The download is not for the faint of heart though - it winds up being 169 GiB of data. Anyway, I can at least share the proper structure for the "*.txt" files as well as the "file_ids.txt" list of image ids to download.

wget https://www.dropbox.com/s/3s0saz480hlg651/ids_to_download.txt
wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz # contains structure for localized annotations. Plop this folder next to the folder you put your images in.
mkdir -p ~/project_name/captions && tar -xf captions.tar.gz -C ~/project_name/captions

Jinglei5 commented 3 years ago

@afiaka87 yes, arguably getting a DALL-E model trained and released would be bigger than either big sleep or deep daze!

@lucidrains For sure! I'm trying to be as open about my training, code, results, etc. but I'm not seeing much else of that here. I'm aware it's prohibitively expensive for most though and I'm privileged to be able to run Depth=32 for a day or two. At any rate, looking forward to 1024 token model from Germany! I know it's in there currently but I'm still having some trouble with it last I checked. All in due time.

Agree! Really awesome work of lucidrains for trying to replicate such an awesome tool like DALL-E! If we only could collaborate in a more efficient way - somehow like in the blockchain, where a few people improve the DALL-E and the best get chosen after 2 days, gets distributed again and a new search for better optimization begins... I think your hyperparameter session is a great step forward @afiaka87 ! I will have my big system running in a week, so i hope to contribute then in a more significant way! By the way, the open images V6 dataset (https://storage.googleapis.com/openimages/web/download.html) has "localized" narratives, which might fit perfectly for the Dall-E for training! Maybe I will generate a downsampled version (256x256px) with captions like in the DALL-E format required, that would speed up search for training dataset and could improve collaborations.

I went ahead and downloaded all 500,000 of all of their images with "localized annotations". I'm training currently! The download is not for the faint of heart though. Winds up being 169 GiB of data. Anyway, I can at least share the proper structure for the "*.txt" files as well as the "file_ids.txt" list of of image ids to download.

wget https://www.dropbox.com/s/3s0saz480hlg651/ids_to_download.txt
wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz # contains structure for localized annotations. Plop this folder next to the folder you put your images in.
mkdir -p ~/project_name/captions && tar -xf captions.tar.gz -C ~/project_name/captions

Thanks a lot! However, I could not download captions.tar.gz from your Dropbox (maybe the link is broken, since ids_to_download.txt downloads fine). I wonder how you reorganized the captions from the annotations. Did you use the class name as the caption of the image? Thanks again!

afiaka87 commented 3 years ago

@Jinglei5 Hm, I'll see if I can fix that. Unfortunately my internet has just gone out halfway through training 🙄. On my phone till it's back up so it may be a bit.

afiaka87 commented 3 years ago

Hm, Dropbox is insisting that I've set that file to be publicly shared. Would you mind trying again with this?

wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz?dl=0

You'll have to rename the file, as it will include the ?dl=0 bit, but that's the only thing I can think of. If that still doesn't work, I'll host it elsewhere.

@Jinglei5 As for how I reorganized the captions: the current DataLoader literally just expects every unique png in your folder to have a correspondingly named txt file containing its text descriptions. If you go to the "localized annotations" page, you'll find a .jsonl file containing a mapping of each text phrase to image ids. The rest is just some Python scripting to create a bunch of files with the same names as your images and fill them with the correct text descriptions.

Here's my copy of the .jsonl file https://www.dropbox.com/s/9g6hbnyc1pek462/open_images_train_v6_captions.tar.xz?dl=0

Probably best to find the original again though. I'll be back with an edit.
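
For anyone who would rather regenerate the txt files themselves, the scripting I mean is roughly this (a sketch; it assumes each line of the .jsonl has 'image_id' and 'caption' fields, which is what the localized narratives files looked like to me):

import json
from pathlib import Path

jsonl_path = 'open_images_train_v6_captions.jsonl' # from the localized narratives page
captions_dir = Path('captions')
captions_dir.mkdir(exist_ok = True)

with open(jsonl_path) as f:
    for line in f:
        record = json.loads(line)
        out_file = captions_dir / f"{record['image_id']}.txt"
        # some images have several narratives; append so none of them are lost
        with open(out_file, 'a') as out:
            out.write(record['caption'].strip() + '\n')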

afiaka87 commented 3 years ago

Just a general heads-up though - these captions aren't great. Because annotators could use the mouse to "tell the dataset" which part of the image they were referring to, they often leave out explicit directions, knowing that information will be captured anyway.

For instance:

"in this image there is a depiction in the bottom of this image and there are two persons standing on the right side to this , and there are some shelters in the background , and there are some trees as we can see in the bottom of this image , and there is a sky on the top of this image ."

or

"in the down side it is water . in the long back side there are trees and big buildings"

The captions not only contain pretty glaring grammar mistakes, but information about location is also missing from these prompts, because the annotator (labeler? what do we call that?) knows the computer is getting that info from their mouse trace.

Jinglei5 commented 3 years ago

Hm, dropbox is insisting that I've set that file to be publically shared. Wouldy ou mind trying again with this?

wget https://www.dropbox.com/s/ni95in1k7wpetso/captions.tar.gz?dl=0

You'll have to rename the file as it will include the ?dl=0 bit, but that's the only thing I can think of. If that still doesnt work, i'll host it elsewhere.

@Jinglei5 as for how i reorganized the captions, the current DataLoader literally just expects every single unique png in your folder to have a respectively named txt that contains its text descriptions. If you go to the "localized annotations" page, you'll find a .jsonl file containing a mapping of each text phrase to image ids. The rest is just some python scripting to create a bunch of files with the same names as your images and fill them with the correct text descriptions.

Here's my copy of the .jsonl file https://www.dropbox.com/s/9g6hbnyc1pek462/open_images_train_v6_captions.tar.xz?dl=0

Probably best to find the original again though. I'll be back with an edit.

It works this time! Thanks! True, the captions contain phrases like 'In front of the picture' and 'we see'. Not sure whether they are useful or have side effects for the model.

afiaka87 commented 3 years ago

@Jinglei5 I'm gonna try mixing it with COCO2018 to see if it can at least get an idea of what a regular prompt might look like.

afiaka87 commented 3 years ago

@Jinglei5 also currently in the (very lengthy) process of converting all of these to 256px jpegs so I can actually move them around a bit. Do you have an existing workflow for that? Right now I'm just using imagemagick convert in a for loop.

robvanvolt commented 3 years ago

Hm, the annotations looked pretty solid at first glance, but we will see how the grammar mistakes and the missing orientation information get handled...

A few other interesting points:

Dall-E was trained with redundancy, e.g.

a neon sign that reads “backprop”. a neon sign that reads “backprop”. backprop neon sign

So this shouldn't be a problem, contrary to what I previously thought.

64 16 GB NVIDIA V100 GPUs, with a per-GPU batch size of 8, resulting in a total batch size of 512

Incredible compute was used by OpenAI - it will be tough to optimize enough to get anywhere near OpenAI's results...

we created a dataset of a similar scale to JFT-300M by collecting 250 million text-image pairs from the internet.

Also, the dataset collected is insane - 250 million!

This dataset incorporates Conceptual Captions, the text-image pairs from Wikipedia, and a filtered subset of YFCC100M.

Wikipedia might be another solid source for text-image pairs.

Also, we might need to establish a better filter that we all use for training (see the sketch at the end of this comment):

These filters include discarding instances whose captions are too short, are classified as non-English by the Python package cld3, or that consist primarily of boilerplate phrases such as “photographed on <date>”, where <date> matches various formats for dates that we found in the data.

And finally:

We also discard instances whose images have aspect ratios not in [1/2, 2]. If we were to use very tall or wide images, then the square crops used during training would likely exclude objects mentioned in the caption.

This might also be important, as I've seen a lot of images with different aspect ratios.

On the other hand, we might have a better / faster transformer with 1024 VQGAN, which might speed up things a little bit.
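
On the filtering point above, a rough sketch of what that could look like (assumptions on my part: pycld3 for the language check, PIL for the aspect-ratio check, and the thresholds are guesses rather than OpenAI's exact values):

import cld3            # pip install pycld3
from PIL import Image

def keep_pair(image_path, caption, min_words = 2):
    # drop captions that are too short
    if len(caption.split()) < min_words:
        return False
    # drop captions that aren't confidently English
    pred = cld3.get_language(caption)
    if pred is None or pred.language != 'en' or not pred.is_reliable:
        return False
    # drop aspect ratios outside [1/2, 2], since the square crop would lose objects
    with Image.open(image_path) as im:
        width, height = im.size
    return 0.5 <= width / height <= 2.0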

Jinglei5 commented 3 years ago

@Jinglei5 also currently in the (very lengthy) process of converting all of these to 256px jpegs so I can actually move them around a bit. Do you have an existing workflow for that? Right now I'm just using imagemagick convert in a for loop.

Sorry, I don't have the workflow. I just sampled 10,000 of them to feed the model directly for a trial right now. ><

afiaka87 commented 3 years ago

On the other hand, we might have a better / faster transformer with 1024 VQGAN, which might speed up things a little bit.

@robvanvolt Here are some early results from training on that dataset, by the way. I think we should definitely clean it up with the info from OpenAI. https://wandb.ai/afiaka87/OpenImagesV6/reports/dalle-pytorch-OpenImagesV6-With-Localized-Annotations---Vmlldzo1MzgyMTU

After about 15k iterations, I stopped training, added the COCO2018 dataset, and resumed from there for another ~6k steps. https://wandb.ai/afiaka87/OpenImagesV6/reports/OpenImagesV6-COCO--Vmlldzo1MzgyNTI

@lucidrains @Jinglei5

afiaka87 commented 3 years ago

I'll probably make another post once I'm finished training. I think I'm ultimately gonna go with a combination of all three datasets I've accrued so far: COCO2018, OpenImagesV6, and the ~1 million images from the OpenAI blog post. The size of OpenAI's dataset is definitely discouraging though.

@robvanvolt I'm assuming there's a relatively easy way to get captioned images from wikipedia, no? That's what I'm after next.

afiaka87 commented 3 years ago

@Jinglei5 also currently in the (very lengthy) process of converting all of these to 256px jpegs so I can actually move them around a bit. Do you have an existing workflow for that? Right now I'm just using imagemagick convert in a for loop.

Sorry, I don't have the workflow. I just sampled 10,000 of them to feed the model directly for a trial right now. ><

Ha, I do that as well. It's insane to me how many things just straight up break when you're dealing with lots of files.

It's all good though, I managed to figure it out:

find . -type f -name "*.jpg" | parallel mogrify -resize 256x {}
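
The mogrify line above only fixes the width at 256; if you want square 256x256 crops like the blog-post images, a PIL version might look something like this (a sketch; the paths and JPEG quality are assumptions):

from pathlib import Path
from PIL import Image, ImageOps

for path in Path('.').rglob('*.jpg'):
    with Image.open(path) as im:
        # resize the short side to 256, then center-crop to 256x256
        square = ImageOps.fit(im.convert('RGB'), (256, 256), Image.LANCZOS)
    square.save(path, quality = 90)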

robvanvolt commented 3 years ago

I'll probably make another post once I'm finished training. I think i'm ultimately gonna go with a combination of all three datasets I've accrued so far: COCO2018, OpenImagesV6 and the ~1 million images from the OpenAI blog post. The size of openai's dataset is definitely discouraging though.

@robvanvolt I'm assuming there's a relatively easy way to get captioned images from wikipedia, no? That's what I'm after next.

Yes, it seems that on the 20th of March, 2021, there might be a solution which fits exactly our needs:

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models. [...] We are hoping to make the WIT dataset available for download by March 20th, 2021. (tentatively).

https://github.com/google-research-datasets/wit

afiaka87 commented 3 years ago

Moving these to discussions.

afiaka87 commented 3 years ago

I'll probably make another post once I'm finished training. I think i'm ultimately gonna go with a combination of all three datasets I've accrued so far: COCO2018, OpenImagesV6 and the ~1 million images from the OpenAI blog post. The size of openai's dataset is definitely discouraging though. @robvanvolt I'm assuming there's a relatively easy way to get captioned images from wikipedia, no? That's what I'm after next.

Yes, it seems that on the 20th of March, 2021, there might be a solution which fits exactly our needs:

Wikipedia-based Image Text (WIT) Dataset is a large multimodal multilingual dataset. WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models. [...] We are hoping to make the WIT dataset available for download by March 20th, 2021. (tentatively).

https://github.com/google-research-datasets/wit

funny how fast things change, eh?