lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License

Distributed Training of the Prior #103

Closed nousr closed 2 years ago

nousr commented 2 years ago

Initial training went well for the text-conditioned prior! 🚀

With the current text-conditioned script, it is possible to process ~32M data points in ~24 hours with ViT-L/14.

For reference, this translates to about 100k steps on a 40GB A100 GPU with a batch size of 320.
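
(Sanity check on those numbers: 100,000 steps × 320 samples per step = 32,000,000 samples, which matches the ~32M figure above.)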

The diffusion-prior results from a 24-hour training run look really good, and I think it's time to scale up so that we can make a reasonable dent in the LAION-2B dataset ⛰️

[sample generation: "greg rutkowski mountains"]


(@lucidrains I know you're busy this week, and in general 😄, so feel free to get to this when you can--I just wanted to get the ball rolling)

I spent a little bit of time looking at this last night, but I am by no means an expert on the subject, so here are my biggest questions:

@rom1504 Is there anything we should keep in mind when working with EmbeddingReader here?

rom1504 commented 2 years ago

Regarding the embedding reader, I suggest using a single instance per node (since parallelism is already implemented inside it), and doing the sampling with the start and end parameters.
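
A minimal sketch of what that per-node sharding could look like, assuming the embedding_reader API (an EmbeddingReader instance called with batch_size, start, and end); the folder path and node-rank plumbing are placeholders:

```python
# One EmbeddingReader per node; each node reads its own [start, end) slice.
# embedding_reader parallelizes file reads internally, so a single instance per node is enough.
from embedding_reader import EmbeddingReader

node_rank, num_nodes = 0, 4  # placeholders; in practice read these from the launcher's env vars

reader = EmbeddingReader(
    embeddings_folder="s3://path/to/text_embeddings",  # placeholder path
    file_format="npy",
)

shard_size = reader.count // num_nodes
start = node_rank * shard_size
end = reader.count if node_rank == num_nodes - 1 else start + shard_size

for embeddings, meta in reader(batch_size=10**5, start=start, end=end):
    ...  # feed this node's shard into its local training step
```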

rom1504 commented 2 years ago

Regarding the distributed training framework, you'll find that using DDP or DeepSpeed is the same level of difficulty, and neither is very hard. DeepSpeed has more features; DDP used to be limited, but I think it's better with the latest version.

My advice is to do something like in dalle-pytorch: implement it from the start in a distribution-framework-agnostic way. It'll save headaches in the future if we want to switch. In practice it's just about defining an interface with the methods we want to use and then having several implementations (see the sketch below). Worth a try at least.
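
For concreteness, a minimal sketch of what such a backend-agnostic interface could look like; the class and method names (DistributedBackend, DDPBackend, etc.) are hypothetical and not taken from either repo, and a DeepSpeed or Accelerate implementation would slot in behind the same interface:

```python
# Hypothetical sketch of a framework-agnostic training backend.
# The training loop only talks to DistributedBackend, so another backend
# can be swapped in without touching the loop itself.
from abc import ABC, abstractmethod

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class DistributedBackend(ABC):
    @abstractmethod
    def setup(self) -> None:
        """Initialize the process group / runtime."""

    @abstractmethod
    def wrap(self, model, optimizer):
        """Return (model, optimizer) prepared for distributed training."""

    @abstractmethod
    def step(self, loss, model, optimizer) -> None:
        """Backward pass plus optimizer step, however the backend does it."""

    @abstractmethod
    def rank(self) -> int:
        """Global rank of this process."""


class DDPBackend(DistributedBackend):
    def setup(self) -> None:
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    def wrap(self, model, optimizer):
        model = DDP(model.cuda(), device_ids=[torch.cuda.current_device()])
        return model, optimizer

    def step(self, loss, model, optimizer) -> None:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    def rank(self) -> int:
        return dist.get_rank()
```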

nousr commented 2 years ago

I was able to get a minimal proof of concept working on 3 GPUs tonight with DDP and the base DiffusionPrior class, just to get a feel for things.
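
Roughly what that proof of concept looks like, as a sketch: the DiffusionPrior construction and the dataloader of precomputed (text_embed, image_embed) pairs are left as placeholders, and the forward signature with precomputed embeddings is an assumption, so check the repo for the exact interface.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_prior_ddp(diffusion_prior, dataloader):
    # diffusion_prior: an instance of the repo's DiffusionPrior (construction omitted)
    # dataloader: yields this rank's shard of precomputed (text_embed, image_embed) pairs
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    diffusion_prior = DDP(diffusion_prior.cuda(), device_ids=[local_rank])
    optimizer = torch.optim.Adam(diffusion_prior.parameters(), lr=1e-4)

    for text_embed, image_embed in dataloader:
        # forward with precomputed embeddings; signature is an assumption
        loss = diffusion_prior(text_embed=text_embed.cuda(), image_embed=image_embed.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# launch with e.g.: torchrun --nproc_per_node=3 train_prior_ddp.py
```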

I will start to build up the framework (borrowing when I can) as I verify things work as expected--for example,

jacobwjs commented 2 years ago

Why not use one of the many great frameworks? At this point not using one of them is akin to using one, albeit with much more involved boilerplate.

lucidrains commented 2 years ago

may be a good opportunity to see if https://github.com/huggingface/accelerate could work off the bat (before stepping up to deepspeed or fairscale). generally Sylvain builds great things
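
If that route is taken, a rough sketch of what the prior training loop could look like under Accelerate; model, optimizer, and dataloader are placeholders for whatever the prior script already builds, and the (text_embed, image_embed) call signature is an assumption:

```python
from accelerate import Accelerator

def train_prior(model, optimizer, dataloader):
    # model / optimizer / dataloader come from the existing prior training script
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for text_embed, image_embed in dataloader:
        loss = model(text_embed=text_embed, image_embed=image_embed)
        accelerator.backward(loss)   # replaces loss.backward(); handles fp16 / multi-GPU
        optimizer.step()
        optimizer.zero_grad()

# launched with: accelerate launch train_diffusion_prior.py
```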

lucidrains commented 2 years ago

@nousr https://github.com/lucidrains/DALLE2-pytorch/blob/main/train_diffusion_prior.py#L75 for training the diffusion prior without CLIP with precomputed text embeddings, did you end up saving the text mask somewhere too? (or does training work fine without it?)

nousr commented 2 years ago

> @nousr https://github.com/lucidrains/DALLE2-pytorch/blob/main/train_diffusion_prior.py#L75 for training the diffusion prior without CLIP with precomputed text embeddings, did you end up saving the text mask somewhere too? (or does training work fine without it?)

Training seems to work fine without it. All of those early runs from Krish were done with that, so you could probably dig through the other threads to find those WANDB reports if you want to compare!
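
For reference, the mask in question is just a padding mask over the tokenized captions, so if it is ever needed later it can be reconstructed from the tokens rather than saved alongside the embeddings. A rough sketch, where pad_id = 0 is an assumption and should be whatever the tokenizer actually pads with:

```python
import torch

def padding_mask(text_tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # True for real tokens, False for padding; pad_id = 0 is an assumption
    return text_tokens != pad_id
```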

lucidrains commented 2 years ago

@nousr good to know! *jots down anecdata* i've also noticed attention nets just figuring their way around a lack of masks, so it's not critical at all (even if it is standard practice to mask out padding tokens) :)