lucidrains / DALLE2-pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch
MIT License

Distributed Training of the Prior #103

Closed nousr closed 2 years ago

nousr commented 2 years ago

Initial training went well for the text-conditioned prior! 🚀

With the current text-conditioned script, it is possible to process ~32M data points in ~24 hours with ViT-L/14.

For reference, this translates to about 100k steps on a 40GB A100 GPU with a batch size of 320.
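
(Sanity check on those numbers: 100,000 steps × 320 samples per step = 32,000,000 samples, which matches the ~32M figure above.)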

The diffusion-prior results from a 24-hour training run look really good, and I think it's time to scale up so that we can make a reasonable dent in the LAION-2B dataset ⛰️

[sample generation: "greg rutkowski mountains"]


(@lucidrains I know you're busy this week, and in general 😄, so feel free to get to this when you can--I just wanted to get the ball rolling)

I spent a little bit of time looking at this last night, but I am by no means an expert on the subject, so here are my biggest questions:

@rom1504 Is there anything we should keep in mind when working with EmbeddingReader here?

rom1504 commented 2 years ago

Regarding the embedding reader, I suggest using a single instance per node (since parallelism is already implemented inside it), and doing the sampling with the start and end parameters.
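
A minimal sketch of what that per-node sharding could look like, assuming the embedding_reader API (an EmbeddingReader instance called with batch_size, start, and end); the folder path and node-rank plumbing are placeholders:

```python
# One EmbeddingReader per node; each node reads its own [start, end) slice.
# embedding_reader parallelizes file reads internally, so a single instance per node is enough.
from embedding_reader import EmbeddingReader

node_rank, num_nodes = 0, 4  # placeholders; in practice read these from the launcher's env vars

reader = EmbeddingReader(
    embeddings_folder="s3://path/to/text_embeddings",  # placeholder path
    file_format="npy",
)

shard_size = reader.count // num_nodes
start = node_rank * shard_size
end = reader.count if node_rank == num_nodes - 1 else start + shard_size

for embeddings, meta in reader(batch_size=10**5, start=start, end=end):
    ...  # feed this node's shard into its local training step
```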

rom1504 commented 2 years ago

Regarding the distributed training framework, you'll find that using DDP or DeepSpeed is the same level of difficulty, and neither is very hard. DeepSpeed has more features; DDP used to be limited, but I think it's better with the latest version.

My advice is to do something like in dalle-pytorch: implement it from the start in a distribution-framework-agnostic way. It'll save headaches in the future if we want to switch. In practice it's just about defining an interface with the methods we want to use and then having several implementations (see the sketch below). Worth a try at least.
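
For concreteness, a minimal sketch of what such a backend-agnostic interface could look like; the class and method names (DistributedBackend, DDPBackend, etc.) are hypothetical and not taken from either repo, and a DeepSpeed or Accelerate implementation would slot in behind the same interface:

```python
# Hypothetical sketch of a framework-agnostic training backend.
# The training loop only talks to DistributedBackend, so another backend
# can be swapped in without touching the loop itself.
from abc import ABC, abstractmethod

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class DistributedBackend(ABC):
    @abstractmethod
    def setup(self) -> None:
        """Initialize the process group / runtime."""

    @abstractmethod
    def wrap(self, model, optimizer):
        """Return (model, optimizer) prepared for distributed training."""

    @abstractmethod
    def step(self, loss, model, optimizer) -> None:
        """Backward pass plus optimizer step, however the backend does it."""

    @abstractmethod
    def rank(self) -> int:
        """Global rank of this process."""


class DDPBackend(DistributedBackend):
    def setup(self) -> None:
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    def wrap(self, model, optimizer):
        model = DDP(model.cuda(), device_ids=[torch.cuda.current_device()])
        return model, optimizer

    def step(self, loss, model, optimizer) -> None:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    def rank(self) -> int:
        return dist.get_rank()
```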

nousr commented 2 years ago

I was able to get a minimal proof of concept working on 3 GPUs tonight with DDP and the base DiffusionPrior class, just to get a feel for things.
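
Roughly what that proof of concept looks like, as a sketch: the DiffusionPrior construction and the dataloader of precomputed (text_embed, image_embed) pairs are left as placeholders, and the forward signature with precomputed embeddings is an assumption, so check the repo for the exact interface.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_prior_ddp(diffusion_prior, dataloader):
    # diffusion_prior: an instance of the repo's DiffusionPrior (construction omitted)
    # dataloader: yields this rank's shard of precomputed (text_embed, image_embed) pairs
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    diffusion_prior = DDP(diffusion_prior.cuda(), device_ids=[local_rank])
    optimizer = torch.optim.Adam(diffusion_prior.parameters(), lr=1e-4)

    for text_embed, image_embed in dataloader:
        # forward with precomputed embeddings; signature is an assumption
        loss = diffusion_prior(text_embed=text_embed.cuda(), image_embed=image_embed.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# launch with e.g.: torchrun --nproc_per_node=3 train_prior_ddp.py
```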

I will start to build up the framework (borrowing when I can) as I verify things work as expected--for example,

jacobwjs commented 2 years ago

Why not use one of the many great frameworks? At this point not using one of them is akin to using one, albeit with much more involved boilerplate.

lucidrains commented 2 years ago

may be a good opportunity to see if https://github.com/huggingface/accelerate could work off the bat (before stepping up to deepspeed or fairscale). generally Sylvain builds great things
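
If that route is taken, a rough sketch of what the prior training loop could look like under Accelerate; model, optimizer, and dataloader are placeholders for whatever the prior script already builds, and the (text_embed, image_embed) call signature is an assumption:

```python
from accelerate import Accelerator

def train_prior(model, optimizer, dataloader):
    # model / optimizer / dataloader come from the existing prior training script
    accelerator = Accelerator()
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    for text_embed, image_embed in dataloader:
        loss = model(text_embed=text_embed, image_embed=image_embed)
        accelerator.backward(loss)   # replaces loss.backward(); handles fp16 / multi-GPU
        optimizer.step()
        optimizer.zero_grad()

# launched with: accelerate launch train_diffusion_prior.py
```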

lucidrains commented 2 years ago

@nousr https://github.com/lucidrains/DALLE2-pytorch/blob/main/train_diffusion_prior.py#L75 for training the diffusion prior without CLIP with precomputed text embeddings, did you end up saving the text mask somewhere too? (or does training work fine without it?)

nousr commented 2 years ago

> @nousr https://github.com/lucidrains/DALLE2-pytorch/blob/main/train_diffusion_prior.py#L75 for training the diffusion prior without CLIP with precomputed text embeddings, did you end up saving the text mask somewhere too? (or does training work fine without it?)

Training seems to work fine without it. All of those early runs from Krish were done with that, so you could probably dig through the other threads to find those WANDB reports if you want to compare!
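
For reference, the mask in question is just a padding mask over the tokenized captions, so if it is ever needed later it can be reconstructed from the tokens rather than saved alongside the embeddings. A rough sketch, where pad_id = 0 is an assumption and should be whatever the tokenizer actually pads with:

```python
import torch

def padding_mask(text_tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    # True for real tokens, False for padding; pad_id = 0 is an assumption
    return text_tokens != pad_id
```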

lucidrains commented 2 years ago

@nousr good to know! *jots down anecdata* i've also noticed attention nets just figuring their way around a lack of masks, so it's not critical at all (even if it is standard practice to mask out padding tokens) :)