lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch

Large scale training - input and help welcome #207

Open jacobwjs opened 1 year ago

jacobwjs commented 1 year ago

Not an issue per se, but I'm looking for feedback from anyone who has done large-scale training (100M+ text/image pairs) with multiple Unets (e.g. 3 stages leading to 1k x 1k resolution), and/or is interested in contributing to such an effort.

I'd like to organize a large-scale training of Imagen that leads to open-sourced weights. Before that happens I'm hoping to get input from the community here. Questions to answer:

  1. What is the optimal configuration at this point for cascaded, conditional, super-resolution Imagen (with supporting data)?
  2. What safe, aesthetic, open-source dataset(s) should be used to enable the broadest use cases post-training?
  3. What current blockers need to be addressed before launching large-scale training?

All input is welcome!

xiankgx commented 1 year ago

I would suggest taking the latent diffusion approach. Everything here still works; you just need to rescale the latent values, preferably to match image-value statistics. In image space I needed to train for close to a week to achieve what latent space can in one day. And there is no need for multi-stage training.
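
For concreteness, a minimal sketch of that rescaling, assuming a pre-trained KL autoencoder from the diffusers library (the VAE checkpoint name and the 0.18215 scale factor, which Stable Diffusion uses to bring latents to roughly unit variance, are assumptions here, not something imagen-pytorch prescribes):

```python
import torch
from diffusers import AutoencoderKL

# hypothetical choice of pre-trained VAE; any KL autoencoder would do
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

@torch.no_grad()
def images_to_latents(images):  # images: (b, 3, h, w) in [-1, 1]
    # encode to (b, 4, h/8, w/8) latents and rescale so their statistics
    # roughly match image-value statistics before training on them
    latents = vae.encode(images).latent_dist.sample()
    return latents * 0.18215

@torch.no_grad()
def latents_to_images(latents):
    # invert the rescaling before decoding back to pixel space
    return vae.decode(latents / 0.18215).sample
```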

jacobwjs commented 1 year ago

> I would suggest taking the latent diffusion approach. Everything here still works; you just need to rescale the latent values, preferably to match image-value statistics. In image space I needed to train for close to a week to achieve what latent space can in one day. And there is no need for multi-stage training.

In that case why even use imagen-pytorch?

xiankgx commented 1 year ago

Here are some of my thoughts.

  1. We all want to train the most advanced models, and my feeling is that OpenAI has the most advanced model (DALL-E 2), but the only open-source implementation we have is by the same author here. You can try to piece various things together from OpenAI repos, but then you are on your own. There are multiple models in DALL-E 2: the diffusion prior, the decoder first stage, and the decoder upsamplers. If one part doesn't work, nothing works.

  2. There are fewer moving parts in Imagen, the conditioning model is fixed (a frozen pre-trained text encoder), and there is a better chance of success.

  3. You get a lot of support from the author here, which cannot be said about some other implementations.

jacobwjs commented 1 year ago

Great points. My thoughts are below.

> Here are some of my thoughts.
>
> 1. We all want to train the most advanced models, and my feeling is that OpenAI has the most advanced model (DALL-E 2), but the only open-source implementation we have is by the same author here. You can try to piece various things together from OpenAI repos, but then you are on your own. There are multiple models in DALL-E 2: the diffusion prior, the decoder first stage, and the decoder upsamplers. If one part doesn't work, nothing works.

I'm not sure we can consider DALL-E 2 SOTA anymore given the results from Imagen. And as you mention, that's the reason I haven't joined in on the fun with Phil's (lucidrains) dalle2-pytorch work. Imagen showed us a better, simpler, and interestingly scalable way through LLMs (large language models) and plain Unets.

> 2. There are fewer moving parts in Imagen, the conditioning model is fixed (a frozen pre-trained text encoder), and there is a better chance of success.

Agreed, which is why I don't want to start baking in latent-based diffusion models before exploring the pre-trained LLMs and all they have to offer.

> 3. You get a lot of support from the author here, which cannot be said about some other implementations.

Absolutely. Phil might be the best independent contributor to the open-source AI community at this point (yes, there are many others). This is one major reason I want to try to bring imagen-pytorch to life.

Thanks for your thoughts!

jacobwjs commented 1 year ago

@lucidrains Hi Phil, have you heard anything from the community about getting imagen-pytorch trained on a large-scale dataset?

Also, I had to step away for a bit, but I'm just now catching up with all the updates. I see DDP being dropped in. Are you moving away from HF accelerate?

And something I need to think about for planning out future training: any thoughts on adding TPU support?

lucidrains commented 1 year ago

@jacobwjs why not work with the folks at Laion? Is stable diffusion not enough of a success to inspire you to work with them?

Nodja commented 1 year ago

Here are some thoughts.

I don't think anyone knows what the best hyperparameters are, especially at large scale. My intuition tells me that you're going to have to do a bunch of runs to figure things out, especially in terms of final quality. I've noticed that model quality quickly improves the more varied the dataset is, at least for small Unet dims, so if you're going to try out runs, I recommend you use a dataset that is at least 10M samples large.

If your goal is to open-source the weights, there's also the question of your target audience, i.e. do you want people to run it locally on their 8GB GPUs, or is it more for Colabs and such, where you have a 16GB GPU? This will set a limit on what the parameters should be. I'd say target 16GB inference with FP32 weights, so that people with 8GB can run the FP16 version.
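
A quick back-of-the-envelope for those targets (a rough estimate only; real inference also needs activations, the text encoder, and sampler buffers on top of the weights):

```python
def weight_memory_gb(num_params, bytes_per_param):
    # memory for the weights alone, ignoring activations and buffers
    return num_params * bytes_per_param / 1024**3

# e.g. a hypothetical 2B-parameter cascade:
print(weight_memory_gb(2e9, 4))  # FP32 -> ~7.5 GB, plausible on a 16GB card
print(weight_memory_gb(2e9, 2))  # FP16 -> ~3.7 GB, within reach of 8GB cards
```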

There's also the big unknown of the text encoder. Google said that T5-XXL is the best performing and the most impactful on image quality; what we don't know is why. It could simply be that T5-XXL generates bigger embeddings (4096) that can better represent the ideas in a sentence. If that's true, maybe a T5-Large with d_model set to 4096 would perform just as well, or nearly so, for the purpose of image generation. If the goal is to have this running on consumer GPUs, T5-XXL is obviously out of the question, but maybe we can get 95% of the way there with a custom encoder. Perhaps using the embeddings of a decoder-only model like GPT would perform better. I've been using SGPT on my runs and it seems to condition properly as far as one feature goes, but my models are too small/undertrained for me to know whether they'd be better or worse than T5. To me it seems the text encoder is the key to solving all the issues these models have with prompts; even DALL-E 2 does not follow prompts properly and simply ignores certain aspects. But perhaps I'm jumping the gun a little here, and we should keep it simple for now.
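
As a concrete starting point for swapping encoders, here is a minimal sketch of pulling per-token embeddings out of a HuggingFace T5 encoder (the model name and truncation settings are illustrative); any encoder that yields a (batch, seq, dim) tensor could be tried the same way, since imagen-pytorch accepts pre-computed embeddings:

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

name = "google/t5-v1_1-large"  # illustrative; swap in whatever encoder is under test
tokenizer = T5Tokenizer.from_pretrained(name)
encoder = T5EncoderModel.from_pretrained(name).eval()

@torch.no_grad()
def encode_text(texts):
    # pad/truncate so a batch of prompts shares one tensor shape
    tokens = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = encoder(input_ids=tokens.input_ids, attention_mask=tokens.attention_mask)
    # (batch, seq_len, d_model) per-token embeddings, plus the mask so
    # cross-attention can ignore padding positions
    return out.last_hidden_state, tokens.attention_mask.bool()
```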

I think a good approach would be to just start with the hyperparameters in the paper, using elucidated instead of normal diffusion, and not using T5-XXL but something more modest like T5-Large or T5-XL. Then from there, check the results and improve things iteratively, e.g. with the recently added self-conditioning, among the many other things lucid has added to the repo.
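
That starting point might look something like this (a sketch assuming the current imagen-pytorch API; the Unet dims are placeholders and the sampler settings mirror the repo's ElucidatedImagen defaults rather than tuned values):

```python
from imagen_pytorch import Unet, ElucidatedImagen

# 64px base Unet and a 64 -> 256 super-resolution Unet; dims are placeholders
unet1 = Unet(dim=512, cond_dim=512, dim_mults=(1, 2, 3, 4),
             num_resnet_blocks=3,
             layer_attns=(False, True, True, True),
             layer_cross_attns=(False, True, True, True))

unet2 = Unet(dim=128, cond_dim=512, dim_mults=(1, 2, 4, 8),
             num_resnet_blocks=(2, 4, 8, 8),
             layer_attns=(False, False, False, True),
             layer_cross_attns=(False, False, False, True))

imagen = ElucidatedImagen(
    unets=(unet1, unet2),
    image_sizes=(64, 256),
    text_encoder_name='google/t5-v1_1-large',  # more modest than t5-xxl
    cond_drop_prob=0.1,                        # for classifier-free guidance
    num_sample_steps=(64, 32),                 # sampler steps per unet
    sigma_min=0.002, sigma_max=(80, 160),      # elucidated (Karras) noise schedule
    sigma_data=0.5, rho=7, P_mean=-1.2, P_std=1.2,
    S_churn=80, S_tmin=0.05, S_tmax=50, S_noise=1.003,
)
```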

lucidrains commented 1 year ago

> @lucidrains Hi Phil, have you heard anything from the community about getting imagen-pytorch trained on a large-scale dataset?
>
> Also, I had to step away for a bit, but I'm just now catching up with all the updates. I see DDP being dropped in. Are you moving away from HF accelerate?
>
> And something I need to think about for planning out future training: any thoughts on adding TPU support?

where do you see DDP being dropped?

for TPU support, the best way would be to write a jax port. Pytorch XLA isn't that great yet

@nousr and @Veldrovive have already scaled up dalle2-pytorch to 800 GPUs on Stability AI resources, so huggingface accelerate will work fine. we will stick with that framework, through thick and thin
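
For reference, here is roughly how that looks in practice with the repo's ImagenTrainer, which wraps everything in huggingface accelerate under the hood (a sketch of the documented pattern; the dataloader and script name are hypothetical):

```python
# train.py -- minimal per-unet training loop with ImagenTrainer
from imagen_pytorch import ImagenTrainer

trainer = ImagenTrainer(imagen)  # `imagen` configured as sketched above

for images, texts in dataloader:  # hypothetical dataloader of (image, caption) pairs
    # the trainer splits the batch internally if it exceeds max_batch_size
    loss = trainer(images, texts=texts, unet_number=1, max_batch_size=4)
    trainer.update(unet_number=1)
```

Launched across GPUs or machines with `accelerate config` followed by `accelerate launch train.py`.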

jacobwjs commented 1 year ago

> @jacobwjs why not work with the folks at Laion? Is stable diffusion not enough of a success to inspire you to work with them?

Happy to contribute, but I don't see this effort as misaligned with theirs. I also don't think more of the same, and more concentration into a single effort, is what the world needs. The "why" and the mission are lacking for me, and I'd like to pick up work on the pieces I think are missing. Safety shouldn't be an afterthought. We should have more creative alignment. Bias. The list goes on and on for me. There are some deep questions left to answer that I'd like to focus on :)

jacobwjs commented 1 year ago

> Here are some thoughts.
>
> I don't think anyone knows what the best hyperparameters are, especially at large scale. My intuition tells me that you're going to have to do a bunch of runs to figure things out, especially in terms of final quality. I've noticed that model quality quickly improves the more varied the dataset is, at least for small Unet dims, so if you're going to try out runs, I recommend you use a dataset that is at least 10M samples large.

Yes, we definitely need to do some informed, rigorous discovery on the best params for a large model. There was some work in the past by myself and a few others here, but things have changed considerably. Phil's speed is simultaneously a blessing and a curse hahah. I agree a well-rounded and sufficiently large dataset is key for the first real attempts. I think there is still some work to ensure convergence even before that happens though. It would be great to get a list going, for each stage of development, that justifies the hparams before kicking off the final go at it.

> If your goal is to open-source the weights, there's also the question of your target audience, i.e. do you want people to run it locally on their 8GB GPUs, or is it more for Colabs and such, where you have a 16GB GPU? This will set a limit on what the parameters should be. I'd say target 16GB inference with FP32 weights, so that people with 8GB can run the FP16 version.

Agreed, and definitely something that needs to be balanced. I feel it would be a bad choice to target the lowest-end commodity hardware, especially given that free compute with 16GB VRAM exists (Colab, AWS Studio), and paid options for 24GB VRAM are quite cheap as of today (a little over $1 USD per hour) and will only get cheaper.

> There's also the big unknown of the text encoder. Google said that T5-XXL is the best performing and the most impactful on image quality; what we don't know is why. It could simply be that T5-XXL generates bigger embeddings (4096) that can better represent the ideas in a sentence. If that's true, maybe a T5-Large with d_model set to 4096 would perform just as well, or nearly so, for the purpose of image generation. If the goal is to have this running on consumer GPUs, T5-XXL is obviously out of the question, but maybe we can get 95% of the way there with a custom encoder. Perhaps using the embeddings of a decoder-only model like GPT would perform better. I've been using SGPT on my runs and it seems to condition properly as far as one feature goes, but my models are too small/undertrained for me to know whether they'd be better or worse than T5. To me it seems the text encoder is the key to solving all the issues these models have with prompts; even DALL-E 2 does not follow prompts properly and simply ignores certain aspects. But perhaps I'm jumping the gun a little here, and we should keep it simple for now.

We've had some basic discussion on the LM before, somewhere in a thread here. This is a big reason why I feel Imagen is such an interesting model, and I think it will have the most impact on closing the gap between what we describe and what is generated (what I'm calling creative alignment). I'd be interested in exploring other options besides T5 and its derivatives; there's nothing special about it other than that Google made it, and made multiple versions with varying capacity. A GPT derivative, as you mention, would be interesting to test. Whatever the choice, it will need to be balanced against broadly available VRAM (I think 16GB should be the driving upper limit for inference).
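
Conveniently, whichever encoder wins out, imagen-pytorch already decouples conditioning from T5: it can take pre-computed embeddings directly. A sketch, assuming the `text_embeds`/`text_masks` keyword arguments as in the repo's README and illustrative shapes (the embedding dim must match what the Unets were configured for):

```python
import torch

# per-token embeddings from any candidate encoder, e.g. the encode_text()
# sketch above: (batch, seq_len, embed_dim), plus a padding mask
text_embeds = torch.randn(4, 256, 1024)
text_masks = torch.ones(4, 256).bool()
images = torch.randn(4, 3, 256, 256)

# training step conditioned on the pre-computed embeddings
loss = imagen(images, text_embeds=text_embeds, text_masks=text_masks, unet_number=1)
loss.backward()

# sampling conditioned on the same embeddings
samples = imagen.sample(text_embeds=text_embeds, text_masks=text_masks, cond_scale=3.)
```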

> I think a good approach would be to just start with the hyperparameters in the paper, using elucidated instead of normal diffusion, and not using T5-XXL but something more modest like T5-Large or T5-XL. Then from there, check the results and improve things iteratively, e.g. with the recently added self-conditioning, among the many other things lucid has added to the repo.

Great first steps. I'll piece together some next steps and update here accordingly.

jacobwjs commented 1 year ago

> @lucidrains Hi Phil, have you heard anything from the community about getting imagen-pytorch trained on a large-scale dataset? Also, I had to step away for a bit, but I'm just now catching up with all the updates. I see DDP being dropped in. Are you moving away from HF accelerate? And something I need to think about for planning out future training: any thoughts on adding TPU support?
>
> where do you see DDP being dropped?
>
> for TPU support, the best way would be to write a jax port. Pytorch XLA isn't that great yet
>
> @nousr and @Veldrovive have already scaled up dalle2-pytorch to 800 GPUs on Stability AI resources, so huggingface accelerate will work fine. we will stick with that framework, through thick and thin

Sorry, I just briefly scanned some of the latest commits and saw explicit DDP support being added in? Feel free to ignore; I haven't really pored over the latest updates.

Great to hear everything is scaling! On the idea of porting over to JAX, I will avoid that at all costs, just from a maintenance standpoint. I'm hoping that perhaps HF accelerate will make the jump over to TPU possible. Not really sure where that sits on their roadmap though. Perhaps I'll look into it next week if I find some time.

Mut1nyJD commented 1 year ago

I think with latent diffusion you always run into the problem of the codebook, which is inherently limited, so there will be a ceiling on quality, since the codebook can only hold so much visual information. I also don't think a learned quantized encoder is actually the best encoder for images. Google's approach of looking into the DCT is valid, as the DCT has been shown to be a very, very successful encoder/decoder in the image/video space, and you don't need that extra training step, which you would also have to run on a significant dataset to get a broad codebook.

I have a hunch that Imagen is actually superior to latent diffusion as used in Stable Diffusion, and definitely superior to DALL-E 2.
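
To make the DCT point concrete, here is a minimal sketch of the kind of fixed, training-free transform being described, using a blockwise 2D DCT from scipy (purely illustrative; Google's DCT-based work is more involved than this):

```python
import numpy as np
from scipy.fft import dctn, idctn

def block_dct(image, block=8):
    # split an (h, w) image into block x block tiles and apply an orthonormal
    # 2D DCT to each -- no codebook and no learned parameters involved
    h, w = image.shape
    tiles = image.reshape(h // block, block, w // block, block).swapaxes(1, 2)
    return dctn(tiles, axes=(-2, -1), norm="ortho")

def block_idct(coeffs, block=8):
    # the inverse transform reconstructs the image exactly (up to float error),
    # unlike a quantized codebook, which caps how much detail can survive
    h, w = coeffs.shape[0] * block, coeffs.shape[1] * block
    tiles = idctn(coeffs, axes=(-2, -1), norm="ortho")
    return tiles.swapaxes(1, 2).reshape(h, w)

img = np.random.rand(256, 256)
assert np.allclose(img, block_idct(block_dct(img)))
```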

lucidrains commented 1 year ago

@Mut1nyJD yea, i'll chip away at https://github.com/lucidrains/transframer-pytorch and hopefully get some nice reusable functions and modules for DCT so we can explore further there

xiankgx commented 1 year ago

> I think with latent diffusion you always run into the problem of the codebook, which is inherently limited, so there will be a ceiling on quality, since the codebook can only hold so much visual information. I also don't think a learned quantized encoder is actually the best encoder for images. Google's approach of looking into the DCT is valid, as the DCT has been shown to be a very, very successful encoder/decoder in the image/video space, and you don't need that extra training step, which you would also have to run on a significant dataset to get a broad codebook.
>
> I have a hunch that Imagen is actually superior to latent diffusion as used in Stable Diffusion, and definitely superior to DALL-E 2.

You have a good point with the DCT. But latent diffusion is a broad concept, not necessarily just VQ-VAE or VQGAN. It could just be a standard VAE, as in Stable Diffusion right now. Or it could be an autoencoder with DCT layers.