lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch

Choice of T5-Large #2

Closed: jorgemcgomes closed this issue 2 years ago

jorgemcgomes commented 2 years ago

I was reading the paper, and in Figure A.6 (p. 23) they compare many different text encoder models (figure reproduced from https://arxiv.org/pdf/2205.11487.pdf).

Although T5-Base isn't mentioned anywhere else in the paper, according to this data it appears to offer performance similar to T5-Large (at least with the 300M-parameter diffusion model they use for this comparison). Given the large gap between Small and Base, and the fact that T5-Base is ~3x smaller than Large while appearing to perform similarly, wouldn't Base be the most sensible starting point?

I've noticed that this repo currently only offers Small and Large as options.

lucidrains commented 2 years ago

@jorgemcgomes oh yes, makes sense! https://github.com/lucidrains/imagen-pytorch/commit/b0fdd3dd0645c916aeed7cc535478d96874932ac

jorgemcgomes commented 2 years ago

Another point regarding T5: the paper likely uses T5 v1.1, not the original T5. Google has been releasing its recent models on the 1.1 architecture and has shown that it performs better than the original, so it wouldn't make much sense for them to go back to the old architecture, I'd say.

And the paper refers to the T5 sizes by the names "XL" and "XXL", which were only introduced with the release of v1.1. Before that, the larger models were called "3B" and "11B", which might be another clue that they're using the new architecture (open one of the links below to read about the differences and the naming change).

In this case, the model paths would be:
https://huggingface.co/google/t5-v1_1-small
https://huggingface.co/google/t5-v1_1-base
https://huggingface.co/google/t5-v1_1-large
https://huggingface.co/google/t5-v1_1-xxl
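These are standard Hugging Face checkpoints, so just the encoder half can be pulled in directly. A minimal sketch (using the transformers API, not code from this repo):

```python
# Sketch (not code from this repo): load only the encoder of a T5 v1.1 checkpoint
# with Hugging Face transformers and encode a caption.
import torch
from transformers import T5Tokenizer, T5EncoderModel

name = "google/t5-v1_1-base"  # any of the paths above
tokenizer = T5Tokenizer.from_pretrained(name)
encoder = T5EncoderModel.from_pretrained(name).eval()

tokens = tokenizer(["a corgi riding a skateboard"], return_tensors="pt",
                   padding=True, truncation=True)

with torch.no_grad():
    text_embeds = encoder(**tokens).last_hidden_state  # (batch, seq_len, d_model)

print(text_embeds.shape)
```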

lucidrains commented 2 years ago

@jorgemcgomes got it! https://github.com/lucidrains/imagen-pytorch/commit/4425cf7aa9ac6e0b88d0f4002653eb16269397ea
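For reference, a rough sketch of how the encoder name might be passed on the user side, based on the README at the time (argument names such as text_encoder_name may have changed since):

```python
# Rough sketch based on the README; exact argument names may differ across versions.
from imagen_pytorch import Unet, Imagen

unet = Unet(
    dim = 32,
    cond_dim = 512,
    dim_mults = (1, 2, 4, 8),
    num_resnet_blocks = 3,
    layer_attns = (False, True, True, True),
    layer_cross_attns = (False, True, True, True)
)

imagen = Imagen(
    unets = unet,
    text_encoder_name = 'google/t5-v1_1-base',  # forwarded to Hugging Face transformers
    image_sizes = 64,
    timesteps = 1000,
    cond_drop_prob = 0.1
)
```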

rom1504 commented 2 years ago

I figure what would actually make sense is to precompute the text encoder embeddings; then at training time you don't need to hold the model in VRAM or run any text encoder forward pass.

If doing that, using XXL makes the most sense.
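A minimal sketch of that idea (assuming the Hugging Face encoder and a plain torch.save cache; this is not the repo's dataloader):

```python
# Sketch: encode all captions once, cache the embeddings and masks on disk, and
# train the diffusion model from the cache so the text encoder never sits in VRAM.
import torch
from transformers import T5Tokenizer, T5EncoderModel

name = "google/t5-v1_1-xxl"  # or a smaller variant
tokenizer = T5Tokenizer.from_pretrained(name)
encoder = T5EncoderModel.from_pretrained(name).eval().cuda()

captions = ["a corgi riding a skateboard", "an oil painting of a lighthouse"]

with torch.no_grad():
    tokens = tokenizer(captions, return_tensors="pt", padding=True, truncation=True).to("cuda")
    embeds = encoder(**tokens).last_hidden_state.cpu()   # (batch, seq_len, d_model)
    mask = tokens["attention_mask"].cpu().bool()          # (batch, seq_len)

torch.save({"embeds": embeds, "mask": mask}, "text_embeds.pt")
# At training time: load text_embeds.pt and condition on (embeds, mask) directly;
# the encoder itself is no longer needed.
```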

jorgemcgomes commented 2 years ago

> I figure what would actually make sense is to precompute the text encoder embeddings; then at training time you don't need to hold the model in VRAM or run any text encoder forward pass.
>
> If doing that, using XXL makes the most sense.

The counter-argument is that you still need to load the T5 model for inference. If the goal is to have a model that anybody can load in Colab or whatever, choosing T5-XXL is going to create trouble. And note that these T5 models (at least the HF implementation) used to have trouble with fp16 precision (https://github.com/huggingface/transformers/issues/9295), so that could be 46 GB of VRAM for T5-XXL.

Edit: actually, only about half that, since only the encoder needs to be loaded. The XL version would be easily doable, I think.
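Rough numbers behind that estimate (back-of-the-envelope, treating T5-XXL as ~11B parameters and the encoder as roughly half of them):

```python
# Back-of-the-envelope memory estimate; the parameter counts are rounded assumptions.
params_xxl = 11e9                         # ~11B parameters for T5-XXL (encoder + decoder)
bytes_per_param_fp32 = 4
full_gb = params_xxl * bytes_per_param_fp32 / 1e9   # ~44 GB for the whole model in fp32
encoder_gb = full_gb / 2                            # ~22 GB if the encoder is about half
print(f"full fp32: {full_gb:.0f} GB, encoder only: {encoder_gb:.0f} GB")
```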

rom1504 commented 2 years ago

True, but I bet it's possible to align a smaller T5 to the large T5 embedding space with minimal performance loss.
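Purely as an illustration of what that could look like (nothing like this exists in the repo): a learned projection from the small encoder's per-token embeddings to the XXL ones, regressed on paired caption encodings.

```python
# Hypothetical sketch: align per-token embeddings of a small T5 encoder with those of
# T5-XXL via a linear projection trained with MSE. Both checkpoints share the same
# SentencePiece tokenizer, so tokens line up one-to-one for a given caption.
import torch
import torch.nn as nn

d_small, d_xxl = 512, 4096          # d_model of t5-v1_1-small vs t5-v1_1-xxl
proj = nn.Linear(d_small, d_xxl)
opt = torch.optim.Adam(proj.parameters(), lr=1e-4)

def train_step(small_embeds, xxl_embeds, mask):
    # small_embeds: (b, n, d_small), xxl_embeds: (b, n, d_xxl), mask: (b, n) bool
    pred = proj(small_embeds)
    loss = ((pred - xxl_embeds) ** 2)[mask].mean()   # MSE over non-padding tokens
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```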

jorgemcgomes commented 2 years ago

> True, but I bet it's possible to align a smaller T5 to the large T5 embedding space with minimal performance loss.

That's a big unknown, isn't it? I've never seen any attempt at that. Given that the XL encoder can be loaded easily, and the performance difference from XXL isn't that large according to Imagen's paper, XL seems like a more practical bet for pre-computed embeddings.

Just tested this: I could load XL in Colab (NVIDIA T4). It required 17 GB of RAM to load and ~6 GB of VRAM. I also took the opportunity to test FP16 via model.half(), and it didn't work -- no NaNs or anything, but the embeddings were way too different. So I think relying on XXL will indeed create many practical issues for inference, but XL seems just fine.
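For anyone who wants to reproduce that kind of check, a sketch along these lines (not the exact code used above; the checkpoint and prompt are just stand-ins):

```python
# Sketch: compare fp32 vs fp16 T5 embeddings for the same prompt.
import torch
from transformers import T5Tokenizer, T5EncoderModel

name = "google/t5-v1_1-large"   # stand-in; the test above used the XL checkpoint
tok = T5Tokenizer.from_pretrained(name)
enc_fp32 = T5EncoderModel.from_pretrained(name).eval().cuda()
enc_fp16 = T5EncoderModel.from_pretrained(name).eval().half().cuda()

tokens = tok("a corgi riding a skateboard", return_tensors="pt").to("cuda")
with torch.no_grad():
    e32 = enc_fp32(**tokens).last_hidden_state.float()
    e16 = enc_fp16(**tokens).last_hidden_state.float()

cos = torch.nn.functional.cosine_similarity(e32, e16, dim=-1)  # per-token similarity
print(cos.min().item(), cos.mean().item())  # values far from 1.0 indicate drift
```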

rom1504 commented 2 years ago

There are many attempts at aligning encoders; https://github.com/FreddeFrallan/Multilingual-CLIP is one example, and the DALL-E 2 prior is another.

rom1504 commented 2 years ago

But yeah, you're right that XL will make things simpler.

lucidrains commented 2 years ago

@jorgemcgomes thanks for all your help, and your recent PR!