Working with images and text embeddings of different shape

I was wondering whether I can use latents from a VAE as the input to the Imagen UNet, as latent diffusion models do. The issue is that a VAE changes the shape of the image (e.g. a 1-dimensional array, or a 4-channel image at lower resolution). What do I have to change to be able to do that?
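To make the shape question concrete, here is a minimal sketch of the mismatch I mean. The helper names (`latent_shape`, `fold_flat_latent`) are hypothetical and not part of imagen-pytorch, and the 4 latent channels and 8x downsampling factor are assumptions borrowed from typical latent diffusion setups, not something my VAE necessarily matches.

```python
import math


def latent_shape(image_shape, latent_channels=4, downsample=8):
    """Shape of a spatial VAE latent for a (C, H, W) image.

    Assumes the encoder shrinks H and W by `downsample` and emits
    `latent_channels` channels (both values are assumptions).
    """
    _, height, width = image_shape
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def fold_flat_latent(length, channels):
    """Shape a 1-D latent of `length` values would take as a square (C, S, S) grid."""
    if length % channels:
        raise ValueError("latent length must be divisible by the channel count")
    side = math.isqrt(length // channels)
    if side * side * channels != length:
        raise ValueError("latent cannot be folded into a square grid")
    return (channels, side, side)


# A 3-channel 256x256 image becomes a 4-channel 32x32 latent.
print(latent_shape((3, 256, 256)))

# A flat latent of 4096 values can be viewed as the same 4x32x32 grid.
print(fold_flat_latent(4096, 4))
```

In both cases the UNet would have to accept 4 input and output channels instead of 3, which is the part I am unsure how to configure.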

I was also wondering about changing the text embeddings that are used. The issue there is similar: embeddings from another encoder might have a different shape. Is it feasible to use other embeddings with minimal code changes?
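As an illustration of what I have in mind for the text side, the sketch below maps per-token embeddings of one width onto another width with a linear projection. It is plain Python with hypothetical names (`make_projection`, `project`); the random weights only stand in for a layer that would be learned in practice, and the widths 512 and 768 are arbitrary examples.

```python
import random


def make_projection(d_in, d_out, seed=0):
    """Random (d_in x d_out) weight matrix, a placeholder for a learned layer."""
    rng = random.Random(seed)
    scale = d_in ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_out)] for _ in range(d_in)]


def project(tokens, weight):
    """Map a (seq_len x d_in) list of token embeddings to (seq_len x d_out)."""
    d_in = len(weight)
    columns = list(zip(*weight))  # one tuple of d_in weights per output feature
    projected = []
    for token in tokens:
        if len(token) != d_in:
            raise ValueError("token width does not match the projection input width")
        projected.append([sum(x * w for x, w in zip(token, col)) for col in columns])
    return projected


# Five tokens from a hypothetical encoder with 512-dim embeddings.
tokens = [[0.01 * i for i in range(512)] for _ in range(5)]
weight = make_projection(512, 768)
out = project(tokens, weight)
print(len(out), len(out[0]))
```

The sequence length is unchanged and only the feature width differs, so my question is whether the model can be told the new width directly or whether a projection like this is needed in front of it.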

Thank you!

lucidrains / imagen-pytorch

Working with images and text embeddings of different shape #336