lucidrains / imagen-pytorch

Implementation of Imagen, Google's Text-to-Image Neural Network, in Pytorch
MIT License

Any thoughts on this? #250

Open Mut1nyJD opened 1 year ago

Mut1nyJD commented 1 year ago

Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation https://arxiv.org/abs/2210.09549

I really wonder whether a SwinTransformer-based UNet is that much better?

DaiQiangReal commented 1 year ago

Hoping for an implementation, lol

lucidrains commented 1 year ago

@Mut1nyJD i doubt it, mainly because with the current unet you can place as many self-attention blocks as you wish at the end of each stage

not sure what Swin has over that besides being able to have it at earlier blocks, but I have given the option for linear attention there
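The linear attention mentioned here can be sketched briefly. This is not the repo's actual implementation, just a minimal numpy illustration of the kernel-feature trick (using the elu + 1 feature map from Katharopoulos et al.) that makes attention linear in sequence length, which is what makes it affordable at the earlier, higher-resolution blocks:

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: keeps features positive so the normalizer is well-defined
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # O(n * d^2): associate phi(K)^T V first instead of forming Q K^T (n x n)
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v                    # (d, d_v) summary, independent of seq length
    z = q @ k.sum(axis=0)           # (n,) per-query normalizer
    return (q @ kv) / z[:, None]

def quadratic_reference(q, k, v):
    # same kernelized attention computed the O(n^2) way, for checking
    q, k = feature_map(q), feature_map(k)
    attn = q @ k.T
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ v
```

Both functions compute the same output; only the order of matrix products differs, which is the whole point: the (n x n) attention matrix is never materialized.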

lucidrains commented 1 year ago

@Mut1nyJD i do agree that there is still room for more research on optimal unet design though

Mut1nyJD commented 1 year ago

@lucidrains Yes, I agree. I'm not sure what it actually brings to the table; I would have to read the paper more thoroughly to see what the rationale behind it is and whether they have ablation studies on the different aspects. But then I also have no real-life experience with the Swin architecture in general.

Sorry, being a bit lazy and I have not looked in more detail, but does your Imagen UNet implementation differ much from the one in your denoising diffusion branch?

lucidrains commented 1 year ago

@Mut1nyJD i'm not a fan of Swin tbh, there are better alternatives

yea it is a bit different, more customizable for sure

Mut1nyJD commented 1 year ago

@lucidrains

Ok, thank you for the quick answer. I will have a closer look at the implementation then.

I still have not gotten around to trying Imagen, as text conditioning really is not something I currently have a use for. So I'm not seeing the big advantage of Imagen over unconditioned denoising diffusion outside of text conditioning, except maybe the two-UNet system for upscaling and the potentially faster convergence and training that comes with it.
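The two-UNet cascade mentioned above can be sketched as a simple data flow: a base model samples at low resolution, and a super-resolution model refines an upsampled copy of that sample. The model callables below are hypothetical stand-ins, not the repo's API; only the wiring between the stages is the point:

```python
import numpy as np

def upsample_nearest(img, factor):
    # (H, W, C) -> (H*factor, W*factor, C) nearest-neighbor upsampling
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def cascade_sample(base_model, sr_model, shape_lo=(64, 64, 3), factor=4):
    # base stage: sample a low-resolution image (e.g. 64x64)
    lo = base_model(shape_lo)
    # conditioning: upsample the base output to the target resolution
    cond = upsample_nearest(lo, factor)
    # super-res stage: refine conditioned on the upsampled image (e.g. 256x256)
    hi = sr_model(cond)
    return lo, hi
```

Because each stage only has to learn its own resolution range, the individual UNets are smaller and can converge faster than a single model trained end-to-end at full resolution.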

Mut1nyJD commented 1 year ago

Hi @lucidrains

Not sure if you have seen the latest work from NVIDIA:

https://deepimagination.cc/eDiffi/

One thing of note is that it kind of proves the general feeling that a T5 LLM is in general a better embedding / prior than CLIP.

However two interesting bits: