Closed: chg0901 closed this issue 3 months ago
The approach outlined in this article seems to diverge from the typical diffusion models and diffusion Transformers. It appears to be an architecture that integrates CNNs with Transformers. As it's not specifically geared towards video generation, categorizing it might be a bit tricky.
Right, the main structure is a UNet or AE.
Are there other good works that use only CNNs, without a Transformer?
This issue is resolved by #345, and we have added a section called "diffusion UNet".
Add paper 'Taming Transformers for High-Resolution Image Synthesis' to 'Diffusion Transformer'
CVPR 2021 paper: https://openaccess.thecvf.com/content/CVPR2021/papers/Esser_Taming_Transformers_for_High-Resolution_Image_Synthesis_CVPR_2021_paper.pdf
GitHub: https://github.com/CompVis/taming-transformers
Project: https://compvis.github.io/taming-transformers/