Closed leopoldmaillard closed 1 year ago
Maybe cc @anton-l?
I guess this problem will be addressed further in the upcoming Unit 2 (Fine-Tuning and Guidance) of the HF Diffusion Course!
Also, @lewtun mentioned "When dealing with higher-resolution inputs you may want to use more down and up-blocks, and keep the attention layers only at the lowest resolution (bottom) layers to reduce memory usage." in the course's introductory notebook.
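Not an authoritative recommendation, but here is a rough sketch of what that advice could look like with diffusers' UNet2DModel, e.g. for 128x128 training images (the block layout and channel widths below are just illustrative guesses):

```python
from diffusers import UNet2DModel

# Illustrative layout for 128x128 inputs: five down/up blocks, with
# self-attention only in the two lowest-resolution stages (16x16 and 8x8)
# to keep memory usage manageable.
model = UNet2DModel(
    sample_size=128,
    in_channels=3,
    out_channels=3,
    layers_per_block=2,
    block_out_channels=(128, 128, 256, 256, 512),
    down_block_types=(
        "DownBlock2D",      # operates at 128x128
        "DownBlock2D",      # 64x64
        "DownBlock2D",      # 32x32
        "AttnDownBlock2D",  # attention at 16x16
        "AttnDownBlock2D",  # attention at 8x8 (bottleneck resolution)
    ),
    up_block_types=(        # mirror of the down path, lowest resolution first
        "AttnUpBlock2D",
        "AttnUpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
        "UpBlock2D",
    ),
)
```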
cc @anton-l again here
Hi @leopoldmaillard! I haven't explored the DDPM hyperparameters extensively yet, so I can't recommend anything concrete for resolutions higher than 64x64. But as a first step I would adjust the number of up/down blocks in a way that would leave you with depth*16*16 or depth*8*8 features for the middle block of the UNet. The configs of some pretrained DDPM models at https://huggingface.co/google might give you some inspiration: https://huggingface.co/google/ddpm-church-256/blob/main/config.json
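To make that depth*16*16 / depth*8*8 rule of thumb concrete, here is a small sanity check (just a sketch that assumes the UNet2DModel convention where every down block except the last one halves the spatial resolution):

```python
def middle_block_resolution(sample_size: int, num_down_blocks: int) -> int:
    """Spatial size of the feature map that reaches the UNet's middle block.

    Assumes the UNet2DModel convention where each down block except the
    last one halves the spatial resolution, i.e. num_down_blocks blocks
    give (num_down_blocks - 1) downsampling steps.
    """
    return sample_size // 2 ** (num_down_blocks - 1)

# If I read the linked config right, the 256x256 church model uses
# six down blocks, which gives an 8x8 bottleneck.
print(middle_block_resolution(256, 6))  # 8

# For e.g. 128x128 inputs, four or five blocks keep the bottleneck
# at 16x16 or 8x8 respectively.
print(middle_block_resolution(128, 4))  # 16
print(middle_block_resolution(128, 5))  # 8
```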
Hello @anton-l, thank you for your insight!
I also found out that Dhariwal & Nichol discuss hyperparameter tuning of DDPMs in their paper Diffusion Models Beat GANs on Image Synthesis.
Will close this for now!
Hi there! I am currently training a DDPM model on a custom image dataset following the cool unconditional_image_generation example script.
Since I don't have the compute to perform comprehensive hyperparameter tuning of my architecture, I was wondering if there are any common intuitions when designing the UNet denoiser: width/length of the residual blocks, number and positions of the attention blocks, etc., with respect to the number of samples in the training set as well as their resolution. If anyone has wide experience in training DMs, it would be super cool to share insights here or in a dedicated blog post, such as the one discussing the choice of hyperparameters when training DreamBooth.
Thank you! 🤗