sunovivid / Perturbed-Attention-Guidance

Official implementation of "Perturbed-Attention Guidance"
MIT License

Number of samples and guidance scales for reported results on ImageNet 256 using ADM? #7

Open black0017 opened 1 week ago

black0017 commented 1 week ago

Hey @sunovivid, great work, and congrats on the paper's acceptance at ECCV!

I would like to reproduce the results, and I have the following questions related to the hyperparameters:

  1. How many samples did you use to report FID? The bash script shows 5K, is that right?

  2. What is the guidance scale that gives the optimal FID reported in the paper's Table 1? The README shows the following (without classifier guidance, for the conditional ADM case at 256x256):

    SAMPLE_FLAGS="--timestep_respacing 250 --use_ddim False --classifier_guidance False --classifier_scale 0.0  --guide_scale 4   --guidance_strategies "{\"attention_mask_identity\":[250,0]}"   --drop_layers input_blocks.14.1 input_blocks.16.1 input_blocks.17.1 middle_block.1 "
  3. Can I apply the same code to the ImageNet 128x128 models and resolution, or do I need to specify other attention maps?


It would be extremely helpful, as sampling is super slow with ADM. Thanks a lot!

sunovivid commented 1 week ago

Hi! Thank you for your interest in our work.

  1. We used 50K samples, as stated in Section 5.2 (Pixel-Level Diffusion Models).
  2. The detailed hyperparameter settings, including the guidance scale, are in Appendix A.1 (Experiments on ADM, Quantitative Results). We use PAG scale s = 1.0 (in our formulation, the guidance scale starts from 0.0), which corresponds to the CFG scale of 2.0 used in diffusers SD or SDXL (where the scale starts from 1.0); see the sketch below.
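
For concreteness, a minimal sketch (not the repo's code) of how the two conventions relate; `eps_cond` and `eps_perturbed` stand for the conditional and attention-perturbed noise predictions:

```python
def pag_paper(eps_cond, eps_perturbed, s):
    # Paper convention: the scale starts at 0.0 (s = 0.0 -> no guidance).
    return eps_cond + s * (eps_cond - eps_perturbed)

def pag_diffusers_style(eps_cond, eps_perturbed, g):
    # diffusers/CFG-style convention: the scale starts at 1.0
    # (g = 1.0 -> plain conditional prediction, no guidance).
    return eps_perturbed + g * (eps_cond - eps_perturbed)

# The two agree when g = s + 1, so paper s = 1.0 <-> diffusers-style 2.0.
```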

If you have further questions, feel free to let me know!

black0017 commented 1 week ago

Thank you so much for the clarifications! I missed that in the paper and appendix.

Computing the ADM samples at 256 resolution takes a lot of time and resources. I estimate it will take more than 48 hours on 4x NVIDIA A100 GPUs. Is this to be expected?

Thanks again and have a nice day! I will let you know if I am able to get similar FID values!

black0017 commented 1 week ago

Another thing: do you provide any reference results for the ADM 128x128 conditional model?

Do you think I can still use the same attention maps (the UNet architecture is slightly different)? Please let me know if you have run these experiments (PAG on the ADM 128x128 conditional model), as I am a bit constrained in the number of experiments I can run. Thanks!

sunovivid commented 3 days ago

Hi, I missed your comments. Sorry for the late reply.

> Computing the ADM samples at 256 resolution takes a lot of time and resources. I estimate it will take more than 48 hours on 4x NVIDIA A100 GPUs. Is this to be expected?

Yes, we evaluated FID on 8x NVIDIA 3090 GPUs over about 52 hours. It was a very tough time.

> For the reported results (Table 1), are you using DDIM-25, which I assume is specified as --timestep_respacing ddim25, or DDPM-250 (--timestep_respacing 250)?

We used DDPM-250 to match the setting used in SAG.

> I guess --guide_scale 1 from the paper's formulation is the one I need to specify for the ADM samples.

We used --guidance_scale 2.0, as our codebase's guidance scale starts from 1.0 (0.0 = uncond, 1.0 = cond; please refer to gaussian_diffusion/gaussian_diffusion.py).

I hope you can achieve the same results! If you need any more specifics, feel free to ask.

sunovivid commented 3 days ago

> Another thing: do you provide any reference results for the ADM 128x128 conditional model?

We tested it on ImageNet 128 early on, and it works quite well. We didn't report this, though, because ADM does not provide an unconditional 128x128 model.

It has a slightly different architecture than the 256 models, with fewer attention layers. But the overall architecture is similar, and it should work well if you perturb near the middle (m) layers, for example, i13, i14, m, o2, o5, o6.
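
As a rough sketch, those shorthand names would translate to the module-path format used by the 256x256 --drop_layers example above roughly as follows; the exact indices for the 128x128 UNet are an assumption here, so please verify them against the model definition:

```python
# Hypothetical mapping of the shorthand names (i13, i14, m, o2, o5, o6) to
# ADM module paths, following the input_blocks.N.1 / middle_block.1 /
# output_blocks.N.1 naming used by the 256x256 --drop_layers example above.
# Verify these indices against the 128x128 UNet before running.
drop_layers_128 = {
    "i13": "input_blocks.13.1",
    "i14": "input_blocks.14.1",
    "m":   "middle_block.1",
    "o2":  "output_blocks.2.1",
    "o5":  "output_blocks.5.1",
    "o6":  "output_blocks.6.1",
}
```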

I attached results from ImageNet 128 for reference, although these use dropout on the self-attention map rather than replacing the self-attention map with an identity matrix. The identity replacement works even better.

[attached: ImageNet 128 results]

So I suggest you try PAG on the 128x128 model, which can reduce the total evaluation time by a large margin.