sunovivid / Perturbed-Attention-Guidance

Official implementation of "Perturbed-Attention Guidance"
MIT License

Number of samples and guidance scales for reported results on ImageNet 256 using ADM? #7

Open black0017 opened 1 week ago

black0017 commented 1 week ago

Hey @sunovivid, great work, and congrats on the paper's acceptance at ECCV!

I would like to reproduce the results, and I have the following questions related to the hyperparameters:

  1. How many samples did you use to report FID? The bash script shows 5K, is that right?

  2. What is the guidance scale that gives the optimal FID reported in the paper's Table 1? The README shows the following (without classifier guidance, for the conditional ADM case at 256x256):

    SAMPLE_FLAGS="--timestep_respacing 250 --use_ddim False --classifier_guidance False --classifier_scale 0.0  --guide_scale 4   --guidance_strategies "{\"attention_mask_identity\":[250,0]}"   --drop_layers input_blocks.14.1 input_blocks.16.1 input_blocks.17.1 middle_block.1 "
  3. Can I apply the same code to the ImageNet 128x128 models and resolution, or do I need to specify other attention maps?


It would be extremely helpful, as sampling is super slow with ADM. Thanks a lot!

sunovivid commented 1 week ago

Hi! Thank you for your interest in our work.

  1. We used 50K samples, as stated in Section 5.2 (Pixel-Level Diffusion Models).
  2. The detailed hyperparameter settings, including the guidance scale, are in Appendix A.1 (Experiments on ADM, Quantitative Results). We use PAG scale s = 1.0 (in our formulation, the guidance scale starts from 0.0), which corresponds to the CFG scale of 2.0 used in diffusers SD or SDXL (where the scale starts from 1.0); see the sketch below.
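
For concreteness, a minimal sketch (not the repo's code) of how the two conventions relate; `eps_cond` and `eps_perturbed` stand for the conditional and attention-perturbed noise predictions:

```python
def pag_paper(eps_cond, eps_perturbed, s):
    # Paper convention: the scale starts at 0.0 (s = 0.0 -> no guidance).
    return eps_cond + s * (eps_cond - eps_perturbed)

def pag_diffusers_style(eps_cond, eps_perturbed, g):
    # diffusers/CFG-style convention: the scale starts at 1.0
    # (g = 1.0 -> plain conditional prediction, no guidance).
    return eps_perturbed + g * (eps_cond - eps_perturbed)

# The two agree when g = s + 1, so paper s = 1.0 <-> diffusers-style 2.0.
```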

If you have further questions, feel free to let me know!

black0017 commented 1 week ago

Thank you so much for the clarifications! I missed that in the paper and appendix.

Computing the ADM samples at 256 resolution takes a lot of time and resources. I estimate it will take more than 48 hours on 4x NVIDIA A100 GPUs. Is this to be expected?

Thanks again and have a nice day! I will let you know if I am able to get similar FID values!

black0017 commented 1 week ago

Another thing: do you provide any reference results for the ADM 128x128 conditional model?

Do you think I can still use the same attention maps (the UNet architecture is slightly different)? Please let me know if you have run these experiments (PAG on the ADM 128x128 conditional model), as I am a bit constrained in the number of experiments I can run. Thanks!

sunovivid commented 3 days ago

Hi, I missed your comments. Sorry for the late reply.

> Computing the ADM samples at 256 resolution takes a lot of time and resources. I estimate it will take more than 48 hours on 4x NVIDIA A100 GPUs. Is this to be expected?

Yes, we evaluated FID on 8x NVIDIA 3090 GPUs over about 52 hours. It was a very tough time.

> For the reported results (Table 1), are you using DDIM-25, which I assume is specified as --timestep_respacing ddim25, or DDPM-250 (--timestep_respacing 250)?

We used DDPM-250 to match the setting used in SAG.

> I guess --guide_scale 1 from the paper's formulation is the one I need to specify for the ADM samples.

We used --guidance_scale 2.0, as our codebase's guidance scale starts from 1.0 (0.0 = uncond, 1.0 = cond; please refer to gaussian_diffusion/gaussian_diffusion.py).

I hope you can achieve the same results! If you need any more specifics, feel free to ask.

sunovivid commented 3 days ago

> Another thing: do you provide any reference results for the ADM 128x128 conditional model?

We tested it on ImageNet 128 early on, and it works quite well. We didn't report this, though, because ADM does not provide an unconditional 128x128 model.

It has a slightly different architecture than the 256 models, with fewer attention layers. But the overall architecture is similar, and it should work well if you perturb near the middle (m) layers, for example, i13, i14, m, o2, o5, o6.
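
As a rough sketch, those shorthand names would translate to the module-path format used by the 256x256 --drop_layers example above roughly as follows; the exact indices for the 128x128 UNet are an assumption here, so please verify them against the model definition:

```python
# Hypothetical mapping of the shorthand names (i13, i14, m, o2, o5, o6) to
# ADM module paths, following the input_blocks.N.1 / middle_block.1 /
# output_blocks.N.1 naming used by the 256x256 --drop_layers example above.
# Verify these indices against the 128x128 UNet before running.
drop_layers_128 = {
    "i13": "input_blocks.13.1",
    "i14": "input_blocks.14.1",
    "m":   "middle_block.1",
    "o2":  "output_blocks.2.1",
    "o5":  "output_blocks.5.1",
    "o6":  "output_blocks.6.1",
}
```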

I attached results from ImageNet 128 for reference, although these use dropout on the self-attention map rather than replacing the self-attention map with an identity matrix. The identity replacement works even better.

[attached: ImageNet 128 results]

So I suggest you try PAG on the 128x128 model, which can reduce the total evaluation time by a large margin.