huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

fp16 overflow when combining a scheduler that supports `rescale_betas_zero_snr=True` with InstructPix2PixPipeline #6981

Closed erliding closed 7 months ago

erliding commented 9 months ago

Describe the bug

The InstructPix2Pix pipeline has a flag scheduler_is_in_sigma_space = hasattr(self.scheduler, "sigmas"), which causes the latent values to be scaled back and forth between sigma space for the classifier-free guidance calculation. This seems unnecessary, and in particular, when setting rescale_betas_zero_snr=True for e.g. EulerDiscreteScheduler or EulerAncestralDiscreteScheduler, it can generate black images at a high rate due to fp16 overflow. Meanwhile, training an InstructPix2Pix model with rescale_betas_zero_snr enabled is important for improving how well the model preserves properties of the input image.
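A minimal sketch of the combination being described, assuming the public timbrooks/instruct-pix2pix checkpoint (the issue does not name a specific model) and omitting the actual image/prompt inputs:

import torch
from diffusers import StableDiffusionInstructPix2PixPipeline, EulerAncestralDiscreteScheduler

# Load the InstructPix2Pix pipeline in fp16 and swap in a sigma-based scheduler
# with zero-terminal-SNR rescaling enabled, i.e. the combination described above.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipe.scheduler.config, rescale_betas_zero_snr=True
)
# Running pipe(prompt, image=..., ...) with this setup frequently yields black
# images because intermediate fp16 values overflow (see Reproduction below).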

Reproduction

Overflow can happen quite often at https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py#L439, as rescale_betas_zero_snr scales the maximum sigma from 14.61 to 4096.
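For intuition, a tiny numeric sketch of why the larger sigma matters in fp16 (illustrative numbers only, not the pipeline's actual code):

import torch

# fp16 tops out at ~65504, so any intermediate tensor whose magnitude scales
# with sigma gets close to that limit once the largest sigma is ~4096 instead
# of ~14.6.
print(torch.finfo(torch.float16).max)  # 65504.0

sigma = torch.tensor(4096.0, dtype=torch.float16)
pred = torch.full((1, 4, 64, 64), 18.0, dtype=torch.float16)  # plausible peak magnitude
scaled = sigma * pred                  # 4096 * 18 = 73728 > 65504 -> inf
print(torch.isinf(scaled).any())       # tensor(True)
# Downstream NaNs produced from these infs are what show up as black output images.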

Logs

No response

System Info

main branch

Who can help?

@patil-suraj @sayakpaul

sayakpaul commented 9 months ago

Thanks for reporting.

Meanwhile, training an InstructPix2Pix model with rescale_betas_zero_snr enabled is important for improving how well the model preserves properties of the input image.

Can you elaborate this a bit more?

In any case, I will let @patil-suraj comment further here.

sapkun commented 9 months ago

I have a similar problem when training InstructPix2Pix SDXL with fp16: the validation images generated during training come out black. If I switch from fp16 to fp32, the problem goes away.

sayakpaul commented 9 months ago

Did you try using a better VAE when doing the SDXL training?

erliding commented 9 months ago

Thanks for reporting.

Meanwhile, training an InstructPix2Pix model with rescale_betas_zero_snr enabled is important for improving how well the model preserves properties of the input image.

Can you elaborate this a bit more?

In any case, I will let @patil-suraj comment further here.

So basically, the gap between training with non-zero SNR at the last timestep and inference starting from pure Gaussian noise matters more for InstructPix2Pix. During training the model learns to edit the image from both the input image (as the condition) and its diffused latent, which always retains some information about the original input; at inference time that noisy information is not available. I find that passing a diffused input latent to the model, even with an alpha smaller than 0.0682, makes the model preserve properties of the input better than starting from pure Gaussian noise. A similar effect can be achieved by training the model with zero terminal SNR, in which case it no longer matters that inference starts from Gaussian noise.
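A hypothetical sketch of the trick described above (the helper name is made up, not diffusers API), assuming the 0.0682 figure refers to the terminal signal coefficient sqrt(alpha_bar_T) of the default SD schedule:

import torch

# Instead of starting inference from pure Gaussian noise, lightly diffuse the
# encoded input image so the initial latent still carries a trace of it.
def make_initial_latent(image_latent: torch.Tensor, alpha: float = 0.0682) -> torch.Tensor:
    noise = torch.randn_like(image_latent)
    # forward-diffusion mix: alpha * x0 + sqrt(1 - alpha^2) * noise
    return alpha * image_latent + (1.0 - alpha ** 2) ** 0.5 * noise

Training with zero terminal SNR removes the need for this, since the model then expects no residual signal at the final timestep.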

sapkun commented 9 months ago

Did you try using a better VAE when doing the SDXL training?

No, I will try this. Thank you!

By the way, I have a related question about InstructPix2Pix SDXL training. Training on a single GPU works, but training with multiple GPUs runs out of GPU memory. Below are my accelerate env and training config:

- `Accelerate` version: 0.27.0
- Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.35 GB
- GPU type: NVIDIA A100-PCIE-40GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 3
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: 0,1,2
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix_sdxl.py \
    --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
    --dataset_name=$DATASET_ID \
    --use_ema \
    --enable_xformers_memory_efficient_attention \
    --resolution=512 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --seed=42 \
    --val_image_url_or_path="https://datasets-server.huggingface.co/assets/fusing/instructpix2pix-1000-samples/--/fusing--instructpix2pix-1000-samples/train/23/input_image/image.jpg" \
    --validation_prompt="make it in japan" \
    --use_8bit_adam \
    --report_to=wandb

sayakpaul commented 9 months ago

You can refer to https://github.com/sayakpaul/instructpix2pix-sdxl for better utilization of GPUs.

sayakpaul commented 9 months ago

So basically, the gap between training with non-zero SNR at the last timestep and inference starting from pure Gaussian noise matters more for InstructPix2Pix. During training the model learns to edit the image from both the input image (as the condition) and its diffused latent, which always retains some information about the original input; at inference time that noisy information is not available. I find that passing a diffused input latent to the model, even with an alpha smaller than 0.0682, makes the model preserve properties of the input better than starting from pure Gaussian noise. A similar effect can be achieved by training the model with zero terminal SNR, in which case it no longer matters that inference starts from Gaussian noise.

Oh nice. Do you have some results for us to see? Just curious.

erliding commented 9 months ago

If upcasting of the SDXL VAE is enabled (which defaults to true) and you still get black images, it is probably for the same reason I mentioned.

erliding commented 9 months ago

So basically, the gap between training with non-zero SNR at the last timestep and inference starting from pure Gaussian noise matters more for InstructPix2Pix. During training the model learns to edit the image from both the input image (as the condition) and its diffused latent, which always retains some information about the original input; at inference time that noisy information is not available. I find that passing a diffused input latent to the model, even with an alpha smaller than 0.0682, makes the model preserve properties of the input better than starting from pure Gaussian noise. A similar effect can be achieved by training the model with zero terminal SNR, in which case it no longer matters that inference starts from Gaussian noise.

Oh nice. Do you have some results for us to see? Just curious.

Sorry, this isn't from my personal development, so I don't think I'm allowed to share.

sapkun commented 9 months ago

You can refer to https://github.com/sayakpaul/instructpix2pix-sdxl for better utilization of GPUs.

Thank you!

yiyixuxu commented 9 months ago

hi @erliding:

I agree this code is unnecessary - you're welcome to open a PR to remove it :)

https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_instruct_pix2pix.py#L439

However, I don't think this is what causes the fp16 overflow - note that we do this step inside scheduler.step anyway: https://github.com/huggingface/diffusers/blob/3a7e481611bc299416aaeed4207086d9ddca5852/src/diffusers/schedulers/scheduling_euler_ancestral_discrete.py#L411
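For reference, a paraphrased sketch (not the library source) of the epsilon-prediction conversion that scheduler.step performs at the linked line:

import torch

# The scheduler itself converts the model output back to sample space with the
# same sigma-scaled arithmetic, so the large sigma enters the computation inside
# scheduler.step regardless of whether the pipeline converts beforehand.
def predict_original_sample(sample: torch.Tensor, model_output: torch.Tensor, sigma: float) -> torch.Tensor:
    return sample - sigma * model_output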

sapkun commented 8 months ago

You can refer to https://github.com/sayakpaul/instructpix2pix-sdxl for better utilization of GPUs.

Hi! I tried to train exactly according to the guidelines, which recommend using webdataset, but I still get an out-of-memory error with multiple GPUs, while it works with a single GPU. It is very confusing!

export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export MODEL_ID="/path/to/stable-diffusion-xl-base-1.0"
export DATASET_ID="/path/to/instructpix2pix-sdxl/upscale_swin2sr/instructpix2pix-clip-filtered_wds/{00000..00001}.tar"
export OUTPUT_DIR="sdxl-instructpix2pix"
export VAE_PATH="/path/to/sdxl-vae-fp16-fix"

accelerate launch --multi_gpu train_instruct_pix2pix_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_ID \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --dataset_path="$DATASET_ID" \
  --use_ema \
  --enable_xformers_memory_efficient_attention \
  --resolution=256 --random_flip \
  --per_gpu_batch_size=1 --gradient_accumulation_steps=4 \
  --num_workers=1 \
  --max_train_steps=10000 \
  --checkpointing_steps=2500 \
  --learning_rate=1e-5 --lr_warmup_steps=0 \
  --mixed_precision=fp16 \
  --seed=42 \
  --output_dir=$OUTPUT_DIR

- `Accelerate` version: 0.27.2
- Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.12
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 1007.35 GB
- GPU type: NVIDIA A100-PCIE-40GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 3
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'gradient_clipping': 1.0, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

sayakpaul commented 8 months ago

We are digressing from the original thread so, let’s not do that.

But what I can tell you is that the repository works perfectly fine on 8 80GB A100s. Other than that, I cannot help you much.

sapkun commented 8 months ago

We are digressing from the original thread so, let’s not do that.

But what I can tell you is that the repository works perfectly fine on 8 80GB A100s. Other than that, I cannot help you much.

I understand, but it's really weird that even with a very small dataset (only 500 image pairs) it runs out of memory on multiple GPUs, while a large dataset on a single GPU works fine.

Anyway, that's just how it is. Another thing I wanted to mention is that I'm eagerly awaiting the release of instructpix2pix with Stable Cascade. I hope it comes out soon.

Thank you!

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sayakpaul commented 7 months ago

I think @erliding fixed this? So, I am closing the issue.