"ValueError: Attempting to unscale FP16 gradients" for training dreambooth lora sdxl script

Describe the bug

when I was training dreambooth lora sdxl script on dag dataset, it output the errors as following: ValueError: Attempting to unscale FP16 gradients.

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0" export INSTANCE_DIR="dog" export OUTPUT_DIR="lora-trained-xl" export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"

accelerate launch train_dreambooth_lora_sdxl.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ --pretrained_vae_model_name_or_path=$VAE_PATH \ --output_dir=$OUTPUT_DIR \ --mixed_precision="fp16" \ --instance_prompt="a photo of sks dog" \ --resolution=1024 \ --train_batch_size=1 \ --gradient_accumulation_steps=4 \ --learning_rate=1e-4 \ --report_to="wandb" \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --max_train_steps=500 \ --validation_prompt="A photo of sks dog in a bucket" \ --validation_epochs=25 \ --seed="0" \ --push_to_hub

Logs

Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
11/20/2024 16:20:49 - INFO - __main__ - Distributed environment: MULTI_GPU  Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'dynamic_thresholding_ratio', 'thresholding', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
{'use_quant_conv', 'mid_block_add_attention', 'shift_factor', 'latents_mean', 'use_post_quant_conv', 'latents_std'} was not found in config. Values will be initialized to default values.
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yufeizhang. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.3
wandb: Run data is saved locally in /home/zyf/Documents/diffusers/examples/dreambooth/wandb/run-20241120_162108-mslhtw2v
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run happy-fog-1
wandb: ⭐️ View project at https://wandb.ai/yufeizhang/dreambooth-lora-sd-xl
wandb: 🚀 View run at https://wandb.ai/yufeizhang/dreambooth-lora-sd-xl/runs/mslhtw2v
11/20/2024 16:21:09 - INFO - __main__ - ***** Running training *****
11/20/2024 16:21:09 - INFO - __main__ -   Num examples = 5
11/20/2024 16:21:09 - INFO - __main__ -   Num batches each epoch = 5
11/20/2024 16:21:09 - INFO - __main__ -   Num Epochs = 250
11/20/2024 16:21:09 - INFO - __main__ -   Instantaneous batch size per device = 1
11/20/2024 16:21:09 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
11/20/2024 16:21:09 - INFO - __main__ -   Gradient Accumulation steps = 4
11/20/2024 16:21:09 - INFO - __main__ -   Total optimization steps = 500
Steps:   0%|                                                                                                                  | 0/500 [00:00<?, ?it/s][rank0]:[W1120 16:21:11.853197135 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
diffusion_pytorch_model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████| 335M/335M [01:16<00:00, 4.37MB/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [01:17<00:00,  7.05s/it]
{'image_encoder', 'feature_extractor'} was not found in config. Values will be initialized to default values.█████████| 11/11 [01:17<00:00,  7.05s/it]
                                                                                                                                                     Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.                    | 0/7 [00:00<?, ?it/s]
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
                                                                                                                                                     {'rescale_betas_zero_snr', 'use_exponential_sigmas', 'sigma_min', 'timestep_type', 'sigma_max', 'final_sigmas_type', 'use_beta_sigmas'} was not found in config. Values will be initialized to default values.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 49.88it/s]
11/20/2024 16:22:35 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: A photo of sks dog in a bucket.
{'algorithm_type', 'lower_order_final', 'euler_at_final', 'lambda_min_clipped', 'solver_order', 'thresholding', 'rescale_betas_zero_snr', 'dynamic_thresholding_ratio', 'use_exponential_sigmas', 'variance_type', 'final_sigmas_type', 'use_beta_sigmas', 'use_lu_lambdas', 'solver_type'} was not found in config. Values will be initialized to default values.
Steps:   0%|▎                                                                                 | 2/500 [02:19<20:31,  2.47s/it, loss=0.0147, lr=0.0001]wandb: WARNING Tried to log to step 2 that is less than the current step 3. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
wandb: WARNING Tried to log to step 2 that is less than the current step 3. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
Steps:   0%|▎                                                                                | 2/500 [02:20<20:31,  2.47s/it, loss=0.00152, lr=0.0001]Traceback (most recent call last):
  File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1994, in <module>
    main(args)
  File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1823, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2391, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2335, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1994, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1823, in main
[rank0]:     accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
[rank0]:   File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2391, in clip_grad_norm_
[rank0]:     self.unscale_gradients()
[rank0]:   File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2335, in unscale_gradients
[rank0]:     self.scaler.unscale_(opt)
[rank0]:   File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank0]:     optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank0]:   File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
[rank0]:     raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
wandb: 🚀 View run happy-fog-1 at: https://wandb.ai/yufeizhang/dreambooth-lora-sd-xl/runs/mslhtw2v
wandb: Find logs at: wandb/run-20241120_162108-mslhtw2v/logs
E1120 16:23:36.138844 36924 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 37051) of binary: /home/zyf/anaconda3/envs/myenv/bin/python
Traceback (most recent call last):
  File "/home/zyf/anaconda3/envs/myenv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_dreambooth_lora_sdxl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-11-20_16:23:36
  host      : a03436ebd9bc
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 37051)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

System Info

transformers 4.46.3 pypi_0 pypi python 3.9.20 he870216_1 diffusers 0.32.0.dev0 pypi_0 pypi numpy 1.22.3 pypi_0 pypi torch 2.5.1 pypi_0 pypi torchaudio 0.12.1+cpu pypi_0 pypi torchvision 0.20.1 pypi_0 pypi

NVIDIA GeForce RTX 4090 NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2

Ubuntu 20.04.3 LTS

Who can help?

@sayakpaul @yiyixuxu

huggingface / diffusers