Describe the bug

When training the DreamBooth LoRA SDXL script (train_dreambooth_lora_sdxl.py) on the dog dataset, the run fails after a couple of optimization steps with: ValueError: Attempting to unscale FP16 gradients.

Reproduction

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export INSTANCE_DIR="dog"
export OUTPUT_DIR="lora-trained-xl"
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"

accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--pretrained_vae_model_name_or_path=$VAE_PATH \
--output_dir=$OUTPUT_DIR \
--mixed_precision="fp16" \
--instance_prompt="a photo of sks dog" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=4 \
--learning_rate=1e-4 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=500 \
--validation_prompt="A photo of sks dog in a bucket" \
--validation_epochs=25 \
--seed="0" \
--push_to_hub
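
For context, the exception comes from torch's GradScaler, which refuses to unscale gradients that are already in fp16. A minimal standalone sketch (my illustration, not code from the script) that raises the same ValueError:

import torch

# All parameters in fp16, so their gradients come back in fp16 too.
model = torch.nn.Linear(4, 4, device="cuda", dtype=torch.float16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.amp.GradScaler("cuda")

loss = model(torch.randn(2, 4, device="cuda", dtype=torch.float16)).sum()
scaler.scale(loss).backward()
# GradScaler only unscales fp32 gradients:
# ValueError: Attempting to unscale FP16 gradients.
scaler.unscale_(optimizer)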
Logs
Using RTX 4000 series which doesn't support faster communication speedups. Ensuring P2P and IB communications are disabled.
11/20/2024 16:20:49 - INFO - __main__ - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'dynamic_thresholding_ratio', 'thresholding', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
{'use_quant_conv', 'mid_block_add_attention', 'shift_factor', 'latents_mean', 'use_post_quant_conv', 'latents_std'} was not found in config. Values will be initialized to default values.
{'attention_type', 'dropout', 'reverse_transformer_layers_per_block'} was not found in config. Values will be initialized to default values.
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: yufeizhang. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.18.3
wandb: Run data is saved locally in /home/zyf/Documents/diffusers/examples/dreambooth/wandb/run-20241120_162108-mslhtw2v
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run happy-fog-1
wandb: ⭐️ View project at https://wandb.ai/yufeizhang/dreambooth-lora-sd-xl
wandb: 🚀 View run at https://wandb.ai/yufeizhang/dreambooth-lora-sd-xl/runs/mslhtw2v
11/20/2024 16:21:09 - INFO - __main__ - ***** Running training *****
11/20/2024 16:21:09 - INFO - __main__ - Num examples = 5
11/20/2024 16:21:09 - INFO - __main__ - Num batches each epoch = 5
11/20/2024 16:21:09 - INFO - __main__ - Num Epochs = 250
11/20/2024 16:21:09 - INFO - __main__ - Instantaneous batch size per device = 1
11/20/2024 16:21:09 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
11/20/2024 16:21:09 - INFO - __main__ - Gradient Accumulation steps = 4
11/20/2024 16:21:09 - INFO - __main__ - Total optimization steps = 500
Steps: 0%| | 0/500 [00:00<?, ?it/s][rank0]:[W1120 16:21:11.853197135 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
diffusion_pytorch_model.safetensors: 100%|█████████████████████████████████████████████████████████████████████████| 335M/335M [01:16<00:00, 4.37MB/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [01:17<00:00, 7.05s/it]
{'image_encoder', 'feature_extractor'} was not found in config. Values will be initialized to default values.
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
{'rescale_betas_zero_snr', 'use_exponential_sigmas', 'sigma_min', 'timestep_type', 'sigma_max', 'final_sigmas_type', 'use_beta_sigmas'} was not found in config. Values will be initialized to default values.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 49.88it/s]
11/20/2024 16:22:35 - INFO - __main__ - Running validation...
Generating 4 images with prompt: A photo of sks dog in a bucket.
{'algorithm_type', 'lower_order_final', 'euler_at_final', 'lambda_min_clipped', 'solver_order', 'thresholding', 'rescale_betas_zero_snr', 'dynamic_thresholding_ratio', 'use_exponential_sigmas', 'variance_type', 'final_sigmas_type', 'use_beta_sigmas', 'use_lu_lambdas', 'solver_type'} was not found in config. Values will be initialized to default values.
Steps: 0%|▎ | 2/500 [02:19<20:31, 2.47s/it, loss=0.0147, lr=0.0001]wandb: WARNING Tried to log to step 2 that is less than the current step 3. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
wandb: WARNING Tried to log to step 2 that is less than the current step 3. Steps must be monotonically increasing, so this data will be ignored. See https://wandb.me/define-metric to log data out of order.
Steps: 0%|▎ | 2/500 [02:20<20:31, 2.47s/it, loss=0.00152, lr=0.0001]Traceback (most recent call last):
File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1994, in <module>
main(args)
File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1823, in main
accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2391, in clip_grad_norm_
self.unscale_gradients()
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2335, in unscale_gradients
self.scaler.unscale_(opt)
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1994, in <module>
[rank0]: main(args)
[rank0]: File "/home/zyf/Documents/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1823, in main
[rank0]: accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
[rank0]: File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2391, in clip_grad_norm_
[rank0]: self.unscale_gradients()
[rank0]: File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/accelerator.py", line 2335, in unscale_gradients
[rank0]: self.scaler.unscale_(opt)
[rank0]: File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
[rank0]: optimizer_state["found_inf_per_device"] = self._unscale_grads_(
[rank0]: File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
[rank0]: raise ValueError("Attempting to unscale FP16 gradients.")
[rank0]: ValueError: Attempting to unscale FP16 gradients.
wandb: 🚀 View run happy-fog-1 at: https://wandb.ai/yufeizhang/dreambooth-lora-sd-xl/runs/mslhtw2v
wandb: Find logs at: wandb/run-20241120_162108-mslhtw2v/logs
E1120 16:23:36.138844 36924 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 37051) of binary: /home/zyf/anaconda3/envs/myenv/bin/python
Traceback (most recent call last):
File "/home/zyf/anaconda3/envs/myenv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zyf/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_dreambooth_lora_sdxl.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-11-20_16:23:36
host : a03436ebd9bc
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 37051)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
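
Possibly relevant (my reading of the traceback, not verified against this setup): with --mixed_precision="fp16", only the frozen base weights should sit in fp16; any parameter that receives gradients has to stay in fp32, which is what diffusers' cast_training_params training utility is for. A hedged sketch of that pattern on standalone objects, not a confirmed patch for the script:

import torch
from diffusers import UNet2DConditionModel
from diffusers.training_utils import cast_training_params
from peft import LoraConfig

# Frozen base weights can live in fp16 to save memory.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    subfolder="unet",
    torch_dtype=torch.float16,
)
unet.requires_grad_(False)
unet.add_adapter(LoraConfig(r=4, lora_alpha=4,
                            target_modules=["to_k", "to_q", "to_v", "to_out.0"]))
# Upcast only the trainable (LoRA) parameters back to fp32 so that
# GradScaler.unscale_ never sees fp16 gradients.
cast_training_params(unet, dtype=torch.float32)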
System Info
transformers: 4.46.3
python: 3.9.20
diffusers: 0.32.0.dev0
numpy: 1.22.3
torch: 2.5.1
torchaudio: 0.12.1+cpu
torchvision: 0.20.1
GPU: NVIDIA GeForce RTX 4090 (NVIDIA-SMI 535.183.01, Driver Version 535.183.01, CUDA Version 12.2)
OS: Ubuntu 20.04.3 LTS
Who can help?
@sayakpaul @yiyixuxu