huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

ValueError: Attempting to unscale FP16 gradients. when running examples/text_to_image/train_text_to_image_lora.py #7330

Closed: HemalPatil closed this 3 months ago

HemalPatil commented 7 months ago

Describe the bug

I tried following the LoRA training example from the Hugging Face tutorials using an editable install of diffusers. The run failed with the error below.

Reproduction

Default accelerate config:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Launch script:

export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATASET_NAME=~/sdhf/training/set1
export OUTPUT_DIR=./set1lora

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir=${OUTPUT_DIR} \
  --checkpointing_steps=500 \
  --validation_prompt="A red sports motorcycle at the starting position of a racetrack." \
  --seed=1337

Logs

03/15/2024 00:25:48 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'timestep_spacing', 'variance_type', 'sample_max_value', 'prediction_type', 'rescale_betas_zero_snr', 'thresholding', 'dynamic_thresholding_ratio', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
{'latents_std', 'scaling_factor', 'latents_mean', 'force_upcast'} was not found in config. Values will be initialized to default values.
{'time_cond_proj_dim', 'dropout', 'num_class_embeds', 'upcast_attention', 'use_linear_projection', 'resnet_skip_time_act', 'only_cross_attention', 'encoder_hid_dim', 'time_embedding_type', 'resnet_time_scale_shift', 'time_embedding_dim', 'encoder_hid_dim_type', 'num_attention_heads', 'addition_embed_type_num_heads', 'class_embed_type', 'addition_embed_type', 'mid_block_type', 'attention_type', 'resnet_out_scale_factor', 'mid_block_only_cross_attention', 'dual_cross_attention', 'conv_out_kernel', 'conv_in_kernel', 'timestep_post_act', 'reverse_transformer_layers_per_block', 'time_embedding_act_fn', 'projection_class_embeddings_input_dim', 'addition_time_embed_dim', 'transformer_layers_per_block', 'cross_attention_norm', 'class_embeddings_concat'} was not found in config. Values will be initialized to default values.
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 944/944 [00:00<00:00, 846573.23it/s]
03/15/2024 00:25:58 - INFO - __main__ - ***** Running training *****
03/15/2024 00:25:58 - INFO - __main__ -   Num examples = 943
03/15/2024 00:25:58 - INFO - __main__ -   Num Epochs = 64
03/15/2024 00:25:58 - INFO - __main__ -   Instantaneous batch size per device = 1
03/15/2024 00:25:58 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
03/15/2024 00:25:58 - INFO - __main__ -   Gradient Accumulation steps = 4
03/15/2024 00:25:58 - INFO - __main__ -   Total optimization steps = 15000
Steps:   0%|                                                                                                                                                 | 0/15000 [00:02<?, ?it/s, lr=0.0001, step_loss=0.274]Traceback (most recent call last):
  File "~/sd2/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 976, in <module>
    main()
  File "~/sd2/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 803, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "~/sd2/sd2env/lib/python3.10/site-packages/accelerate/accelerator.py", line 2145, in clip_grad_norm_
    self.unscale_gradients()
  File "~/sd2/sd2env/lib/python3.10/site-packages/accelerate/accelerator.py", line 2095, in unscale_gradients
    self.scaler.unscale_(opt)
  File "~/sd2/sd2env/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 336, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "~/sd2/sd2env/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 258, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps:   0%|                                                                                                                                                 | 0/15000 [00:02<?, ?it/s, lr=0.0001, step_loss=0.274]
Traceback (most recent call last):
  File "~/sd2/sd2env/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "~/sd2/sd2env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "~/sd2/sd2env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1057, in launch_command
    simple_launcher(args)
  File "~/sd2/sd2env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 673, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['~/sd2/sd2env/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=~/sdhf/training/set1', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=./set1lora', '--checkpointing_steps=500', '--validation_prompt=A red sports motorcycle at the starting position of a racetrack.', '--seed=1337']' returned non-zero exit status 1.

System Info

diffusers version: 0.27.0.dev0
Platform: Linux-6.5.0-25-generic-x86_64-with-glibc2.35
OS: Ubuntu 22.04
Python version: 3.10.12
PyTorch version (GPU?): 2.2.1+cu121 (True)
Huggingface_hub version: 0.21.4
Transformers version: 4.38.2
Accelerate version: 0.28.0
xFormers version: not installed
Using GPU in script?: Nvidia RTX 3060 Laptop 6GB (GA106M)
Using distributed or parallel set-up in script?: NO

Who can help?

@sayakpaul

sayakpaul commented 7 months ago

You need to apply the fix. Refer here: https://github.com/huggingface/diffusers/issues/6552.
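For context, the error comes from torch.cuda.amp.GradScaler: unscale_ refuses to operate on gradients that are themselves fp16, which is exactly what you get when the trainable LoRA parameters are left in half precision. A minimal sketch that reproduces the same ValueError outside the training script (the Linear layer and tensor shapes are illustrative assumptions, not taken from the script):

import torch

# fp16 parameters produce fp16 gradients, which GradScaler cannot unscale
model = torch.nn.Linear(4, 4, device="cuda", dtype=torch.float16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

loss = model(torch.randn(2, 4, device="cuda", dtype=torch.float16)).sum()
scaler.scale(loss).backward()

# raises ValueError: Attempting to unscale FP16 gradients.
scaler.unscale_(optimizer)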

sayakpaul commented 7 months ago

Hi @HemalPatil. Did you get to try out the changes from https://github.com/huggingface/diffusers/issues/6552 in the concerned script?

HemalPatil commented 7 months ago

Well, given that I'm a noob in AI and haven't found the time lately, no, I haven't tried it out yet. Maybe over this weekend.

jcRisch commented 7 months ago

@HemalPatil, to make it work you need to also pass --mixed_precision="fp16" in the args of your script (train_text_to_image_lora.py), not just to accelerate launch. This was suggested in the following comment: https://github.com/huggingface/diffusers/issues/6363#issuecomment-1870761866

Example:

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --mixed_precision="fp16" \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir=${OUTPUT_DIR} \
  --validation_prompt="a menacing skull with sunglasses." --report_to="wandb"
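For reference, the reason the flag has to reach the script itself (and not just accelerate) is that args.mixed_precision gates an upcast of the trainable parameters inside train_text_to_image_lora.py: the frozen base weights stay in fp16, but the LoRA weights are moved back to fp32 so the GradScaler only ever unscales fp32 gradients. A rough sketch of that pattern, following the fix discussed in https://github.com/huggingface/diffusers/issues/6552 (the exact placement inside main() is assumed):

import torch
from diffusers.training_utils import cast_training_params

# after the LoRA adapters have been added to the unet:
if args.mixed_precision == "fp16":
    # upcast only the trainable (LoRA) parameters to fp32; the frozen
    # base weights stay in fp16, so memory use barely changes while
    # GradScaler now sees fp32 gradients and can unscale them
    cast_training_params(unet, dtype=torch.float32)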

cc @sayakpaul

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Yifei-Wang28 commented 3 months ago

@jcRisch Many thanks, this resolved the issue for me!