train_text_to_image_lora.py raises ValueError("Attempting to unscale FP16 gradients.")

billvsme commented 8 months ago

Describe the bug

While working through the examples/text_to_image documentation, I ran train_text_to_image_lora.py using the command from the docs. The run fails with ValueError("Attempting to unscale FP16 gradients.").

I think the error is caused by the code linked below, which uses args.mixed_precision to decide whether to convert the LoRA parameters to float32. The default value of args.mixed_precision is None, and the README example sets mixed precision only through accelerate (accelerate launch --mixed_precision="fp16"), not through the script's own --mixed_precision argument. The conversion is therefore skipped, which leads to the "Attempting to unscale FP16 gradients." error. https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L468-L472

It might be better to use accelerator.mixed_precision for this check instead.
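
The mismatch can be modeled in a few lines. This is a minimal sketch, not the script's actual code: `build_parser` and `needs_fp32_upcast` are hypothetical names, and the plain equality check against "fp16" is an assumption about what the linked lines do.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the training script's parser: the script-level
    # --mixed_precision flag defaults to None when it is not passed.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mixed_precision", default=None, choices=["no", "fp16", "bf16"]
    )
    return parser


def needs_fp32_upcast(precision):
    # Assumed form of the check: LoRA parameters are upcast to float32
    # only when the inspected value is exactly "fp16".
    return precision == "fp16"


# README-style invocation: fp16 is given to `accelerate launch` only, so the
# script itself receives no --mixed_precision flag.
script_args = build_parser().parse_args([])
launcher_precision = "fp16"  # the value the launcher was given

# Checking the script argument skips the upcast (the reported failure mode);
# checking the launcher's value would not.
upcast_from_args = needs_fp32_upcast(script_args.mixed_precision)
upcast_from_launcher = needs_fp32_upcast(launcher_precision)
print(upcast_from_args, upcast_from_launcher)
```

With the README command, only the second check would trigger the float32 conversion, which is why reading the precision from the accelerator rather than from args looks like the safer choice.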

Reproduction

cd diffusers/examples/text_to_image/

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=100 --checkpointing_steps=5000 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora" \
  --validation_prompt="cute dragon creature"

Logs

Steps:   0%|                                          | 0/20900 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 945, in <module>
    main()
  File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 774, in main
    accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
    self.unscale_gradients()
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
  File "/home/billvsme/venv/train/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

System Info

diffusers version: 0.25.0.dev0
Platform: Linux-6.2.0-39-generic-x86_64-with-glibc2.35
Python version: 3.10.13
PyTorch version (GPU?): 2.1.2+cu121 (True)
Huggingface_hub version: 0.19.4
Transformers version: 4.36.2
Accelerate version: 0.25.0
xFormers version: 0.0.22.post7
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@sayakpaul

sayakpaul commented 8 months ago

A better way would be to assign args.mixed_precision from accelerator.mixed_precision.

However, when you initialize an Accelerator object you pass the value from args.mixed_precision itself:

https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L385

So, passing mixed_precision to your CLI args is recommended.

billvsme commented 8 months ago

@sayakpaul 👌, thanks

However, I found a difference between train_text_to_image.py and train_text_to_image_lora.py: train_text_to_image_lora.py does not reassign args.mixed_precision. As a result, if you pass --mixed_precision="fp16" to accelerate launch, you also need to pass the same --mixed_precision="fp16" to the script's CLI args. Only then does the error go away, like this:

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --mixed_precision="fp16" \
  ......

train_text_to_image.py: https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image.py#L811-L816

train_text_to_image_lora.py: https://github.com/huggingface/diffusers/blob/1fff527702399165f09dd880be43cfd8b8bae472/examples/text_to_image/train_text_to_image_lora.py#L444-L448
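
The difference between the two scripts can be sketched roughly as follows. `StubAccelerator` is a hypothetical stand-in, not the accelerate API: it assumes that an explicit `mixed_precision` argument wins and that the launcher's choice is otherwise read from the `ACCELERATE_MIXED_PRECISION` environment variable. `sync_precision` is likewise a hypothetical helper that models the reassignment done in train_text_to_image.py.

```python
import os
from types import SimpleNamespace


class StubAccelerator:
    """Hypothetical stand-in for accelerate.Accelerator (not the real class)."""

    def __init__(self, mixed_precision=None, env=None):
        env = os.environ if env is None else env
        # Assumption: an explicit argument takes priority; otherwise the value
        # set by `accelerate launch` is read from the environment.
        if mixed_precision is None:
            mixed_precision = env.get("ACCELERATE_MIXED_PRECISION", "no")
        self.mixed_precision = mixed_precision


def sync_precision(args, accelerator):
    # Model of the reassignment: copy the resolved precision back onto args so
    # that later checks on args.mixed_precision see the effective value.
    args.mixed_precision = accelerator.mixed_precision
    return args


# The launcher was given fp16; the script got no --mixed_precision flag.
launcher_env = {"ACCELERATE_MIXED_PRECISION": "fp16"}
args = SimpleNamespace(mixed_precision=None)
accelerator = StubAccelerator(mixed_precision=args.mixed_precision, env=launcher_env)

before = args.mixed_precision  # still None without the reassignment
after = sync_precision(args, accelerator).mixed_precision
print(before, after)
```

Under these assumptions, a script that skips the reassignment keeps seeing None in args.mixed_precision even though the accelerator itself runs in fp16, which matches the need to pass the flag twice.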

billvsme commented 8 months ago

Maybe the example in the docs needs to be updated

https://github.com/huggingface/diffusers/tree/main/examples/text_to_image

sayakpaul commented 8 months ago

Should be fixed with: https://github.com/huggingface/diffusers/issues/6388. Could you pull the changes and try again? :)

AfrinaVT commented 7 months ago

Hi @sayakpaul , The problem with running train_text_to_image_lora.py still persists for me. I have pulled the latest changes from the GitHub repo.

sayakpaul commented 7 months ago

Could you maybe refer to https://github.com/huggingface/diffusers/issues/6552 and open a PR?

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

yiyixuxu commented 6 months ago

can we close this one now?

blueclowd commented 2 months ago

I encountered the same issue on diffusers==0.30.0.dev0. The additional CLI argument works on this version as well.

lino-levan commented 1 month ago

Just encountered this issue. Not stale.
