Can you try with the latest script from main with peft installed?
https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py
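Before re-running, here is a quick stdlib-only sanity check (the printed messages are only illustrative) that peft is actually installed in the environment launching the script:

import importlib.metadata as metadata

# Confirm that peft is installed in the same environment used for training.
try:
    print("peft version:", metadata.version("peft"))
except metadata.PackageNotFoundError:
    print("peft is not installed; install it with pip before running the script")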
I ran into the same issue (but on SDXL and making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55fb36acc862254cf935826d54349b0fcd8c).
Same error with peft installed for me:
12/07/2023 18:49:55 - INFO - __main__ - Running training
12/07/2023 18:49:55 - INFO - __main__ - Num examples = 5
12/07/2023 18:49:55 - INFO - __main__ - Num batches each epoch = 5
12/07/2023 18:49:55 - INFO - __main__ - Num Epochs = 134
12/07/2023 18:49:55 - INFO - __main__ - Instantaneous batch size per device = 1
12/07/2023 18:49:55 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 2
12/07/2023 18:49:55 - INFO - __main__ - Gradient Accumulation steps = 2
12/07/2023 18:49:55 - INFO - __main__ - Total optimization steps = 400
Steps: 0%| | 0/400 [00:01<?, ?it/s, loss=0.036, lr=0.0001]
Traceback (most recent call last):
File "/root/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1716, in <module>
I ran into the same issue (but on SDXL and making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55f).
Is this the right way to install that version? I have the same issue there.
• git clone https://github.com/huggingface/diffusers.git
• cd diffusers
• git checkout dadd55f
@bekkblando @bluusun
Can you provide a Colab Notebook that reproduces this error? I have been using these scripts recently and haven't seen any such thing happening TBH.
Went back via git checkout to https://github.com/huggingface/diffusers/commit/dadd55fb36acc862254cf935826d54349b0fcd8c and the error is gone, so it must be from the last release :)
Here are my script parameters:
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir="/root/input" \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir="/root/model" \
  --instance_prompt="$i_prompt" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps="$max_model_steps" \
  --seed="0" \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --checkpointing_steps=1000000
This may be related to: https://github.com/huggingface/diffusers/issues/5368#issuecomment-1812060060
Thanks for looking into this!
Oh, and when not using mixed_precision I get this error with the latest release (it works in prior releases):
Loading unet.
Traceback (most recent call last):
File "/root/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1716, in <module>
@sayakpaul Here ya go: https://colab.research.google.com/drive/1gTMH81B7yojA1mZAkl88fwRB_OQ9c1Zv?usp=sharing
When I change !wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py
to !wget https://raw.githubusercontent.com/huggingface/diffusers/dadd55fb36acc862254cf935826d54349b0fcd8c/examples/dreambooth/train_dreambooth_lora_sdxl.py
it works. Also note that I did a patch for mixed precision; not sure if it's related to this problem: https://github.com/huggingface/diffusers/issues/5368#issuecomment-1792815008.
Let me know if I'm doing anything wrong, thanks!
Sorry, I tried it again today with the latest script and the issue still exists. Please help me check the Colab link: https://colab.research.google.com/drive/1K3MPyS2u3s4MSzNuRGxK7jByrh3oR2ke?usp=sharing (this link cannot be opened directly, but can be accessed by copying it into the browser).
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
I used the latest script [train_text_to_image_lora.py]. @sayakpaul @bekkblando
I can reproduce this error: https://colab.research.google.com/gist/sayakpaul/e4dfe903bf945c039a513f3f9145e3ee/scratchpad.ipynb
The problem persists even when using a source installation of accelerate.
Ccing @muellerzr @SunMarc here.
Also cc @younesbelkada for awareness.
Found a fix, will open a PR.
Hi, I am also encountering this problem when the loss back propagates.
https://github.com/huggingface/diffusers/pull/6119 should fix it.
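For context, a minimal sketch of the usual remedy for this error (assuming the fix is along these lines; see the PR itself for the actual change, and note that cast_trainable_params_to_fp32 is a hypothetical helper written only for illustration): torch's GradScaler refuses to unscale gradients of fp16 parameters, so the trainable LoRA parameters are kept in float32 while the frozen base weights stay in fp16.

import torch

def cast_trainable_params_to_fp32(model: torch.nn.Module) -> None:
    # GradScaler.unscale_() raises "Attempting to unscale FP16 gradients." when the
    # optimizer holds fp16 parameters, so upcast only the trainable (LoRA) parameters
    # and leave the frozen fp16 base weights untouched.
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)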
This error still occurs as of now.
Also, the dataset creation does not take data_files as an input, so the caption file is missing when it is passed as metadata.jsonl. Here is the relevant code in train_dreambooth_lora_sdxl.py:
dataset = load_dataset(
args.dataset_name,
args.dataset_config_name,
cache_dir=args.cache_dir,
)
It shows the error:
ValueError: `--caption_column` value 'text' not found in dataset columns. Dataset columns are: image.
@sayakpaul can you check this? I tried your latest fix branch code too, and it fails here with the above error because the datasets library is unable to pick up the metadata.jsonl file (see the sketch below). Correct me if I am getting this wrong.
Tried branch: fix/lora-training
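As an aside, a minimal sketch (not the script's own code) of how the datasets library can pick up captions from a metadata.jsonl file via its imagefolder loader; the data_dir path is taken from earlier in this thread and is otherwise just an example:

from datasets import load_dataset

# The imagefolder builder reads metadata.jsonl next to the images and exposes its
# fields (e.g. "text") as extra columns alongside "image".
dataset = load_dataset(
    "imagefolder",
    data_dir="/root/input",  # local folder containing the images plus metadata.jsonl
    split="train",
)
print(dataset.column_names)  # expected: ["image", "text"] when metadata.jsonl has a "text" field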
@sayakpaul @bekkblando
The recent PR fixes it.
Could not get past this issue. I used the same Colab for this. It happens because the datasets library is not provided with the caption file when using load_dataset().
@haofanwang do let me know if this is not an error for you. How did you get past this? Can you share your Colab link?
I ran into the same issue (but on SDXL and making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55f).
This fixed it.
Make sure to also uninstall peft, otherwise it raises "AttributeError: 'Linear' object has no attribute 'set_lora_layer'".
#6119 should fix it.
I ran the script on 4 * A800 GPU, PyTorch 2.1.1 and CUDA 12.1, and it produced the following error in DiffusionPipeline:
RuntimeError: Input type (c10::Half) and bias type (float) should be the same.
It seems the cause is consistent with what was pointed out in #4796, which modified the SDXL pipelines so that vae.dtype and latents.dtype match. It works for me to change line 861 to use StableDiffusionPipeline and to modify pipeline_stable_diffusion.py lines 957-978 following StableDiffusionXLPipeline:
if not output_type == "latent":
    # image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
    #     0
    # ]
    # ==== script from StableDiffusionXLPipeline ====
    # make sure the VAE is in float32 mode, as it overflows in float16
    needs_upcasting = self.vae.dtype == torch.float16 and self.vae.config.force_upcast
    if needs_upcasting:
        # self.upcast_vae()
        latents = latents.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]
    # cast back to fp16 if needed
    # if needs_upcasting:
    #     self.vae.to(dtype=torch.float16)
    # ==== script from StableDiffusionXLPipeline ====
    image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
else:
    image = latents
    has_nsfw_concept = None
I'm not sure whether this error can be addressed without modifying the source code.
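For reference, a standalone sketch of the dtype mismatch described above and of the upcasting workaround (assuming a float32 VAE and fp16 latents; the model id and tensor shape are illustrative only):

import torch
from diffusers import AutoencoderKL

# Load the SD 1.5 VAE with its default float32 weights.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# Stand-in for fp16 latents coming out of a mixed-precision denoising loop.
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16)

# Decoding fp16 latents with a float32 VAE triggers
# "Input type (c10::Half) and bias type (float) should be the same".
# Casting the latents to the VAE's parameter dtype first (as in the snippet above) avoids it.
latents = latents.to(next(vae.parameters()).dtype)
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]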
train_text_to_image_lora_sdxl has the same error, and the potential error "Input type (c10::Half) and bias type (float) should be the same" may arise while it is being resolved.
I will fix it.
FYI, this error is still occurring for me when using the examples/advanced/train_dreambooth_lora_sdxl_advanced.py script, but only when resuming from a checkpoint, not when training from scratch.
Traceback (most recent call last):
File ".../train_dreambooth_lora_sdxl_advanced.py", line 2111, in <module>
main(args)
File ".../train_dreambooth_lora_sdxl_advanced.py", line 1866, in main
accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
I'm on Ubuntu 22, using an RTX 3090, and the latest commit (9d945b2)
That is a separate script and you should report a separate issue for that :-)
Please tag @linoytsaban there.
Describe the bug
12/07/2023 07:37:24 - INFO - __main__ - Running training
12/07/2023 07:37:24 - INFO - __main__ - Num examples = 833
12/07/2023 07:37:24 - INFO - __main__ - Num Epochs = 72
12/07/2023 07:37:24 - INFO - __main__ - Instantaneous batch size per device = 1
12/07/2023 07:37:24 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
12/07/2023 07:37:24 - INFO - __main__ - Gradient Accumulation steps = 4
12/07/2023 07:37:24 - INFO - __main__ - Total optimization steps = 15000
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 960, in <module>
main()
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 798, in main
accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/sddata/finetune/lora/pokemon', '--push_to_hub', '--hub_model_id=pokemon-lora', '--report_to=wandb', '--checkpointing_steps=500', '--validation_prompt=A pokemon with blue eyes.', '--seed=1337']' returned non-zero exit status 1.
Reproduction
!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .
%cd examples/text_to_image
!pip install -r requirements.txt
!accelerate config default
!pip install huggingface_hub wandb
from huggingface_hub import HfFolder, login
Log in with the Hugging Face API key
login(token='hf_tlt---------BRqMBjwdi')
Set the WandB API key
import wandb
wandb.login(key='b6a210-------------7f543c')
Run the training script
!accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir="/sddata/finetune/lora/pokemon" \
  --push_to_hub \
  --hub_model_id="pokemon-lora" \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337
Logs
System Info
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0xffffffff
cpu MHz : 2199.998
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips : 4399.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Who can help?
@sayakpaul @patrickvonplaten