Can you try with the latest script from main with peft installed?
https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py
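Before re-running, here is a quick stdlib-only sanity check (the printed messages are only illustrative) that peft is actually installed in the environment launching the script:

import importlib.metadata as metadata

# Confirm that peft is installed in the same environment used for training.
try:
    print("peft version:", metadata.version("peft"))
except metadata.PackageNotFoundError:
    print("peft is not installed; install it with pip before running the script")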
I ran into the same issue (but on SDXL and making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55fb36acc862254cf935826d54349b0fcd8c).
Same error with peft installed for me:
12/07/2023 18:49:55 - INFO - __main__ - Running training
12/07/2023 18:49:55 - INFO - __main__ - Num examples = 5
12/07/2023 18:49:55 - INFO - __main__ - Num batches each epoch = 5
12/07/2023 18:49:55 - INFO - __main__ - Num Epochs = 134
12/07/2023 18:49:55 - INFO - __main__ - Instantaneous batch size per device = 1
12/07/2023 18:49:55 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 2
12/07/2023 18:49:55 - INFO - __main__ - Gradient Accumulation steps = 2
12/07/2023 18:49:55 - INFO - __main__ - Total optimization steps = 400
Steps: 0%| | 0/400 [00:01<?, ?it/s, loss=0.036, lr=0.0001]
Traceback (most recent call last):
File "/root/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1716, in <module>
I ran into the same issue (but on SDXL and making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55f).
Is this the right way to install that version? I have the same issue there.
• git clone https://github.com/huggingface/diffusers.git
• cd diffusers
• git checkout dadd55f
@bekkblando @bluusun
Can you provide a Colab Notebook that reproduces this error? I have been using these scripts recently and haven't seen any such thing happening TBH.
Went back via git checkout to https://github.com/huggingface/diffusers/commit/dadd55fb36acc862254cf935826d54349b0fcd8c and the error is gone, so it must be from the last release :)
Here are my script parameters:
export VAE_PATH="madebyollin/sdxl-vae-fp16-fix"
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"

accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir="/root/input" \
  --pretrained_vae_model_name_or_path=$VAE_PATH \
  --output_dir="/root/model" \
  --instance_prompt="$i_prompt" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=2 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps="$max_model_steps" \
  --seed="0" \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --checkpointing_steps=1000000
This may be related to: https://github.com/huggingface/diffusers/issues/5368#issuecomment-1812060060
Thanks for looking into this!
Oh, and when not using mixed_precision I get this error with the latest release (it works in prior releases):
Loading unet.
Traceback (most recent call last):
File "/root/diffusers/examples/dreambooth/train_dreambooth_lora_sdxl.py", line 1716, in <module>
@sayakpaul Here ya go: https://colab.research.google.com/drive/1gTMH81B7yojA1mZAkl88fwRB_OQ9c1Zv?usp=sharing
When I change !wget https://raw.githubusercontent.com/huggingface/diffusers/main/examples/dreambooth/train_dreambooth_lora_sdxl.py
to !wget https://raw.githubusercontent.com/huggingface/diffusers/dadd55fb36acc862254cf935826d54349b0fcd8c/examples/dreambooth/train_dreambooth_lora_sdxl.py
it works. Also note that I did a patch for mixed precision; not sure if it's related to this problem: https://github.com/huggingface/diffusers/issues/5368#issuecomment-1792815008.
Let me know if I'm doing anything wrong, thanks!
Sorry, I tried it again today with the latest script and the issue still exists. Please help me check the Colab link: https://colab.research.google.com/drive/1K3MPyS2u3s4MSzNuRGxK7jByrh3oR2ke?usp=sharing (this link cannot be opened directly, but can be accessed by copying it into the browser).
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
I used the latest script [train_text_to_image_lora.py]. @sayakpaul @bekkblando
I can reproduce this error: https://colab.research.google.com/gist/sayakpaul/e4dfe903bf945c039a513f3f9145e3ee/scratchpad.ipynb
The problem persists even when using a source installation of accelerate.
Ccing @muellerzr @SunMarc here.
Also cc @younesbelkada for awareness.
Found a fix, will open a PR.
Hi, I am also encountering this problem when the loss back propagates.
https://github.com/huggingface/diffusers/pull/6119 should fix it.
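For context, a minimal sketch of the usual remedy for this error (assuming the fix is along these lines; see the PR itself for the actual change, and note that cast_trainable_params_to_fp32 is a hypothetical helper written only for illustration): torch's GradScaler refuses to unscale gradients of fp16 parameters, so the trainable LoRA parameters are kept in float32 while the frozen base weights stay in fp16.

import torch

def cast_trainable_params_to_fp32(model: torch.nn.Module) -> None:
    # GradScaler.unscale_() raises "Attempting to unscale FP16 gradients." when the
    # optimizer holds fp16 parameters, so upcast only the trainable (LoRA) parameters
    # and leave the frozen fp16 base weights untouched.
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)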
This error still occurs as of now.
Also, the dataset creation does not take data_files as an input, so the caption file is missing when it is passed as metadata.jsonl. Here is the relevant code in train_dreambooth_lora_sdxl.py:
dataset = load_dataset(
args.dataset_name,
args.dataset_config_name,
cache_dir=args.cache_dir,
)
It shows the error:
ValueError: `--caption_column` value 'text' not found in dataset columns. Dataset columns are: image.
@sayakpaul can you check this? I tried your latest fix branch code too, and it fails here with the above error because the datasets library is unable to pick up the metadata.jsonl file (see the sketch below). Correct me if I am getting this wrong.
Tried branch: fix/lora-training
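As an aside, a minimal sketch (not the script's own code) of how the datasets library can pick up captions from a metadata.jsonl file via its imagefolder loader; the data_dir path is taken from earlier in this thread and is otherwise just an example:

from datasets import load_dataset

# The imagefolder builder reads metadata.jsonl next to the images and exposes its
# fields (e.g. "text") as extra columns alongside "image".
dataset = load_dataset(
    "imagefolder",
    data_dir="/root/input",  # local folder containing the images plus metadata.jsonl
    split="train",
)
print(dataset.column_names)  # expected: ["image", "text"] when metadata.jsonl has a "text" field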
@sayakpaul @bekkblando
The recent PR fixes it.
Could not get past this issue. I used the same Colab for this. It happens because the datasets library is not provided with the caption file when using load_dataset().
@haofanwang do let me know if this is not an error for you. How did you get past this? Can you share your Colab link?
I ran into the same issue (but on SDXL and making a LoRA with DreamBooth) and had some luck by switching back to the prior commit (dadd55f).
This fixed it.
Make sure to also uninstall peft, otherwise it raises "AttributeError: 'Linear' object has no attribute 'set_lora_layer'".
#6119 should fix it.
I ran the script on 4 * A800 GPU, PyTorch 2.1.1 and CUDA 12.1, and it produced the following error in DiffusionPipeline:
RuntimeError: Input type (c10::Half) and bias type (float) should be the same.
It seems the cause is consistent with what was pointed out in #4796, which modified the SDXL pipelines so that vae.dtype and latents.dtype match. It works for me to change line 861 to use StableDiffusionPipeline and to modify pipeline_stable_diffusion.py lines 957-978 following StableDiffusionXLPipeline:
if not output_type == "latent":
    # image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
    #     0
    # ]
    # ==== script from StableDiffusionXLPipeline ====
    # make sure the VAE is in float32 mode, as it overflows in float16
    needs_upcasting = self.vae.dtype == torch.float16 and self.vae.config.force_upcast
    if needs_upcasting:
        # self.upcast_vae()
        latents = latents.to(next(iter(self.vae.post_quant_conv.parameters())).dtype)
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[0]
    # cast back to fp16 if needed
    # if needs_upcasting:
    #     self.vae.to(dtype=torch.float16)
    # ==== script from StableDiffusionXLPipeline ====
    image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
else:
    image = latents
    has_nsfw_concept = None
I'm not sure whether this error can be addressed without modifying the source code.
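For reference, a standalone sketch of the dtype mismatch described above and of the upcasting workaround (assuming a float32 VAE and fp16 latents; the model id and tensor shape are illustrative only):

import torch
from diffusers import AutoencoderKL

# Load the SD 1.5 VAE with its default float32 weights.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# Stand-in for fp16 latents coming out of a mixed-precision denoising loop.
latents = torch.randn(1, 4, 64, 64, dtype=torch.float16)

# Decoding fp16 latents with a float32 VAE triggers
# "Input type (c10::Half) and bias type (float) should be the same".
# Casting the latents to the VAE's parameter dtype first (as in the snippet above) avoids it.
latents = latents.to(next(vae.parameters()).dtype)
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor, return_dict=False)[0]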
train_text_to_image_lora_sdxl has the same error, and the potential error "Input type (c10::Half) and bias type (float) should be the same" may arise while it is being resolved.
I will fix it.
FYI, this error is still occurring for me when using the examples/advanced/train_dreambooth_lora_sdxl_advanced.py script, but only when resuming from a checkpoint, not when training from scratch.
Traceback (most recent call last):
File ".../train_dreambooth_lora_sdxl_advanced.py", line 2111, in <module>
main(args)
File ".../train_dreambooth_lora_sdxl_advanced.py", line 1866, in main
accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File ".../miniconda3/envs/sdxl2/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
I'm on Ubuntu 22, using an RTX 3090, and the latest commit (9d945b2)
That is a separate script and you should report a separate issue for that :-)
Please tag @linoytsaban there.
Describe the bug
12/07/2023 07:37:24 - INFO - __main__ - Running training
12/07/2023 07:37:24 - INFO - __main__ - Num examples = 833
12/07/2023 07:37:24 - INFO - __main__ - Num Epochs = 72
12/07/2023 07:37:24 - INFO - __main__ - Instantaneous batch size per device = 1
12/07/2023 07:37:24 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 4
12/07/2023 07:37:24 - INFO - __main__ - Gradient Accumulation steps = 4
12/07/2023 07:37:24 - INFO - __main__ - Total optimization steps = 15000
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 960, in <module>
main()
File "/content/diffusers/examples/text_to_image/train_text_to_image_lora.py", line 798, in main
accelerator.clip_grad_norm_(params_to_clip, args.max_grad_norm)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2040, in clip_grad_norm_
self.unscale_gradients()
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2003, in unscale_gradients
self.scaler.unscale_(opt)
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 307, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/amp/grad_scaler.py", line 229, in _unscale_grads_
raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.
Steps: 0% 0/15000 [00:03<?, ?it/s, lr=0.0001, step_loss=0.126]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1017, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 637, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_text_to_image_lora.py', '--pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5', '--dataset_name=lambdalabs/pokemon-blip-captions', '--dataloader_num_workers=8', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--max_train_steps=15000', '--learning_rate=1e-04', '--max_grad_norm=1', '--lr_scheduler=cosine', '--lr_warmup_steps=0', '--output_dir=/sddata/finetune/lora/pokemon', '--push_to_hub', '--hub_model_id=pokemon-lora', '--report_to=wandb', '--checkpointing_steps=500', '--validation_prompt=A pokemon with blue eyes.', '--seed=1337']' returned non-zero exit status 1.
Reproduction
!git clone https://github.com/huggingface/diffusers
%cd diffusers
!pip install .
%cd examples/text_to_image
!pip install -r requirements.txt
!accelerate config default
!pip install huggingface_hub wandb
from huggingface_hub import HfFolder, login
Log in with the Hugging Face API key
login(token='hf_tlt---------BRqMBjwdi')
Set the WandB API key
import wandb
wandb.login(key='b6a210-------------7f543c')
Run the training script
!accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" \
  --dataset_name="lambdalabs/pokemon-blip-captions" \
  --dataloader_num_workers=8 \
  --resolution=512 \
  --center_crop \
  --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=15000 \
  --learning_rate=1e-04 \
  --max_grad_norm=1 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=0 \
  --output_dir="/sddata/finetune/lora/pokemon" \
  --push_to_hub \
  --hub_model_id="pokemon-lora" \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_prompt="A pokemon with blue eyes." \
  --seed=1337
Logs
System Info
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU @ 2.20GHz
stepping : 0
microcode : 0xffffffff
cpu MHz : 2199.998
cache size : 56320 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips : 4399.99
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
Who can help?
@sayakpaul @patrickvonplaten