kohya-ss / sd-scripts

Apache License 2.0

SD3.5: training stops with an error when Scheduled Huber Loss is used as the loss_type. #1809

Closed waomodder closed 23 hours ago

waomodder commented 1 day ago

Sorry to file yet another bug report while you are busy. I am currently experimenting with ways to train multiple concepts well on SD3.5, and I ran into a bug, so I am reporting it here.

When I specified smooth_l1 for --loss_type with SD3.5, an error occurred and training stopped. The relevant part of the log follows.

D:\Lora_learning3\sd-scripts\venv\Lib\site-packages\transformers\models\clip\modeling_clip.py:480: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
  File "D:\Lora_learning3\sd-scripts\sd3_train_network.py", line 480, in <module>
    trainer.train(args)
  File "D:\Lora_learning3\sd-scripts\train_network.py", line 1209, in train
    loss = train_util.conditional_loss(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Lora_learning3\sd-scripts\library\train_util.py", line 5899, in conditional_loss
    huber_c = huber_c.view(-1, 1, 1, 1)
              ^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'view'
steps:   0%|                                                                                                                                                                           | 0/3000 [00:52<?, ?it/s]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\Lora_learning3\sd-scripts\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\Lora_learning3\sd-scripts\venv\Lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "D:\Lora_learning3\sd-scripts\venv\Lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "D:\Lora_learning3\sd-scripts\venv\Lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\\Lora_learning3\\sd-scripts\\venv\\Scripts\\python.exe', 'sd3_train_network.py', '--network_module', 'networks.lora_sd3', '--config_file', 'D:\\Lora_learning\\Data\\SD3_command03.toml']' returned non-zero exit status 1.
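The traceback boils down to huber_c never being computed on the SD3 code path: train_util.conditional_loss receives None and then calls .view() on it, which produces the AttributeError above. The following is a minimal sketch (hypothetical function name, and a generic pseudo-Huber formula rather than the library's exact one) of what a guarded, scheduled Huber-family loss looks like:

```python
import torch

def scheduled_pseudo_huber_loss(pred, target, huber_c):
    """Sketch of a per-sample scheduled (pseudo-)Huber loss.

    huber_c is expected as a (batch,) tensor produced by the Huber
    schedule. On the SD3/FLUX path it is never computed, so it arrives
    as None and huber_c.view(-1, 1, 1, 1) raises the AttributeError
    seen in the log; the guard below fails with a clearer message.
    """
    if huber_c is None:
        raise ValueError(
            "huber_c is None: a Huber-family loss_type is not supported on this code path"
        )
    # Reshape (B,) -> (B, 1, 1, 1) so it broadcasts over (B, C, H, W) latents.
    huber_c = huber_c.view(-1, 1, 1, 1)
    diff = pred - target
    # Pseudo-Huber: quadratic near zero, approximately linear for large residuals.
    return 2 * huber_c * (torch.sqrt(diff * diff + huber_c * huber_c) - huber_c)

pred = torch.zeros(2, 4, 8, 8)
target = torch.ones(2, 4, 8, 8)
loss = scheduled_pseudo_huber_loss(pred, target, torch.full((2,), 0.1))
print(loss.shape)  # torch.Size([2, 4, 8, 8])
```

With loss_type = "smooth_l1" on SD3.5, the schedule that would fill in huber_c simply never runs, so the crash happens at the .view() call regardless of the huber_c / huber_schedule values in the TOML below.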

The TOML used for training:

clip_l = "D:/ComfyUI_windows_portable/ComfyUI/models/clip/clip_l.safetensors"
clip_g = "D:/ComfyUI_windows_portable/ComfyUI/models/clip/clip_g.safetensors"
t5xxl = "D:/ComfyUI_windows_portable/ComfyUI/models/clip/t5xxl_fp16.safetensors"
pretrained_model_name_or_path = "D:/ComfyUI_windows_portable/ComfyUI/models/checkpoints/sd3.5_large.safetensors"
train_data_dir = "D:/Lora_learning/Data/asset/super_robot_diffusion_XL_V3/multi_Class_test"
sample_prompts = "D:/Lora_learning/Data/output/prompt.txt"
output_dir = "D:/Lora_learning/Data/output"
output_name = "srdmk3_MC_t11"
unet_lr = 2e-4
text_encoder_lr = [ 5e-5, 5e-5, 5e-5]
train_batch_size = 1
gradient_accumulation_steps = 4
max_train_epochs = 20
save_every_n_epochs = 1
save_every_n_steps = 1000
network_dim = 32
network_alpha = 16
context_attn_dim = 32
context_mlp_dim = 32
x_attn_dim = 32
x_mlp_dim = 32
context_mod_dim = 0
x_mod_dim = 0
apply_t5_attn_mask = true
bucket_no_upscale = true
bucket_reso_steps = 64
cache_latents = true
cache_latents_to_disk = true
cache_text_encoder_outputs = true
cache_text_encoder_outputs_to_disk = true
caption_extension = ".txt"
clip_skip = 0
dynamo_backend = "eager"
enable_bucket = true
fp8_base = true
gradient_checkpointing = true
guidance_scale = 1.0
highvram = true
huber_c = 0.1
huber_schedule = "snr"
ip_noise_gamma = 0.1
ip_noise_gamma_random_strength = true
loss_type = "smooth_l1"
lr_scheduler = "cosine"
lr_scheduler_args = []
lr_scheduler_num_cycles = 1
lr_scheduler_power = 1
lr_warmup_steps = 200
max_bucket_reso = 2048
max_data_loader_n_workers = 4
max_grad_norm = 0.01
max_timestep = 1000
max_token_length = 225
min_bucket_reso = 256
mixed_precision = "bf16"
network_args = ["train_t5xxl=False", "emb_dims=[0,0,0,0,0,0]", "train_block_indices=12-24,30-37"]
network_module = "networks.lora_sd3"
noise_offset = 0
noise_offset_type = "Original"
optimizer_args = ["weight_decay=0.01", "betas=0.9,0.999", "eps=0.000001",]
optimizer_type = "AdamW8bit"
persistent_data_loader_workers = 1
prior_loss_weight = 1
resolution = "1024,1024"
sample_sampler = "euler_a"
save_model_as = "safetensors"
save_precision = "fp16"
save_state_on_train_end = true
sdpa = true
seed = 42

The same error has reportedly also occurred with Flux: https://github.com/Akegarasu/lora-scripts/issues/565

kohya-ss commented 1 day ago

At the moment, Huber Loss is not supported for SD3/FLUX. PR #1808 has just been submitted, so once it is merged this should be resolved. Thank you.

waomodder commented 23 hours ago

Kohya-san, I apologize for jumping the gun with this bug report. I had not realized this feature was still unimplemented. I will try it again once it has been implemented. Sorry for the trouble.