kohya-ss / sd-scripts

Apache License 2.0

SD3 training: loss is NaN #1407

Open chongxian opened 2 weeks ago

chongxian commented 2 weeks ago

(Two screenshots attached.) When I use this command, the loss is NaN. How can I solve this problem? Thanks for your help. The dataset is small, just 290 images, but the loss is NaN. I tried setting mixed_precision=bf16 and t5xxl_dtype=bf16, but these settings don't work; the loss is still NaN.

mliand commented 2 weeks ago

t5xxl_dtype=bf16

chongxian commented 2 weeks ago

> t5xxl_dtype=bf16

I tried this setting, but it doesn't work.

leonary commented 2 weeks ago

Your loss is NaN from the very start of training. This is probably caused by fp16 precision. Set mixed_precision=bf16, and do not declare t5xxl_dtype.
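To illustrate why fp16 is a common suspect for NaN losses, here is a minimal sketch using NumPy (not the sd-scripts code itself). fp16 cannot represent finite values above 65504, so a larger value overflows to inf, and arithmetic such as inf - inf then yields NaN. bf16 keeps the same 8-bit exponent as fp32, so its range is far wider. The value 70000.0 is an arbitrary number chosen for illustration.

```python
import numpy as np

# Largest finite value representable in fp16.
fp16_max = float(np.finfo(np.float16).max)

# Silence the overflow/invalid warnings that the next two lines trigger on purpose.
with np.errstate(over="ignore", invalid="ignore"):
    big = np.float16(70000.0)  # arbitrary value above fp16_max, overflows to inf
    diff = big - big           # inf - inf is undefined, so the result is nan

print("fp16 max:", fp16_max)
print("70000.0 as fp16:", big)
print("inf - inf:", diff)
```

Whether this is the actual cause in a given training run is a separate question; the sketch only shows the mechanism.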

chongxian commented 2 weeks ago

> Your loss is equal to nan in the initial stage of training. This should be caused by fp16 precision. Set mixed_precision=bf16, and then do not declare t5xxl_dtype.

(Screenshot attached.) It doesn't work; the loss is still NaN.

chongxian commented 2 weeks ago

I have solved the problem now, but it may be a bug in the training code.

kohya-ss commented 2 weeks ago

Please remove the *_sd3_te.npz files in the training directory whenever you change the mixed precision or t5xxl_dtype. The cache files will then be recreated.
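The cleanup step above could be scripted roughly as follows. This is a minimal sketch, not part of sd-scripts: the function name `remove_sd3_te_cache` and the example path are hypothetical, and searching subdirectories recursively is an assumption about how the training directory is laid out. It defaults to a dry run so you can check the list before anything is deleted.

```python
from pathlib import Path
from typing import List


def remove_sd3_te_cache(train_dir: str, dry_run: bool = True) -> List[Path]:
    """Find (and optionally delete) cached *_sd3_te.npz files.

    Hypothetical helper, not part of sd-scripts. Subdirectories are
    searched too, which is an assumption about the dataset layout.
    """
    stale = sorted(Path(train_dir).rglob("*_sd3_te.npz"))
    if not dry_run:
        for path in stale:
            path.unlink()  # delete the stale cache file
    return stale


if __name__ == "__main__":
    # Hypothetical path; replace with your own training directory.
    for cache_file in remove_sd3_te_cache("./train_data", dry_run=True):
        print("would remove:", cache_file)
```

Call it with `dry_run=False` once the listed files look right, then rerun training so the cache is rebuilt with the new precision settings.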