kohya-ss / sd-scripts


SD3 cannot achieve good fitting results on large data sets #1390

Open · leonary opened this issue 1 week ago

leonary commented 1 week ago

I conducted several tests. DAdaptAdam achieved fitting results similar to SD15/SDXL on a dataset consisting of a single image. However, on datasets with hundreds of images, the fitting speed of DAdaptAdam dropped significantly. The figure below shows three loss-reduction curves: a small dataset with a single image, a large dataset using Cosine With Restarts, and a large dataset using Cosine.

[Loss curves: not smoothed, and smoothed with smoothing strength 1]

When using a conventional optimizer such as Adam, the loss curve is better than with DAdaptAdam: where the final loss with DAdaptAdam is 0.142, Adam can reach 0.135, but that is still somewhat high. I am currently testing higher learning rates to observe the loss trend, but across multiple training scripts I seem to observe that the loss is difficult to reduce on large datasets, resulting in poor learning. Could this be an issue with the SD3 model itself?

dill-shower commented 1 week ago

Are you fine-tuning a LoRA or the full model?

leonary commented 1 week ago

sd-scripts only supports full fine-tuning for SD3 at the moment.

mliand commented 1 week ago

I think you should examine your dataset first.

alittlebitfun commented 1 week ago

same

SatMa34 commented 1 week ago

Hi, what training parameters did you use? Did you run into steadily increasing memory usage that eventually caused a crash? I tried to train SD3 with 50k images, and memory usage kept increasing; after the first epoch it hit the limit and crashed.

leonary commented 1 week ago

Hi, what training parameters did you use? Did you run into steadily increasing memory usage that eventually caused a crash? I tried to train SD3 with 50k images, and memory usage kept increasing; after the first epoch it hit the limit and crashed.

accelerate launch --mixed_precision="bf16" --num_processes=1 --num_machines=1 --num_cpu_threads_per_process=2 "/root/sd3_train.py" \
  --bucket_reso_steps=64 --caption_extension=".txt" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 \
  --learning_rate="3e-6" --loss_type="l2" --lr_scheduler="cosine" --lr_scheduler_num_cycles="1" \
  --max_data_loader_n_workers="0" --max_grad_norm="1" --resolution="1024,1024" --max_train_steps="" \
  --optimizer_type="AdamW" --output_name="last" --output_dir="/root/181" \
  --pretrained_model_name_or_path="/root/SD3.safetensors" --save_every_n_epochs="1" --save_model_as=safetensors \
  --save_precision="fp16" --train_batch_size="1" --train_data_dir="/root/181/full" \
  --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --cache_latents --cache_latents_to_disk \
  --lr_warmup_steps=

Hi, my training ran on a 48GB graphics card with a batch size of 1, so I did not encounter any memory issues. You need to use bf16 precision and a batch size of 1; if you still run out of memory, you will need to switch to a graphics card with more memory.

Also, how does your training loss trend look? Could you show me your loss curve?

SatMa34 commented 1 week ago

Oh, I mean the CPU memory, not the GPU memory; I didn't run into an OOM error. As for the loss trend, because of the CPU memory problem I could only train about 2000 steps, but the training loss seemed to decrease normally (from about 0.22 to 0.17).

leonary commented 1 week ago

Oh, I mean the CPU memory, not the GPU memory; I didn't run into an OOM error. As for the loss trend, because of the CPU memory problem I could only train about 2000 steps, but the training loss seemed to decrease normally (from about 0.22 to 0.17).

My system has 100GB of RAM, so I haven't encountered training failures due to memory growth. It does sound like an out-of-memory issue; you could provide detailed information to kohya for his assessment. @kohya-ss
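In case it helps the diagnosis, one quick way to confirm steady RAM growth is to log the process RSS every N steps. A minimal sketch, assuming psutil is available and using hypothetical loop variables (this is not sd-scripts code):

```python
import os
import psutil  # assumption: psutil is installed (pip install psutil)

_proc = psutil.Process(os.getpid())

def log_rss(step: int) -> None:
    """Print the resident set size in GB; call once every N training steps."""
    rss_gb = _proc.memory_info().rss / 1024 ** 3
    print(f"step {step}: RSS {rss_gb:.2f} GB")

# e.g. inside the training loop:
#   if global_step % 100 == 0:
#       log_rss(global_step)
```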

A loss of 0.22 is indeed too high, and even 0.17 is high; such a decrease doesn't tell us much. I hope you can complete the training soon. I look forward to your results, as they may help me, since I am currently unable to obtain good enough fitting results even on smaller datasets.

kohya-ss commented 1 week ago

Several people on X (Twitter) seem to have completed training on larger datasets, but I will test with a large dataset tomorrow.

leonary commented 1 week ago

Thank you very much, I hope to see your test results as soon as possible.

kohya-ss commented 1 week ago

I fixed a bug where the gradients were in bf16 during mixed precision training even without the --full_bf16 option. This might improve results when training with mixed precision.
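For context, here is a minimal PyTorch sketch (not sd-scripts' actual code) of the behaviour the fix restores: under ordinary mixed precision the parameters, and therefore their gradients, stay in fp32 while the forward pass runs in bf16, whereas --full_bf16-style training keeps the weights themselves in bf16, so the gradients are bf16 too.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16)

# Mixed precision: fp32 master weights, bf16 compute in the forward pass.
model = nn.Linear(16, 1)  # parameters are fp32
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()
print(model.weight.grad.dtype)  # torch.float32: grads follow the fp32 parameters

# full_bf16-style training: the parameters themselves are bf16,
# so the gradients are accumulated in bf16 as well (less memory, less precision).
model_bf16 = nn.Linear(16, 1).to(torch.bfloat16)
loss = model_bf16(x.to(torch.bfloat16)).float().pow(2).mean()
loss.backward()
print(model_bf16.weight.grad.dtype)  # torch.bfloat16
```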

leonary commented 1 week ago

I fixed a bug where the gradients were in bf16 during mixed precision training even without the --full_bf16 option. This might improve results when training with mixed precision.

Thanks for the code update and the follow-up on this question; I will complete a test training in a few hours. I don't think it will help much, though, because I have trained in fp32 precision before and the underfitting was just as severe.

leonary commented 1 week ago

I fixed a bug where the gradients were in bf16 during mixed precision training even without the --full_bf16 option. This might improve results when training with mixed precision.

I have completed the tests, and your updated code does not produce a greater loss reduction compared to my previous bf16 training. I am curious about your tests with larger datasets. Have you encountered reduced learning effectiveness as the dataset grows?

kohya-ss commented 1 week ago

I am testing on a medium sized dataset (20k) and am struggling with NaN. It does not occur on a very small toy dataset, so there must be something wrong. I will continue to investigate.
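Not sd-scripts code, but a generic way to localize where the first NaN appears in a PyTorch training loop, in case it helps narrow this down (model, loss, and global_step are hypothetical names for the loop's own objects):

```python
import torch

# Report the op that produced a NaN/Inf during backward (slow; debugging only).
torch.autograd.set_detect_anomaly(True)

def check_finite(model: torch.nn.Module, loss: torch.Tensor, step: int) -> None:
    """Raise as soon as the loss or any gradient stops being finite."""
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss at step {step}: {loss.item()}")
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"non-finite gradient in {name} at step {step}")

# usage inside the loop, after loss.backward() and before optimizer.step():
#   check_finite(model, loss, global_step)
```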

mliand commented 1 week ago

I've trained on a 16k dataset, but it seems that mixed precision training is problematic, and my average loss is in the 0.09 range

leonary commented 1 week ago

I've trained on a 16k dataset, but it seems that mixed precision training is problematic, and my average loss is in the 0.09 range

If you don't encounter NaN, kohya shouldn't encounter it either. Can you tell us the learning rate and repeat count you used? Maybe kohya just used too high a learning rate? Until today's fix, mixed precision was problematic: bf16 mixed precision effectively behaved as if --full_bf16 were enabled. Also, what was the starting loss of your training?

mliand commented 1 week ago

I've trained on a 16k dataset, but it seems that mixed precision training is problematic, and my average loss is in the 0.09 range

If you don't encounter NaN, kohya shouldn't encounter it either. Can you tell us the learning rate and repeat count you used? Maybe kohya just used too high a learning rate? Until today's fix, mixed precision was problematic: bf16 mixed precision effectively behaved as if --full_bf16 were enabled. Also, what was the starting loss of your training?

[Screenshot: loss curve]

mixed_precision = "bf16"
full_bf16 = false
weighting_scheme = "logit_normal"
sdpa = true
optimizer_type = "AdamW"
learning_rate = 1e-5
dataset_repeats = 1
max_train_epochs = 10

leonary commented 1 week ago

At around 2000 steps, the loss is 0.0932, and as you mentioned earlier, the loss remains at 0.09 by the end of the training. The loss curve indeed reflects this, as the loss doesn't seem to decrease significantly.

What is your dataset about? How do the images generated by the model look to you? Do those images represent your dataset well?

pjrpjr commented 1 week ago

My dataset is 200 pictures, but my loss value does not decrease well. It went from 0.103 to 0.113 (after 19 hours on a 4090).

leonary commented 1 week ago

My dataset is 200 pictures, but my loss value does not decrease well. It went from 0.103 to 0.113 (after 19 hours on a 4090).

Although I know the current training is difficult to fit, the loss value shouldn't be increasing. What repeat count, optimizer, and learning rate are you using?

pjrpjr commented 1 week ago

My dataset is 200 pictures, but my loss value does not decrease well. It went from 0.103 to 0.113 (after 19 hours on a 4090).

Although I know the current training is difficult to fit, the loss value shouldn't be increasing. What repeat count, optimizer, and learning rate are you using?

repeat: 4, optimizer: PagedAdamW8bit, learning rate: 1e-5

pjrpjr commented 1 week ago

Do you have any suggestions for fine-tuning SD3? Thanks a lot.

leonary commented 1 week ago

Do you have any suggestions for fine-tuning SD3? Thanks a lot.

I have never used PagedAdamW8bit, but 1e-5 is not considered high for AdamW series optimizers, so the learning rate is probably not the reason your final loss is increasing.

Assuming you trained for 10 epochs (many people like to use 10 epochs), with repeat 4 each image is learned 4×10 = 40 times. With 200 images and a batch size of 1, that's a total of 8000 steps. Completing 8000 steps on an A40 48GB GPU takes only 1.57 hours, while your training took 19 hours. Given that the 4090's compute speed is similar to the A40's, is the high time consumption due to the 4090's smaller memory? This far exceeds the normal training time, so something is wrong.
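To make that arithmetic explicit (a quick sketch assuming batch size 1 and no gradient accumulation):

```python
images = 200
repeats = 4
epochs = 10
batch_size = 1

passes_per_image = repeats * epochs                     # 40
total_steps = images * passes_per_image // batch_size   # 8000
print(passes_per_image, total_steps)
```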

Can you provide all the parameters you are using so I can help you further diagnose the problem? However, my current guess is that you should test using a GPU with 48GB of memory.

pjrpjr commented 1 week ago

$lr = "1e-5" $unet_lr = "6e-3" $text_encoder_lr = "2e-5" $lr_scheduler = "warmup_stable_decay" $max_train_epoches = 320

pjrpjr commented 1 week ago

The left is the original image, and the right is the result after 300 epochs.

leonary commented 1 week ago

The left is the original image, and the right is the result after 300 epochs.

Given the very high unet lr you're using, the training results look pretty good! It looks like a good fit, but your loss is going up? That's weird.

pjrpjr commented 1 week ago

[Screenshot]

Maybe I got it wrong, I'm a newbie

pjrpjr commented 1 week ago

The left is the original image, and the right is the result after 300 epochs.

Given the very high unet lr you're using, the training results look pretty good! It looks like a good fit, but your loss is going up? That's weird.

Thank you! But there is still some gap between this and the original image. I am looking for better training methods so I can start larger-scale fine-tuning.

leonary commented 1 week ago

The left is the original image, and the right is the result after 300 epochs.

I initially thought that your learning rate was way too high and definitely wouldn't work, but after seeing your results, I am a bit hesitant.

However, here are some diagnostics based on conventional parameters. They can help you save a lot of training time and make the process more reasonable.

The learning rate (lr) for model training is split into two parts: the unet lr and the text encoder lr; the base lr setting is overridden by these two. A unet lr of 6e-3 is ridiculously high, and 2e-5 is also excessively high for the text encoder. A max_train_epoches of 320 is likewise a ludicrously high total epoch count. Maybe PagedAdamW8bit can handle a 6e-3 lr? I haven't used it, so I don't know.

For now, you could try lr = "1e-5", unet_lr = "1e-5", text_encoder_lr = "1e-6", max_train_epoches = 20, lr_scheduler = "cosine", and optimizer = "AdamW8bit".

For the parameters above: the text encoder lr is usually 1%-100% of the unet lr, and the unet lr should be consistent with the overall lr; 1e-5 is a good starting point. As the amount of data increases, the lr needs to be decreased. The total epochs and repeats determine how many times each image is learned, and 50-300 times is a good range. Your previous 320×4 = 1280 times is far too many and would normally cause extreme overfitting; however, because your lr was so high, the model did not overfit but instead the unet collapsed. For SD15 or SDXL such an lr would result in loss = NaN; even though the SD3 loss did not become NaN, you can still observe that the severely degraded unet cannot produce reasonable images. The lr_scheduler can be set to cosine; cosine_with_restarts requires additional setup, so cosine is fine.
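As a rough summary of these rules of thumb, a small illustrative sketch (check_plan and its arguments are made up for this example and are not sd-scripts options):

```python
def check_plan(unet_lr: float, text_encoder_lr: float, repeats: int, epochs: int):
    """Sanity-check a fine-tuning plan against the ranges described above."""
    passes_per_image = repeats * epochs
    assert 50 <= passes_per_image <= 300, (
        f"{passes_per_image} passes per image is outside the suggested 50-300 range")
    ratio = text_encoder_lr / unet_lr
    assert 0.01 <= ratio <= 1.0, "text encoder lr should be ~1%-100% of the unet lr"
    return passes_per_image, ratio

# The settings suggested above with repeat 4: 4 x 20 = 80 passes per image, ratio ~0.1
print(check_plan(unet_lr=1e-5, text_encoder_lr=1e-6, repeats=4, epochs=20))
```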

Due to issues with either SD3 or the current training scripts, even with the settings I suggested above you will observe a normal decrease in loss, but the underfitting will still be quite severe. Please wait for updates from kohya and the other developers.

chongxian commented 5 days ago

I am testing on a medium sized dataset (20k) and am struggling with NaN. It does not occur on a very small toy dataset, so there must be something wrong. I will continue to investigate.

I use a small dataset of just 290 images, but the loss is NaN. When I try an even smaller dataset, the loss becomes normal.