kohya-ss / sd-scripts


Training slows down considerably #1088

Open Yo1up opened 9 months ago

Yo1up commented 9 months ago

I'm running the SDXL LoRA training script on a dataset of ~50,000 images and I have noticed that training slows down considerably after several thousand training steps.

I have observed this behaviour before; however, it was never an issue previously, as I could get through multiple epochs within a couple of thousand training steps.

Training starts at 3.16 s/it and begins slowing down around step 6000. So far I have seen it slow to 5.32 s/it, and I have also seen the training script halt completely after enough time.

I am using gradient accumulation of 5, a rank 8 LoRA with a network alpha of 4, a target resolution of 1024x1024, and the Prodigy optimizer. I'm running on Ubuntu Server with a 3090 Ti.
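
To pin down when and why the slowdown kicks in, one option is to log GPU state alongside training and correlate it with the s/it numbers from the training log. A minimal sketch using nvidia-smi's query mode (the log file name is arbitrary):

# Sample GPU memory, utilization, temperature, and power draw every 60 s
# into a CSV, for later correlation with the step timings in the training log.
nvidia-smi \
    --query-gpu=timestamp,memory.used,memory.total,utilization.gpu,temperature.gpu,power.draw \
    --format=csv \
    -l 60 >> gpu_usage_log.csv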

Here is the output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off | 00000000:06:10.0 Off |                  Off |
| 30%   49C    P2              99W / 450W |  20738MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     11642      C   ...naconda3/envs/sd-scripts/bin/python    20732MiB |
+---------------------------------------------------------------------------------------+

My system RAM usage has been at 11.61 GB of 12 GB since the start of training.

No programs are running aside from Tailscale and system applications.
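
With host RAM that close to full, swap pressure is another candidate for the slowdown, since the dataloader workers live in main memory. A quick way to check, assuming the standard free and vmstat tools are installed:

# One-shot report of memory and swap usage.
free -h
# Print memory and swap activity every 5 seconds; sustained nonzero values in
# the si/so (swap-in/swap-out) columns during training mean the box is thrashing.
vmstat 5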

Here are my flags:

accelerate launch --num_cpu_threads_per_process 4 sdxl_train_network.py \
    --logging_dir="logs" --log_prefix="$lora_name" \
    --network_module="networks.lora" \
    --max_data_loader_n_workers=1 --persistent_data_loader_workers \
    --caption_extension=".txt" --shuffle_caption --keep_tokens="$keep_tags" --max_token_length=225 \
    --prior_loss_weight=1 \
    --mixed_precision="fp16" --save_precision="fp16" \
    --xformers --cache_latents \
    --save_model_as=safetensors \
    --train_data_dir="$image_dir" --output_dir="$unique_output" --reg_data_dir="$reg_dir" --pretrained_model_name_or_path="$model_dir$model" \
    --output_name="$full_name"_ \
    --learning_rate="$unet_lr" --unet_lr="$unet_lr" --text_encoder_lr="$text_enc_lr" \
    --max_train_steps="$real_steps" --save_every_n_steps="$save_nth_step" \
    --resolution="$base_res" \
    --enable_bucket --min_bucket_reso="$min_bucket_res" --max_bucket_reso="$max_bucket_res" \
    --train_batch_size="$batch_size" \
    --network_dim="$net_dim" --network_alpha="$net_alpha" \
    --optimizer_type="$optimizer" \
    --lr_scheduler="$scheduler" \
    --noise_offset="$noise_offset" \
    --seed=0 \
    --sample_every_n_steps="$save_nth_step" \
    --sample_prompts="$prompts" \
    --sample_sampler="k_euler_a" \
    --gradient_accumulation_steps="$grad_acc_step" \
    --min_snr_gamma=5 \
    --lowram \
    --bucket_no_upscale \
    --output_config \
    --no_half_vae \
    --cache_latents_to_disk \
    --save_state \
    --resume="/home/yolup/nasStorage/loraDatasets/furry_master_data/lora_tests/furry_master_data-net-alpha-4-net-dim-8-50000steps_ver-a1.0/furry_master_data-net-alpha-4-net-dim-8-50000steps_ver-a1.0_-step00008000-state"

And here is the settings area of the launch script:

# Training Config

    # Basic Settings:
        real_steps=50000 # Total number of steps.
        save_amount=5 # How many LoRA checkpoints to save (e.g., 2000 steps / 10 saves == 1 save every 200 steps, 10 saves in total)
        base_res=1024 # The "base resolution" to train at.
        max_aspect=1.5 # Determines the most extreme allowed aspect ratio for bucketing.
        batch_size=1 # Number of images to process per step. Speeds things up, but demands VRAM.

    # Advanced Settings:
        #unet_lr=$(awk 'BEGIN { printf "%.10f", 5*10^(-5) }') # Unet learning rate.
        #text_enc_lr=$(awk 'BEGIN { printf "%.10f", 5*10^(-5) }') # Text encoder learning rate.
        unet_lr=0.05 # Unet learning rate.
        text_enc_lr=0.05 # Text encoder learning rate.
        grad_acc_step=5 # Accumulates gradients over this many steps before each update, emulating a larger effective batch size.
        net_dim=8 # Network dimensions. 
        net_alpha=4 # Network alpha. 
        optimizer="Prodigy" # Valid values: "AdamW", "AdamW8bit", "Lion", "SGDNesterov", "SGDNesterov8bit", "DAdaptation", "AdaFactor", "Prodigy"
        scheduler="cosine_with_restarts" # Valid values: "linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup", "adafactor"
        noise_offset=0.0005 # Increases dynamic range of outputs. Every 0.1 dampens learning quite a bit, do more steps or higher training rates to compensate.
        keep_tags=0 # Keeps <n> tags at the front without shuffling them. 0 if no regularization, 1 with regularization; multiple concepts may need > 1.

    # Name Settings
        #lora_name="hoakan-net-alpha-12" # Name of LoRA
        lora_name="${projectLoraName}-net-alpha-${net_alpha}-net-dim-${net_dim}-${real_steps}steps"
        version="ver-a1.0" # Version number (Completely optional, but recommended)
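
The launch command above also references $save_nth_step, which is not defined in this excerpt. Presumably it is derived from the settings above; a hypothetical reconstruction:

# Hypothetical: how save_nth_step is presumably computed elsewhere in the script,
# per the save_amount comment (50000 steps / 5 saves == a save every 10000 steps).
save_nth_step=$((real_steps / save_amount))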

And I just noticed my learning rate is set incorrectly, but fixing that won't resolve the slowdown.

V1sionVerse commented 9 months ago

It's probably a memory leak: some GPU memory gets swapped out into "shared GPU memory" (i.e. main memory), which slows training considerably, and noticeably so even if only a small amount is swapped. I have found that each time a sample is generated, a bit of GPU memory seems to be released and, most crucially, memory that had been swapped out to main memory seems to move back into dedicated GPU memory.

In my setup I don't use the samples generated by kohya_ss at all (I have a second GPU for inference, which produces much higher quality results through the Fooocus web UI), but I still let kohya_ss produce a dummy 512x512 sample every 500 steps for exactly this reason. This is on Windows 10, though, so YMMV on other operating systems.
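
If you want that dummy-sample workaround without paying for full-resolution generations, the sample prompt files in sd-scripts accept per-prompt options such as --w/--h (width/height) and --s (sampler steps), so a cheap low-resolution sample can be requested even while training at 1024. A sketch (the file name and prompt text are placeholders):

# Write a one-line prompts file requesting a small, fast sample.
cat > dummy_prompts.txt <<'EOF'
a simple test prompt --w 512 --h 512 --s 20
EOF
# Then point the trainer at it:
#   --sample_prompts="dummy_prompts.txt" --sample_every_n_steps=500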

I would also recommend not using the training GPU for anything else (i.e. connect your monitor to either an integrated GPU or a cheap secondary GPU) to limit the chance of other applications allocating VRAM on the GPU you're training on.
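
On a multi-GPU box you can also pin each process to a specific device with the standard CUDA_VISIBLE_DEVICES environment variable (device indices as reported by nvidia-smi), so the trainer and any inference tools never share a card. A sketch, where inference_tool.py is a stand-in for whatever you run on the second GPU:

# Run training on GPU 0 only; CUDA sees no other devices from this process.
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_cpu_threads_per_process 4 sdxl_train_network.py ...
# Run inference on GPU 1 in the same way.
CUDA_VISIBLE_DEVICES=1 python inference_tool.py ...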