kohya-ss / sd-scripts


Training slows down considerably #1088

Open Yo1up opened 9 months ago

Yo1up commented 9 months ago

I'm running the SDXL LoRA training script on a dataset of ~50,000 images and I have noticed that training slows down considerably after several thousand training steps.

I have observed this behaviour before; however, it was never an issue previously, as I could get through multiple epochs within a couple of thousand training steps.

Training starts at 3.16 s/it and begins slowing down around step 6000. So far I have seen it slow to 5.32 s/it, and I have also seen the training script halt completely after enough time.

I am using gradient accumulation of 5, a rank 8 LoRA with a network alpha of 4, a target resolution of 1024x1024, and the Prodigy optimizer. I'm running on Ubuntu Server with a 3090 Ti.
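
To pin down when and why the slowdown kicks in, one option is to log GPU state alongside training and correlate it with the s/it numbers from the training log. A minimal sketch using nvidia-smi's query mode (the log file name is arbitrary):

# Sample GPU memory, utilization, temperature, and power draw every 60 s
# into a CSV, for later correlation with the step timings in the training log.
nvidia-smi \
    --query-gpu=timestamp,memory.used,memory.total,utilization.gpu,temperature.gpu,power.draw \
    --format=csv \
    -l 60 >> gpu_usage_log.csv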

Here is the output from nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti     Off | 00000000:06:10.0 Off |                  Off |
| 30%   49C    P2              99W / 450W |  20738MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     11642      C   ...naconda3/envs/sd-scripts/bin/python    20732MiB |
+---------------------------------------------------------------------------------------+

My system RAM usage has been at 11.61 GB of 12 GB since the start of training.

No programs are running aside from Tailscale and system applications.
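
With host RAM that close to full, swap pressure is another candidate for the slowdown, since the dataloader workers live in main memory. A quick way to check, assuming the standard free and vmstat tools are installed:

# One-shot report of memory and swap usage.
free -h
# Print memory and swap activity every 5 seconds; sustained nonzero values in
# the si/so (swap-in/swap-out) columns during training mean the box is thrashing.
vmstat 5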

Here are my flags:

accelerate launch --num_cpu_threads_per_process 4 sdxl_train_network.py \
    --logging_dir="logs" --log_prefix="$lora_name" \
    --network_module="networks.lora" \
    --max_data_loader_n_workers=1 --persistent_data_loader_workers \
    --caption_extension=".txt" --shuffle_caption --keep_tokens="$keep_tags" --max_token_length=225 \
    --prior_loss_weight=1 \
    --mixed_precision="fp16" --save_precision="fp16" \
    --xformers --cache_latents \
    --save_model_as=safetensors \
    --train_data_dir="$image_dir" --output_dir="$unique_output" --reg_data_dir="$reg_dir" --pretrained_model_name_or_path="$model_dir$model" \
    --output_name="$full_name"_ \
    --learning_rate="$unet_lr" --unet_lr="$unet_lr" --text_encoder_lr="$text_enc_lr" \
    --max_train_steps="$real_steps" --save_every_n_steps="$save_nth_step" \
    --resolution="$base_res" \
    --enable_bucket --min_bucket_reso="$min_bucket_res" --max_bucket_reso="$max_bucket_res" \
    --train_batch_size="$batch_size" \
    --network_dim="$net_dim" --network_alpha="$net_alpha" \
    --optimizer_type="$optimizer" \
    --lr_scheduler="$scheduler" \
    --noise_offset="$noise_offset" \
    --seed=0 \
    --sample_every_n_steps="$save_nth_step" \
    --sample_prompts="$prompts" \
    --sample_sampler="k_euler_a" \
    --gradient_accumulation_steps="$grad_acc_step" \
    --min_snr_gamma=5 \
    --lowram \
    --bucket_no_upscale \
    --output_config \
    --no_half_vae \
    --cache_latents_to_disk \
    --save_state \
    --resume="/home/yolup/nasStorage/loraDatasets/furry_master_data/lora_tests/furry_master_data-net-alpha-4-net-dim-8-50000steps_ver-a1.0/furry_master_data-net-alpha-4-net-dim-8-50000steps_ver-a1.0_-step00008000-state"

And here is the settings area of the launch script:

# Training Config

    # Basic Settings:
        real_steps=50000 # Total number of steps.
        save_amount=5 # How many LoRA checkpoints to save (e.g., 2000 steps / 10 saves == 1 save every 200 steps, 10 saves in total)
        base_res=1024 # The "base resolution" to train at.
        max_aspect=1.5 # Determines the most extreme allowed aspect ratio for bucketing.
        batch_size=1 # Number of images to process per step. Speeds things up, but demands VRAM.

    # Advanced Settings:
        #unet_lr=$(awk 'BEGIN { printf "%.10f", 5*10^(-5) }') # Unet learning rate.
        #text_enc_lr=$(awk 'BEGIN { printf "%.10f", 5*10^(-5) }') # Text encoder learning rate.
        unet_lr=0.05 # Unet learning rate.
        text_enc_lr=0.05 # Text encoder learning rate.
        grad_acc_step=5 # Accumulates gradients over this many steps before each update, emulating a larger effective batch size.
        net_dim=8 # Network dimensions. 
        net_alpha=4 # Network alpha. 
        optimizer="Prodigy" # Valid values: "AdamW", "AdamW8bit", "Lion", "SGDNesterov", "SGDNesterov8bit", "DAdaptation", "AdaFactor", "Prodigy"
        scheduler="cosine_with_restarts" # Valid values: "linear", "cosine", "cosine_with_restarts", "polynomial", "constant", "constant_with_warmup", "adafactor"
        noise_offset=0.0005 # Increases dynamic range of outputs. Every 0.1 dampens learning quite a bit, do more steps or higher training rates to compensate.
        keep_tags=0 # Keeps <n> tags at the front without shuffling them. 0 if no regularization, 1 with regularization; multiple concepts may need > 1.

    # Name Settings
        #lora_name="hoakan-net-alpha-12" # Name of LoRA
        lora_name="${projectLoraName}-net-alpha-${net_alpha}-net-dim-${net_dim}-${real_steps}steps"
        version="ver-a1.0" # Version number (Completely optional, but recommended)
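
The launch command above also references $save_nth_step, which is not defined in this excerpt. Presumably it is derived from the settings above; a hypothetical reconstruction:

# Hypothetical: how save_nth_step is presumably computed elsewhere in the script,
# per the save_amount comment (50000 steps / 5 saves == a save every 10000 steps).
save_nth_step=$((real_steps / save_amount))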

And I just noticed my learning rate is set incorrectly, but fixing that won't resolve the slowdown.

V1sionVerse commented 9 months ago

It's probably a memory leak: some GPU memory gets swapped out into "shared GPU memory" (i.e. main memory), which slows training considerably, and noticeably so even if only a small amount is swapped. I have found that each time a sample is generated, a bit of GPU memory seems to be released and, most crucially, memory that had been swapped out to main memory seems to move back into dedicated GPU memory.

In my setup I don't use the samples generated by kohya_ss at all (I have a second GPU for inference, which produces much higher quality results through the Fooocus web UI), but I still let kohya_ss produce a dummy 512x512 sample every 500 steps for exactly this reason. This is on Windows 10, though, so YMMV on other operating systems.
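
If you want that dummy-sample workaround without paying for full-resolution generations, the sample prompt files in sd-scripts accept per-prompt options such as --w/--h (width/height) and --s (sampler steps), so a cheap low-resolution sample can be requested even while training at 1024. A sketch (the file name and prompt text are placeholders):

# Write a one-line prompts file requesting a small, fast sample.
cat > dummy_prompts.txt <<'EOF'
a simple test prompt --w 512 --h 512 --s 20
EOF
# Then point the trainer at it:
#   --sample_prompts="dummy_prompts.txt" --sample_every_n_steps=500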

I would also recommend not using the training GPU for anything else (i.e. connect your monitor to either an integrated GPU or a cheap secondary GPU) to limit the chance of other applications allocating VRAM on the GPU you're training on.
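
On a multi-GPU box you can also pin each process to a specific device with the standard CUDA_VISIBLE_DEVICES environment variable (device indices as reported by nvidia-smi), so the trainer and any inference tools never share a card. A sketch, where inference_tool.py is a stand-in for whatever you run on the second GPU:

# Run training on GPU 0 only; CUDA sees no other devices from this process.
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_cpu_threads_per_process 4 sdxl_train_network.py ...
# Run inference on GPU 1 in the same way.
CUDA_VISIBLE_DEVICES=1 python inference_tool.py ...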