kohya-ss / sd-scripts


Kohya started using more VRAM for SDXL, more than it should #1131

Open FurkanGozukara opened 7 months ago

FurkanGozukara commented 7 months ago

I have a config that was running fine on Kaggle in previous versions.

Right now it is failing on a 15 GB GPU.

This should not happen.

The same settings on OneTrainer use less than 13.5 GB of VRAM.

Here it fails with 15 GB.

It wasn't failing before.

All images are 1024x1024 and all latents are cached.

Here is the full training command used.

I did trainings on Kaggle in the past and this exact command was working; I even have a video of it here:

https://youtu.be/16-b1AjvyBE

  accelerate launch --num_cpu_threads_per_process=4 "./sdxl_train.py" \
    --max_grad_norm=0.0 --no_half_vae --train_text_encoder \
    --ddp_timeout=10000000 --ddp_gradient_as_bucket_view \
    --bucket_no_upscale --bucket_reso_steps=64 \
    --cache_latents --cache_latents_to_disk --full_fp16 \
    --gradient_checkpointing \
    --learning_rate="1e-05" --learning_rate_te1="3e-06" \
    --logging_dir="/kaggle/working/results/log" \
    --lr_scheduler="constant" --lr_scheduler_num_cycles="1" \
    --max_data_loader_n_workers="0" \
    --resolution="1024,1024" --max_train_steps="1500" \
    --mem_eff_attn --mixed_precision="fp16" \
    --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 \
    --optimizer_type="Adafactor" \
    --output_dir="/kaggle/working/results/model" \
    --output_name="2024_02_21_kaggle" \
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
    --reg_data_dir="/kaggle/working/results/reg" \
    --save_every_n_epochs="1" --save_model_as=safetensors \
    --save_precision="fp16" --train_batch_size="1" \
    --train_data_dir="/kaggle/working/results/img" \
    --vae="stabilityai/sdxl-vae" --xformers
Traceback (most recent call last):
  File "/kaggle/working/kohya_ss/./sdxl_train.py", line 779, in <module>
    train(args)
  File "/kaggle/working/kohya_ss/./sdxl_train.py", line 594, in train
    optimizer.step()
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 132, in step
    self.scaler.step(self.optimizer, closure)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 374, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/optimizer.py", line 185, in patched_step
    return method(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/optimization.py", line 715, in step
    update = (grad**2) + group["eps"][0]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 242.00 MiB (GPU 1; 14.75 GiB total capacity; 14.34 GiB already allocated; 53.06 MiB free; 14.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
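As a side note, the max_split_size_mb hint at the end of the traceback refers to PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable, which can be set before launching. The lines below are only a sketch of that suggestion; the 128 MB split size is an illustrative value, not something recommended in this thread:

  # Allocator tuning suggested by the error message above; 128 is only an example value.
  export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
  # ...then re-run the same accelerate launch command shown above.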
FurkanGozukara commented 7 months ago

@kohya-ss

FurkanGozukara commented 7 months ago

Currently it uses a minimum of 15.7 GB on Kaggle.

So it works with the P100 GPU, but that means people can't use the much faster T4, and Kaggle gives dual T4s.

Also, people who have 16 GB GPUs can't use it properly either.

kohya-ss commented 7 months ago

With these options, Text Encoder 2 is trained with the learning rate=1e-5, because --train_text_encoder is specified. I think OneTrainer may train Text Encoder 1 only. If you want to stop Text Encoder 2 training, please specify --learning_rate_te2=0.
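For reference, the learning-rate flags from the command above would then look like the sketch below. This only illustrates the suggestion in the comment; every other flag stays exactly as in the original command:

  # Added flag (per the comment above): stop training Text Encoder 2.
  --train_text_encoder \
  --learning_rate="1e-05" \
  --learning_rate_te1="3e-06" \
  --learning_rate_te2=0 \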

FurkanGozukara commented 7 months ago

With these options, Text Encoder 2 is trained with the learning rate=1e-5, because --train_text_encoder is specified. I think OneTrainer may train Text Encoder 1 only. If you want to stop Text Encoder 2 training, please specify --learning_rate_te2=0.

Wow, in that case this is a bug, because this is what the bmaltais GUI generates. I will report it to him, test it, and reply back here. Thank you.

So when we don't provide a TE2 learning rate, what does the trainer use? Because this is a big problem for me.

FurkanGozukara commented 7 months ago

Yep, I verified this bug exists and it breaks my config :/

Thank you so much, Kohya.

Iipython commented 7 months ago

Hey, I am encountering the same problem today!! I have two clones of sd-scripts: one was cloned in December 2023, and the other was downloaded today. But I found the new code always reports "out of memory" with the same configuration, as follows:

  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  --vae=madebyollin/sdxl-vae-fp16-fix \
  --dataset_config=/home/lyh/sdvs/sd-scripts/config/finetune.toml \
  --output_dir=/home/lyh/sd-scripts/output/finetune_15W \
  --output_name=finetune_15W \
  --save_model_as=safetensors \
  --save_every_n_epochs=1 \
  --save_precision="fp16" \
  --max_token_length=225 \
  --min_timestep=0 \
  --max_timestep=1000 \
  --max_train_epochs=2000 \
  --learning_rate=4e-6 \
  --lr_scheduler="constant" \
  --optimizer_type="AdamW8bit" \
  --xformers \
  --gradient_checkpointing \
  --gradient_accumulation_steps=128 \
  --mem_eff_attn \
  --mixed_precision="fp16" \
  --logging_dir=logs \

The weird thing is the VRAM occupation with the new code: (screenshot)

The VRAM occupation with the old code: (screenshot)

Why? Where is the difference?

kohya-ss commented 7 months ago

As I mentioned in #1141, the multi-GPU issue seems to have a different cause.

hufenghufeng commented 4 months ago

Hey, I am encountering the same problem today!! I have two clones of sd-scripts: one was cloned in December 2023, and the other was downloaded today. But I found the new code always reports "out of memory" with the same configuration, as follows:

  --pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0 \
  --vae=madebyollin/sdxl-vae-fp16-fix \
  --dataset_config=/home/lyh/sdvs/sd-scripts/config/finetune.toml \
  --output_dir=/home/lyh/sd-scripts/output/finetune_15W \
  --output_name=finetune_15W \
  --save_model_as=safetensors \
  --save_every_n_epochs=1 \
  --save_precision="fp16" \
  --max_token_length=225 \
  --min_timestep=0 \
  --max_timestep=1000 \
  --max_train_epochs=2000 \
  --learning_rate=4e-6 \
  --lr_scheduler="constant" \
  --optimizer_type="AdamW8bit" \
  --xformers \
  --gradient_checkpointing \
  --gradient_accumulation_steps=128 \
  --mem_eff_attn \
  --mixed_precision="fp16" \
  --logging_dir=logs \

The weird thing is the VRAM occupation with the new code: (screenshot)

The VRAM occupation with the old code: (screenshot)

Why? Where is the difference?

Same problem.