kohya-ss / sd-scripts


SD 1.5 DreamBooth training uses more VRAM than SDXL DreamBooth #1075

Open FurkanGozukara opened 9 months ago

FurkanGozukara commented 9 months ago

I am using the same settings in both cases, and how much VRAM SD 1.5 uses is insane.

The full config is below.

All training and reg images are 768x768.

However, SD 1.5 does train faster than SDXL. That is the only expected part :)

SDXL uses 17 GB of VRAM while SD 1.5 uses 22.5 GB. Tested on RunPod Linux, no desktop GUI.

```
accelerate launch --num_cpu_threads_per_process=4 "./train_db.py" \
  --pretrained_model_name_or_path="/workspace/stable-diffusion-webui/models/Stable-diffusion/hyper_real_v3.safetensors" \
  --train_data_dir="/workspace/train" --reg_data_dir="/workspace/reg" \
  --resolution="768,768" \
  --output_dir="/workspace/stable-diffusion-webui/models/Stable-diffusion" \
  --logging_dir="/workspace/stable-diffusion-webui/models/Stable-diffusion" \
  --save_model_as=safetensors --full_bf16 --output_name="6e5" \
  --lr_scheduler_num_cycles="1" --max_data_loader_n_workers="0" \
  --learning_rate_te="6e-05" --learning_rate="6e-05" --lr_scheduler="constant" \
  --train_batch_size="1" --max_train_steps="4500" --save_every_n_epochs="1" \
  --mixed_precision="bf16" --save_precision="bf16" \
  --cache_latents --cache_latents_to_disk \
  --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False weight_decay=0.01 \
  --max_data_loader_n_workers="0" --bucket_reso_steps=64 \
  --gradient_checkpointing --bucket_no_upscale --noise_offset=0.0 --max_grad_norm=0.0 --no_half_vae
```
FurkanGozukara commented 9 months ago

Also, whether --no_half_vae is used or not doesn't make a bit of difference; the VRAM usage is exactly the same.

kohya-ss commented 9 months ago

This is because --full_bf16 is not supported in SD 1.5 training (train_db.py and fine_tune.py). I'd like to add the feature in the near future.

When --no_half_vae is used, the VAE is float32 and uses more RAM, but the VAE is kept in main RAM during training, so the VRAM usage is the same.
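
(A minimal sketch of what this means in practice; this is illustrative PyTorch, not the actual sd-scripts code, and the toy "VAE" below is made up.)

```python
import torch
import torch.nn as nn

# Toy stand-in for the VAE encoder: 768x768 RGB image -> 4x96x96 "latents".
vae = nn.Conv2d(3, 4, kernel_size=8, stride=8)

# With --cache_latents, every training image goes through the VAE exactly once
# before training starts, regardless of whether the VAE is fp32 or half precision.
images = torch.randn(10, 3, 768, 768)
with torch.no_grad():
    cached_latents = [vae(img.unsqueeze(0)).cpu() for img in images]

# After caching, the VAE stays in main (CPU) RAM for the rest of training,
# so its dtype (--no_half_vae vs. half precision) does not affect VRAM.
vae.to("cpu")

# The training loop only ever moves the small cached latents to the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
for latent in cached_latents:
    latent = latent.to(device)
    # ... U-Net forward/backward on `latent` here ...
```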

FurkanGozukara commented 9 months ago

> This is because --full_bf16 is not supported in SD 1.5 training (train_db.py and fine_tune.py). I'd like to add the feature in the near future.
>
> When --no_half_vae is used, the VAE is float32 and uses more RAM, but the VAE is kept in main RAM during training, so the VRAM usage is the same.

Thanks a lot, looking forward to it.

FurkanGozukara commented 9 months ago

When I tried to train SD 1.5 at 1024x1024 pixels without xFormers, it used more than 24 GB of VRAM and I got an error.

The same settings on SDXL use 17 GB of VRAM at 1024x1024, no xFormers.

When xFormers is enabled, SD 1.5 drops to 10 GB. SDXL, by the way, does not come down by anything like that amount; it saves 1-2 GB at most, whereas SD 1.5 saved more than 14 GB of VRAM.

Do you know why xFormers makes such a huge, dramatic difference for SD 1.5 training, @kohya-ss?

I think some optimizations are mistakenly not being activated when xFormers is disabled.

FurkanGozukara commented 9 months ago

Full bf16 used exactly the same VRAM as non-mixed-precision (float) training for SD 1.5, so neither full bf16 nor mixed precision training is working for SD 1.5 DreamBooth yet.

Also, I set the text encoder learning rate to 0 and it uses the same VRAM as when the text encoder is being trained.

Everything I am talking about here is SD 1.5 DreamBooth.

kohya-ss commented 9 months ago

SD 1.5 has transformer blocks at the first depth. If the image resolution is 768x768, the latent resolution is 96x96, so the sequence length of the input to the transformer at the first depth is H*W = 96*96 = 9,216. In my understanding, since attention uses memory proportional to the square of the sequence length, this consumes a very large amount of memory.

In contrast, SDXL only has transformers from the second depth onward, so the sequence length is 48*48 = 2,304.

2304^2 = 5,308,416, which is clearly less than 9216^2 = 84,934,656. So even though SDXL has more transformer blocks, it uses less memory than SD 1.5 at larger resolutions.
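
(To put rough numbers on this, a back-of-envelope sketch; the 8 attention heads and fp16 attention maps are assumptions for illustration, not figures taken from sd-scripts.)

```python
# Memory of one naively materialized self-attention map (seq_len x seq_len),
# per transformer block, ignoring everything else.
def attn_map_bytes(seq_len: int, heads: int = 8, bytes_per_elem: int = 2) -> int:
    return seq_len * seq_len * heads * bytes_per_elem

# SD 1.5: attention already at the first depth -> 96*96 = 9,216 tokens at 768x768.
sd15 = attn_map_bytes(96 * 96)

# SDXL: attention starts one depth lower -> 48*48 = 2,304 tokens at most.
sdxl = attn_map_bytes(48 * 48)

print(f"SD 1.5: {sd15 / 2**20:.0f} MiB per attention map")  # ~1296 MiB
print(f"SDXL:   {sdxl / 2**20:.0f} MiB per attention map")  # ~81 MiB
```

This would also explain the dramatic xFormers effect reported above: memory-efficient attention avoids materializing the full 9216x9216 matrix at the first depth, which is exactly where SD 1.5 pays the most.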

full_bf16 will not work, but mixed precision with bf16 should work. Could you please check your settings?

FurkanGozukara commented 9 months ago

> SD 1.5 has transformer blocks at the first depth. If the image resolution is 768x768, the latent resolution is 96x96, so the sequence length of the input to the transformer at the first depth is H*W = 96*96 = 9,216. In my understanding, since attention uses memory proportional to the square of the sequence length, this consumes a very large amount of memory.
>
> In contrast, SDXL only has transformers from the second depth onward, so the sequence length is 48*48 = 2,304.
>
> 2304^2 = 5,308,416, which is clearly less than 9216^2 = 84,934,656. So even though SDXL has more transformer blocks, it uses less memory than SD 1.5 at larger resolutions.
>
> full_bf16 will not work, but mixed precision with bf16 should work. Could you please check your settings?

Thanks, I should test it. What difference does mixed precision make vs. full bf16? Can you give some more info?
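
(Since this question is left open in the thread, here is a sketch of the difference in plain PyTorch; this is illustrative code, not the sd-scripts implementation.)

```python
import torch
import torch.nn as nn

x = torch.randn(8, 1024)

# Mixed precision (--mixed_precision="bf16"): weights stay float32, and only the
# forward/backward compute runs in bf16 via autocast. Gradients and optimizer
# state therefore remain full precision.
model = nn.Linear(1024, 1024)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()
print(model.weight.dtype, model.weight.grad.dtype)  # torch.float32 torch.float32

# Full bf16 (--full_bf16): the weights themselves are cast to bf16, so gradients
# (and optimizer state) are bf16 too, roughly halving that part of the memory
# at the cost of some numerical precision.
model_bf16 = nn.Linear(1024, 1024).to(torch.bfloat16)
loss = model_bf16(x.to(torch.bfloat16)).float().pow(2).mean()
loss.backward()
print(model_bf16.weight.dtype, model_bf16.weight.grad.dtype)  # torch.bfloat16 torch.bfloat16
```

In other words, mixed precision alone already gives bf16 compute, but the fp32 weights, gradients, and optimizer (Adafactor) state still take the full-precision amount of memory; --full_bf16 is what shrinks those, and per the comments above that flag currently only takes effect for SDXL, not for train_db.py on SD 1.5.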