FurkanGozukara opened 9 months ago
Also, whether `--no_half_vae` is used or not makes no difference at all. Exactly the same VRAM usage.
This is because `--full_bf16` is not supported in SD1.5 training (train_db.py and fine_tune.py). I'd like to add the feature in the near future.
When `--no_half_vae` is used, the VAE is float32 and uses more RAM, but the VAE is kept in main RAM during training, so the VRAM usage is the same.
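For illustration, a minimal sketch of why the fp32 VAE does not show up in VRAM. This is a conceptual example, not the actual train_db.py code; the model id and scaling factor are the usual SD 1.5 values and are only assumptions here:

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel

device = "cuda"

# With --no_half_vae the VAE stays in float32, but it lives on the CPU
# (main RAM), so it only consumes system memory, not VRAM.
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")
vae.to("cpu", dtype=torch.float32)

# Only the U-Net (and optionally the text encoder) is placed on the GPU during training.
unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
unet.to(device, dtype=torch.float32)

# Latents are encoded on the CPU (or pre-cached) and only then moved to the GPU,
# so VRAM usage is the same whether the VAE is fp16 or fp32.
image = torch.randn(1, 3, 768, 768)  # dummy input image in place of a real batch
with torch.no_grad():
    latents = vae.encode(image).latent_dist.sample() * 0.18215
latents = latents.to(device)
```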
Thanks a lot, looking forward to it.
When I tried to train SD 1.5 at 1024x1024 pixels without xFormers, it used more than 24 GB VRAM and I got an error.
The same settings on SDXL use 17 GB VRAM at 1024x1024, also without xFormers.
When xFormers is enabled, SD 1.5 drops to 10 GB. By the way, xFormers does not bring down VRAM by anywhere near that amount on SDXL; it saves 1-2 GB at most there, while on SD 1.5 it saved more than 14 GB VRAM.
Do you know why xFormers makes such a huge, dramatic difference on SD 1.5 training @kohya-ss?
I think some optimizations are mistakenly not getting activated when xFormers is not enabled.
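For context, the large saving most likely comes from xFormers' memory-efficient attention, which computes attention in tiles instead of materializing the full seq_len x seq_len score matrix. A minimal sketch comparing the two paths, with shapes roughly matching SD 1.5's first attention block at 768x768 (illustrative only, and it assumes xformers is installed):

```python
import torch
import xformers.ops as xops

# SD 1.5 at 768x768: latents are 96x96, so the first-depth attention
# sequence length is 96*96 = 9216 tokens.
B, L, H, D = 1, 96 * 96, 8, 40   # batch, sequence length, heads, head dim (~320 channels / 8 heads)

q = torch.randn(B, L, H, D, device="cuda", dtype=torch.float16)
k = torch.randn(B, L, H, D, device="cuda", dtype=torch.float16)
v = torch.randn(B, L, H, D, device="cuda", dtype=torch.float16)

# Naive attention materializes an (L x L) score matrix per head:
# 8 heads * 9216 * 9216 * 2 bytes ~= 1.3 GB for a single attention layer,
# before counting softmax outputs and backward buffers.
# scores = q.transpose(1, 2) @ k.transpose(1, 2).transpose(-1, -2)  # would allocate ~1.3 GB

# xFormers computes the same result without ever storing the full matrix.
out = xops.memory_efficient_attention(q, k, v)   # shape (B, L, H, D)
```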
Full bf16 used exactly the same VRAM as non-mixed-precision float training for SD 1.5, so full bf16 or mixed precision training is still not working for SD 1.5 DreamBooth.
Also, I set the text encoder learning rate to 0 and it uses the same VRAM as when training the text encoder.
All of this is about SD 1.5 DreamBooth.
SD 1.5 has transformer blocks at the 1st depth. If the image resolution is 768x768, the latent resolution is 96x96, and the sequence length of the input to the transformer at the 1st depth is H*W = 96*96 = 9,216. In my understanding, since the transformer uses memory proportional to the square of the sequence length, this consumes a very large amount of memory.
In contrast, SDXL only has transformers after the second depth, where the sequence length is 48*48 = 2,304.
2304^2 = 5,308,416, which is clearly less than 9216^2 = 84,934,656. So even though SDXL has more transformer blocks, it uses less memory than SD 1.5 at larger resolutions.
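A quick sketch of that arithmetic (assuming a single attention score matrix and ignoring batch size and heads):

```python
# Sequence lengths at the first attention depth; latent resolution = image resolution / 8
sd15_seq = (768 // 8) ** 2        # 96*96 = 9216   (SD 1.5 attends at full latent resolution)
sdxl_seq = (768 // 16) ** 2       # 48*48 = 2304   (SDXL's first attention comes one downsample later)

# The attention score matrix has seq_len^2 elements
print(sd15_seq ** 2)              # 84,934,656
print(sdxl_seq ** 2)              #  5,308,416  -> 16x smaller per attention layer
```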
`full_bf16` will not work, but mixed precision with bf16 should work. Could you please check your settings?
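For reference on the difference, a conceptual sketch, not the actual sd-scripts implementation: with mixed precision the weights and optimizer states stay in fp32 and only the forward/backward compute runs in bf16 under autocast, while `--full_bf16` casts the weights themselves to bf16, which is where the extra VRAM saving comes from:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()          # stand-in for the U-Net
optimizer = torch.optim.AdamW(model.parameters())
x = torch.randn(4, 1024, device="cuda")

# Mixed precision (roughly what --mixed_precision bf16 does):
# weights stay fp32, activations and matmuls run in bf16 inside autocast.
with torch.autocast("cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()

# Full bf16 (what --full_bf16 aims at): the weights themselves are bf16,
# halving the memory for parameters, gradients and (depending on the optimizer)
# optimizer state -- this is the part not yet wired up for SD 1.5 training.
model_bf16 = torch.nn.Linear(1024, 1024).to("cuda", dtype=torch.bfloat16)
```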
Thanks, I will test it. What is the difference between mixed precision and full bf16? Can you give some more info?
I am using the same settings in both cases, and how much VRAM SD 1.5 uses is insane.
The full config is below.
All training and reg images are 768x768.
However, SD 1.5 trains faster than SDXL. That is the only expected thing :)
SDXL uses 17 GB VRAM while SD 1.5 uses 22.5 GB, tested on RunPod Linux with no desktop GUI.