doem97 opened 1 year ago
Because one GPU needs to compute `gradient = (gradient_from_gpu_1 + gradient_from_gpu_2) / 2`, and materializing the intermediate sum `gradient_from_gpu_1 + gradient_from_gpu_2` takes a lot of extra VRAM.
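To illustrate why the reducing GPU needs the extra memory, here is a minimal NumPy sketch (not the repo's actual code) of single-device gradient averaging: the averaging device must buffer every replica's gradient at once before it can divide.

```python
import numpy as np

def average_gradients(grads):
    """Naive single-reducer averaging: one device gathers every
    replica's gradient before averaging, so it must hold all N
    copies in memory at once -- the VRAM spike described above."""
    stacked = np.stack(grads, axis=0)  # materializes every replica on the reducer
    return stacked.mean(axis=0)

# gradients of the same parameter tensor from two GPUs
g1 = np.array([1.0, 3.0])
g2 = np.array([3.0, 5.0])
print(average_gradients([g1, g2]))  # -> [2. 4.]
```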
Thanks @lllyasviel !
So basically the bottleneck is the GPU that holds the gradient averaging, while the remaining GPUs are fine, e.g., GPU0 requires 24G+24G while GPU1, GPU2, and GPU3 each require 24G.
Thus, we should leave GPU0 some headroom, e.g., run with GPU0 at 12G+12G and GPU1, GPU2, GPU3 at 12G each.
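A back-of-the-envelope budget for the headroom suggestion above. The 2x factor on the reducer is a simplifying assumption for illustration, not a measured number:

```python
def per_gpu_vram_gb(replica_gb, n_gpus, reducer=0):
    """Rough per-GPU memory estimate when one GPU also hosts the
    gradient averaging: assume the reducer buffers roughly one
    extra replica's worth of gradients on top of its own."""
    return [replica_gb * (2 if i == reducer else 1) for i in range(n_gpus)]

print(per_gpu_vram_gb(24, 4))  # -> [48, 24, 24, 24]  (GPU0 overflows a 24G card)
print(per_gpu_vram_gb(12, 4))  # -> [24, 12, 12, 12]  (fits, with GPU0 at 12G+12G)
```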
Sorry, I have another question.
I come from the recognition community. In recognition, multi-GPU training normally does not produce significantly different memory usage across GPUs. Does this "1-big-GPU" behavior only happen in Stable Diffusion / ControlNet?
Use the FSDP or DeepSpeed training strategy.
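For example, with HuggingFace Accelerate you can switch the launcher to DeepSpeed ZeRO through its config file, which shards optimizer state and gradients across GPUs instead of gathering them on one device. A sketch only; the field values are illustrative, so run `accelerate config` to generate one matching your setup:

```yaml
# accelerate config file (illustrative values)
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 2            # shard optimizer state + gradients across GPUs
  offload_optimizer_device: none
num_processes: 4
mixed_precision: fp16
```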
The HuggingFace Diffusers ControlNet training script (https://huggingface.co/docs/diffusers/training/controlnet) has several memory optimizations built in.
I can run the original `tutorial_train.py` on a single 3090 Ti GPU (24G) with batch_size 3. However, when I upgrade to 2 or more GPUs, it keeps raising OOM.
I am curious why. Why can a single GPU handle batch size 3 while multiple GPUs can only handle 1? The GPUs each hold their own batch in parallel, am I right?
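The imbalance described earlier in the thread can be avoided by reducing gradients in chunks, which is roughly what bucketed/ring all-reduce strategies do: no single device ever buffers every replica at once. A toy NumPy stand-in for that idea (a sketch, not how any framework actually implements it):

```python
import numpy as np

def chunked_mean(grads, n_chunks=4):
    """Average gradient replicas chunk by chunk: only one chunk's
    worth of temporaries exists at a time, a toy stand-in for the
    bucketed all-reduce that keeps memory balanced across GPUs."""
    parts = [np.array_split(g, n_chunks) for g in grads]
    averaged = [sum(chunk_group) / len(grads) for chunk_group in zip(*parts)]
    return np.concatenate(averaged)

g1 = np.arange(8, dtype=float)        # gradient replica from GPU 1
g2 = np.arange(8, dtype=float) * 3.0  # gradient replica from GPU 2
print(chunked_mean([g1, g2]))  # elementwise mean of the two replicas
```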