yeegnauh opened this issue 11 months ago
Hi, what's your batch size on each GPU? The micro-batch size is the unit passed between stages when using pipeline parallelism.
If your batch size is more than 1, I recommend lowering it, since that greatly reduces activation memory.
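To make the distinction concrete, here is a rough sketch (the exact `microbatch_size` argument name should be checked against the plugin in your ColossalAI version; the parallel degrees and dataset below are placeholders): the DataLoader batch size is the per-GPU batch size, while the plugin's micro-batch size is the unit that moves between pipeline stages.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from colossalai.booster.plugin import HybridParallelPlugin

# Dummy dataset standing in for the real tokenized corpus.
train_dataset = TensorDataset(torch.randint(0, 32000, (64, 2048)))

# Per-GPU batch size: keeping it at 1 minimizes activation memory.
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

# Placeholder parallel degrees; microbatch_size is the unit passed
# between pipeline stages when pp_size > 1.
plugin = HybridParallelPlugin(tp_size=1, pp_size=4, microbatch_size=1)
```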
Yes, I set batch_size=1 in the experiment. Do you have any recommendations for other configs?
If the OOM error happens before the training loop, initializing the model under LazyInitContext might solve the problem (for usage, refer to examples/language/llama2/pretrain.py).
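Roughly like this (adapted from memory of that example, so double-check the script for the exact usage; the model config is a placeholder):

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

from colossalai.lazy import LazyInitContext

# Placeholder config; substitute the real 65B config here.
config = LlamaConfig()

# Under LazyInitContext the parameters are not materialized on construction,
# so the full model never has to fit on one device before booster.boost()
# shards it across GPUs with the chosen plugin.
with LazyInitContext(default_device=torch.device("cuda")):
    model = LlamaForCausalLM(config)
```

Note that lazy initialization only helps with the OOM before the training loop; it does not change memory use during the training steps themselves.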
If the OOM happens during training, two optimizations come to mind:

1. Set the offload_optim_frac argument to a value between 0 and 1 (the smallest value that avoids OOM) for GeminiPlugin, or set the cpu_offload argument to True for LowLevelZeroPlugin or HybridParallelPlugin. These do a similar thing: they offload optimizer states to CPU memory to avoid OOM on the GPU.
2. Set enable_flash_attention to True for GeminiPlugin and HybridParallelPlugin, since flash attention not only accelerates training but also saves GPU memory.

A rough sketch of both settings is shown after this list.
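For illustration only (the argument names offload_optim_frac, cpu_offload, and enable_flash_attention come from the advice above; the parallel degrees, zero_stage, and concrete values are assumptions to be tuned for your cluster), the two routes could look roughly like this:

```python
from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin

# Route 1a: Gemini with partial optimizer-state offload plus flash attention.
# offload_optim_frac=0.5 is only a starting point; use the smallest value
# that avoids OOM, since more offloading slows training.
gemini_plugin = GeminiPlugin(
    offload_optim_frac=0.5,
    enable_flash_attention=True,
)

# Route 1b/2: hybrid parallel with optimizer offload and flash attention.
# tp_size / pp_size / microbatch_size are placeholders; cpu_offload is
# assumed to take effect together with a ZeRO stage. LowLevelZeroPlugin
# accepts cpu_offload=True in the same spirit.
hybrid_plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    microbatch_size=1,
    zero_stage=1,
    cpu_offload=True,
    enable_flash_attention=True,
)
```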
🐛 Describe the bug
I have tried the gemini / gemini_auto / zero2 / hybrid_parallel plugins and still got OOM errors.
With the hybrid_parallel plugin, I tried the following configs:
Has anybody managed to train LLaMA 65B successfully?
Environment
torch 1.13.1 + cu117, Python 3.10