yeegnauh opened this issue 11 months ago
Hi, what's your batch size on each GPU? The micro-batch size is the unit passed between stages when using pipeline parallelism.
If your batch size is more than 1, I recommend lowering it, since that greatly reduces activation memory.
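To make the distinction concrete, here is a rough sketch (the exact `microbatch_size` argument name should be checked against the plugin in your ColossalAI version; the parallel degrees and dataset below are placeholders): the DataLoader batch size is the per-GPU batch size, while the plugin's micro-batch size is the unit that moves between pipeline stages.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

from colossalai.booster.plugin import HybridParallelPlugin

# Dummy dataset standing in for the real tokenized corpus.
train_dataset = TensorDataset(torch.randint(0, 32000, (64, 2048)))

# Per-GPU batch size: keeping it at 1 minimizes activation memory.
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

# Placeholder parallel degrees; microbatch_size is the unit passed
# between pipeline stages when pp_size > 1.
plugin = HybridParallelPlugin(tp_size=1, pp_size=4, microbatch_size=1)
```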
Yes, I set batch_size=1 in the experiment. Do you have any recommendations for other configs?
If the OOM error happens before the training loop, initializing the model under LazyInitContext might solve the problem (for usage, refer to examples/language/llama2/pretrain.py).
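Roughly like this (adapted from memory of that example, so double-check the script for the exact usage; the model config is a placeholder):

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

from colossalai.lazy import LazyInitContext

# Placeholder config; substitute the real 65B config here.
config = LlamaConfig()

# Under LazyInitContext the parameters are not materialized on construction,
# so the full model never has to fit on one device before booster.boost()
# shards it across GPUs with the chosen plugin.
with LazyInitContext(default_device=torch.device("cuda")):
    model = LlamaForCausalLM(config)
```

Note that lazy initialization only helps with the OOM before the training loop; it does not change memory use during the training steps themselves.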
If the OOM happens during training, two optimizations come to mind:

1. Set the offload_optim_frac argument to a value between 0 and 1 (the smallest value that avoids OOM) for GeminiPlugin, or set the cpu_offload argument to True for LowLevelZeroPlugin or HybridParallelPlugin. These do a similar thing: they offload optimizer states to CPU memory to avoid OOM on the GPU.
2. Set enable_flash_attention to True for GeminiPlugin and HybridParallelPlugin, since flash attention not only accelerates training but also saves GPU memory.

A rough sketch of both settings is shown after this list.
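For illustration only (the argument names offload_optim_frac, cpu_offload, and enable_flash_attention come from the advice above; the parallel degrees, zero_stage, and concrete values are assumptions to be tuned for your cluster), the two routes could look roughly like this:

```python
from colossalai.booster.plugin import GeminiPlugin, HybridParallelPlugin

# Route 1a: Gemini with partial optimizer-state offload plus flash attention.
# offload_optim_frac=0.5 is only a starting point; use the smallest value
# that avoids OOM, since more offloading slows training.
gemini_plugin = GeminiPlugin(
    offload_optim_frac=0.5,
    enable_flash_attention=True,
)

# Route 1b/2: hybrid parallel with optimizer offload and flash attention.
# tp_size / pp_size / microbatch_size are placeholders; cpu_offload is
# assumed to take effect together with a ZeRO stage. LowLevelZeroPlugin
# accepts cpu_offload=True in the same spirit.
hybrid_plugin = HybridParallelPlugin(
    tp_size=2,
    pp_size=2,
    microbatch_size=1,
    zero_stage=1,
    cpu_offload=True,
    enable_flash_attention=True,
)
```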
🐛 Describe the bug
I have tried the gemini / gemini_auto / zero2 / hybrid_parallel plugins and still got OOM errors.
With the hybrid_parallel plugin, I tried the following configs:
Has anybody managed to train LLaMA 65B successfully?
Environment
torch 1.13.1 + cu117, Python 3.10