hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: How to run llama2 70B pretrain on 32gpus? I got OOM error on almost every plugin and config. #5139

Open yeegnauh opened 11 months ago

yeegnauh commented 11 months ago

🐛 Describe the bug

I have tried the gemini / gemini_auto / zero2 / hybrid_parallel plugins and still got OOM errors.
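
For reference, a minimal sketch of how these plugins are typically selected through the Booster API; the constructor arguments shown here are assumptions and may differ across ColossalAI versions:

```python
# Minimal sketch (not the exact pretrain script): choosing a memory-saving plugin.
# Constructor arguments are assumptions and may vary across ColossalAI versions.
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin, LowLevelZeroPlugin

plugin_name = "gemini_auto"  # one of: "gemini", "gemini_auto", "zero2"

if plugin_name == "gemini":
    # Gemini with a fixed placement policy
    plugin = GeminiPlugin(precision="fp16", placement_policy="static")
elif plugin_name == "gemini_auto":
    # Gemini that moves tensors between CPU and GPU automatically
    plugin = GeminiPlugin(precision="fp16", placement_policy="auto")
else:
    # ZeRO stage 2: optimizer states and gradients are sharded across ranks
    plugin = LowLevelZeroPlugin(stage=2, precision="fp16")

booster = Booster(plugin=plugin)
# model, optimizer, _, dataloader, lr_scheduler = booster.boost(
#     model, optimizer, dataloader=dataloader, lr_scheduler=lr_scheduler)
```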

With the hybrid_parallel plugin, I tried configs such as the following (sketched in code after the list):

  1. tp=8, pp=1, zero=2, microbatch_size=1, precision="fp16"
  2. tp=4, pp=2, zero=1, microbatch_size=1, etc.
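
A rough sketch of how these two configs would map onto the HybridParallelPlugin constructor; the argument names are assumptions and may differ between releases:

```python
# Sketch of the two hybrid_parallel configs listed above.
# Argument names are assumptions and may differ between ColossalAI releases.
from colossalai.booster.plugin import HybridParallelPlugin

# Config 1: tp=8, pp=1, zero=2, microbatch_size=1, fp16
plugin_1 = HybridParallelPlugin(
    tp_size=8, pp_size=1, zero_stage=2,
    microbatch_size=1, precision="fp16",
)

# Config 2: tp=4, pp=2, zero=1, microbatch_size=1
plugin_2 = HybridParallelPlugin(
    tp_size=4, pp_size=2, zero_stage=1,
    microbatch_size=1, precision="fp16",
)
```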

Has anybody managed to train LLaMA 65B successfully?

Environment

torch 1.13.1 + cu117, Python 3.10

Fridge003 commented 11 months ago

Hi, what's your batch size on each GPU? The microbatch size is only the unit that is passed through the pipeline stages when pipeline parallelism is used.

If your batch size is more than 1, I recommend lowering it, since that greatly reduces activation memory.
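
To make the relationship concrete, a purely illustrative sketch of how the per-GPU batch is split into pipeline microbatches (the numbers are made up):

```python
# Illustrative only: how the per-GPU batch is split into pipeline microbatches.
batch_size = 4        # samples processed per step on each data-parallel rank
microbatch_size = 1   # the unit pushed through the pipeline stages at a time
num_microbatches = batch_size // microbatch_size

print(f"{num_microbatches} microbatches per training step")
# Fewer samples per step means fewer activations held for the backward pass,
# which is why lowering batch_size reduces peak activation memory.
```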

yeegnauh commented 11 months ago

> Hi, what's your batch size on each GPU? The microbatch size is only the unit that is passed through the pipeline stages when pipeline parallelism is used.
>
> If your batch size is more than 1, I recommend lowering it, since that greatly reduces activation memory.

Yes, I set batch_size=1 in the experiment. Do you have any recommendations for other configs?

Fridge003 commented 11 months ago

If the OOM error happens before the training loop, initializing the model under LazyInitContext might solve the problem (for usage, see examples/language/llama2/pretrain.py).
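
A minimal sketch of lazy initialization, loosely following the pattern in examples/language/llama2/pretrain.py; the import path, context arguments, and model sizes below are assumptions and may differ across versions:

```python
# Minimal sketch of lazy model initialization (an assumption-based example,
# not the exact pretrain script). Paths and arguments may differ by version.
from colossalai.lazy import LazyInitContext
from transformers import LlamaConfig, LlamaForCausalLM

# Rough 70B-class configuration; the sizes are illustrative.
config = LlamaConfig(
    hidden_size=8192,
    intermediate_size=28672,
    num_hidden_layers=80,
    num_attention_heads=64,
    num_key_value_heads=8,
)

# Under LazyInitContext, parameters are recorded lazily instead of being
# materialized on every rank, so the full model never has to fit in a single
# process's memory at construction time.
with LazyInitContext():
    model = LlamaForCausalLM(config)

# The lazy parameters are materialized (and sharded by the chosen plugin)
# later, when the model is passed to booster.boost(...).
```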

If the OOM happens during training, two optimization methods come to mind: