hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

With 4 × 22919 MiB of GPU memory, torchrun --standalone --nproc_per_node 4 benchmark_gpt_dummy.py --model m --strategy ddp --experience_batch_size 1 --train_batch_size 1 runs out of memory (OOM). #2751

Closed ct1976 closed 1 year ago

ct1976 commented 1 year ago

🐛 Describe the bug

Relevant logs:

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Actor: 338.39 M Critic: 338.39 M Initial model: 338.39 M Reward model: 338.39 M

Train epoch [1/3]:  50%|█████     | 1/2 [00:01<00:01, 1.68s/it, actor_loss=0.0324, critic_loss=0.00114]
Episode [1/3]:  88%|████████▊ | 7/8 [01:18<00:11, 11.27s/it]
Traceback (most recent call last):
  File "/dev/ml/ColossalAI/applications/ChatGPT/benchmarks/benchmark_gpt_dummy.py", line 180, in <module>
    main(args)
  File "/dev/ml/ColossalAI/applications/ChatGPT/benchmarks/benchmark_gpt_dummy.py", line 156, in main
    trainer.fit(random_prompts,
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/chatgpt/trainer/base.py", line 118, in fit
    self._learn()
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/chatgpt/trainer/base.py", line 94, in _learn
    metrics = self.training_step(experience)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/chatgpt/trainer/ppo.py", line 86, in training_step
    action_log_probs = self.actor(experience.sequences, num_actions, attention_mask=experience.attention_mask)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/chatgpt/nn/actor.py", line 59, in forward
    output = self.model(sequences, attention_mask=attention_mask)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1065, in forward
    lm_logits = self.lm_head(hidden_states)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/nn/functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 0; 22.38 GiB total capacity; 21.17 GiB already allocated; 23.94 MiB free; 21.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The other three ranks fail with the same traceback and hit OOM at the same point:

RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 1; 22.38 GiB total capacity; 21.17 GiB already allocated; 35.94 MiB free; 21.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 2; 22.38 GiB total capacity; 21.17 GiB already allocated; 27.94 MiB free; 21.42 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
RuntimeError: CUDA out of memory. Tried to allocate 100.00 MiB (GPU 3; 22.38 GiB total capacity; 21.17 GiB already allocated; 39.94 MiB free; 21.43 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12129) of binary: /dev/ml/anaconda3/envs/py39/bin/python3.9
Traceback (most recent call last):
  File "/dev/ml/anaconda3/envs/py39/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/dev/ml/anaconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

benchmark_gpt_dummy.py FAILED
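The allocator hint in the traceback points at PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of that workaround, assuming the same command line (the max_split_size_mb value below is only an example; with ~21 GiB already allocated per GPU it may ease fragmentation but will not necessarily avoid the OOM):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
torchrun --standalone --nproc_per_node 4 benchmark_gpt_dummy.py --model m --strategy ddp --experience_batch_size 1 --train_batch_size 1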

Environment

Thu Feb 16 15:52:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.102.04   Driver Version: 450.102.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:5A:00.0 Off |                    0 |
| N/A   30C    P8     9W / 250W |      2MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P40           Off  | 00000000:5E:00.0 Off |                    0 |
| N/A   25C    P8     9W / 250W |      2MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P40           Off  | 00000000:62:00.0 Off |                    0 |
| N/A   27C    P8    10W / 250W |      2MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P40           Off  | 00000000:66:00.0 Off |                    0 |
| N/A   27C    P8    10W / 250W |      2MiB / 22919MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


ht-zhou commented 1 year ago

Thanks for your feedback. You are using the DDP strategy, which is naive and costs much more GPU memory. You can try

torchrun --standalone --nproc_per_node 4 benchmark_gpt_dummy.py --model m --strategy colossalai_zero2 --experience_batch_size 1 --train_batch_size 1

or

torchrun --standalone --nproc_per_node 4 benchmark_gpt_dummy.py --model m --strategy colossalai_gemini --experience_batch_size 1 --train_batch_size 1

You should see a significant improvement in GPU memory usage.
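To confirm the saving when switching strategies, per-GPU memory can be polled in a second terminal while the benchmark runs (plain nvidia-smi usage; the one-second interval is just an example):

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1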