microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
34.93k stars · 4.06k forks

[BUG] 1: error: must run as root and 2: raise RuntimeError("Ninja is required to load C++ extensions") #5627

Open YangBrooksHan opened 3 months ago

YangBrooksHan commented 3 months ago

Describe the bug
Encountered the following errors while training a large language model with DeepSpeed on multiple nodes:

```
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
172.27.221.56: die: error: must run as root
172.27.221.56: [rank5]: Traceback (most recent call last):
172.27.221.56: [rank5]:   File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 88, in <module>
172.27.221.56: [rank5]:     main()
172.27.221.56: [rank5]:   File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 64, in main
172.27.221.56: [rank5]:     optimizer = FusedAdam(
172.27.221.56: [rank5]:                 ^^^^^^^^^^
172.27.221.56: [rank5]:   File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
172.27.221.56: [rank5]:     fused_adam_cuda = FusedAdamBuilder().load()
172.27.221.56: [rank5]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]:   File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in load
172.27.221.56: [rank5]:     return self.jit_load(verbose)
172.27.221.56: [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]:   File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 524, in jit_load
172.27.221.56: [rank5]:     op_module = load(name=self.name,
172.27.221.56: [rank5]:                 ^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]:   File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1309, in load
172.27.221.56: [rank5]:     return _jit_compile(
172.27.221.56: [rank5]:            ^^^^^^^^^^^^^
172.27.221.56: [rank5]:   File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
172.27.221.56: [rank5]:     _write_ninja_file_and_build_library(
172.27.221.56: [rank5]:   File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1802, in _write_ninja_file_and_build_library
172.27.221.56: [rank5]:     verify_ninja_availability()
172.27.221.56: [rank5]:   File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1853, in verify_ninja_availability
172.27.221.56: [rank5]:     raise RuntimeError("Ninja is required to load C++ extensions")
172.27.221.56: [rank5]: RuntimeError: Ninja is required to load C++ extensions
172.27.221.62: Detected CUDA files, patching ldflags
172.27.221.62: Emitting ninja build file /home/hy/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
172.27.221.62: /home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
172.27.221.62: If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
172.27.221.62:   warnings.warn(
172.27.221.62: Building extension module fused_adam...
172.27.221.62: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
172.27.221.62: ninja: no work to do.
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Time to load fused_adam op: 0.041127920150756836 seconds
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Time to load fused_adam op: 0.10191988945007324 seconds
172.27.221.62: Time to load fused_adam op: 0.1019446849822998 seconds
172.27.221.62: Time to load fused_adam op: 0.10191512107849121 seconds
172.27.221.62: Time to load fused_adam op: 0.10192584991455078 seconds
172.27.221.62: Time to load fused_adam op: 0.10190796852111816 seconds
172.27.221.62: Time to load fused_adam op: 0.10191154479980469 seconds
172.27.221.62: Time to load fused_adam op: 0.1019294261932373 seconds
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Loading extension module fused_adam...
172.27.221.56: Time to load fused_adam op: 0.10198068618774414 seconds
172.27.221.56: Time to load fused_adam op: 0.10190844535827637 seconds
172.27.221.56: Time to load fused_adam op: 0.10196852684020996 seconds
172.27.221.56: Time to load fused_adam op: 0.10200953483581543 seconds
172.27.221.56: Time to load fused_adam op: 0.10203671455383301 seconds
172.27.221.56: Time to load fused_adam op: 0.10205721855163574 seconds
172.27.221.56: Time to load fused_adam op: 0.10202836990356445 seconds
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
172.27.221.56: node56:56374:57122 [5] NCCL INFO [Service thread] Connection closed by localRank 5
172.27.221.56: node56:56374:57169 [0] NCCL INFO comm 0xa198190 rank 5 nranks 16 cudaDev 5 busId 8f000 - Abort COMPLETE
172.27.221.56: [2024-06-07 22:12:59,286] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56369
172.27.221.56: [2024-06-07 22:12:59,422] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56370
172.27.221.56: [2024-06-07 22:12:59,546] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56371
172.27.221.56: [2024-06-07 22:12:59,674] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56372
172.27.221.56: [2024-06-07 22:12:59,802] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56373
172.27.221.56: [2024-06-07 22:12:59,928] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56374
172.27.221.56: [2024-06-07 22:12:59,929] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56375
172.27.221.56: [2024-06-07 22:13:00,055] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56376
172.27.221.56: [2024-06-07 22:13:00,183] [ERROR] [launch.py:325:sigkill_handler] ['/home/hy/anaconda3/envs/algmnode1/bin/python', '-u', 'test.py', '--local_rank=7'] exits with return code = 1
```
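The traceback shows `verify_ninja_availability()` raising on node 172.27.221.56 while node 172.27.221.62 builds `fused_adam` successfully, which usually means the `ninja` binary is missing from `PATH` in the Python environment the launcher spawns on the failing node. As a minimal preflight sketch (the `ninja_available` helper is hypothetical, not part of DeepSpeed), one could run something like this on each node before launching:

```python
# Preflight check for the "Ninja is required to load C++ extensions" error:
# DeepSpeed's JIT op builders call torch.utils.cpp_extension, which needs
# the `ninja` build tool to be discoverable on PATH in the launch environment.
import shutil

def ninja_available() -> bool:
    """Return True if the ninja binary is discoverable on PATH."""
    return shutil.which("ninja") is not None

if __name__ == "__main__":
    path = shutil.which("ninja")
    if path is not None:
        print(f"ninja found at {path}")
    else:
        print("ninja missing - try: pip install ninja (in this same environment)")
```

If the check fails on any node, installing ninja into the same conda environment on that node (e.g. `pip install ninja`) should clear the second error; the `TORCH_CUDA_ARCH_LIST` warning in the log can be silenced separately by setting that environment variable before launch, as the warning itself suggests.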