DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
[BUG] 1: error: must run as root and 2: raise RuntimeError("Ninja is required to load C++ extensions") #5627
Open
YangBrooksHan opened 3 months ago
Describe the bug
Encountered the following errors while training a large language model with DeepSpeed across multiple nodes (172.27.221.56 and 172.27.221.62):
172.27.221.56: Using /home/hy/.cache/torch_extensions/py311_cu121 as PyTorch extensions root... (repeated 8 times, once per local rank)
172.27.221.56: die: error: must run as root
172.27.221.56: [rank5]: Traceback (most recent call last):
172.27.221.56: [rank5]: File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 88, in <module>
172.27.221.56: [rank5]: main()
172.27.221.56: [rank5]: File "/data/hy_workspace/mSR_conda/safe-rlhf/test.py", line 64, in main
172.27.221.56: [rank5]: optimizer = FusedAdam(
172.27.221.56: [rank5]: ^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/adam/fused_adam.py", line 94, in __init__
172.27.221.56: [rank5]: fused_adam_cuda = FusedAdamBuilder().load()
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in load
172.27.221.56: [rank5]: return self.jit_load(verbose)
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/deepspeed/ops/op_builder/builder.py", line 524, in jit_load
172.27.221.56: [rank5]: op_module = load(name=self.name,
172.27.221.56: [rank5]: ^^^^^^^^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1309, in load
172.27.221.56: [rank5]: return _jit_compile(
172.27.221.56: [rank5]: ^^^^^^^^^^^^^
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1719, in _jit_compile
172.27.221.56: [rank5]: _write_ninja_file_and_build_library(
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1802, in _write_ninja_file_and_build_library
172.27.221.56: [rank5]: verify_ninja_availability()
172.27.221.56: [rank5]: File "/home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1853, in verify_ninja_availability
172.27.221.56: [rank5]: raise RuntimeError("Ninja is required to load C++ extensions")
172.27.221.56: [rank5]: RuntimeError: Ninja is required to load C++ extensions
172.27.221.62: Detected CUDA files, patching ldflags
172.27.221.62: Emitting ninja build file /home/hy/.cache/torch_extensions/py311_cu121/fused_adam/build.ninja...
172.27.221.62: /home/hy/anaconda3/envs/algmnode1/lib/python3.11/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
172.27.221.62: If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
172.27.221.62: warnings.warn(
172.27.221.62: Building extension module fused_adam...
172.27.221.62: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
172.27.221.62: ninja: no work to do.
172.27.221.62: Loading extension module fused_adam...
172.27.221.62: Time to load fused_adam op: 0.041127920150756836 seconds
172.27.221.62: Loading extension module fused_adam... (repeated 7 times, once per remaining rank)
172.27.221.62: Time to load fused_adam op: ~0.102 seconds (repeated 7 times)
172.27.221.56: Loading extension module fused_adam... (repeated 7 times)
172.27.221.56: Time to load fused_adam op: ~0.102 seconds (repeated 7 times)
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.14.2, git-hash=unknown, git-branch=unknown
172.27.221.56: [2024-06-07 22:12:57,940] [INFO] [comm.py:662:init_distributed] Distributed backend already initialized
172.27.221.56: node56:56374:57122 [5] NCCL INFO [Service thread] Connection closed by localRank 5
172.27.221.56: node56:56374:57169 [0] NCCL INFO comm 0xa198190 rank 5 nranks 16 cudaDev 5 busId 8f000 - Abort COMPLETE
172.27.221.56: [2024-06-07 22:12:59,286] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56369
172.27.221.56: [2024-06-07 22:12:59,422] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56370
172.27.221.56: [2024-06-07 22:12:59,546] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56371
172.27.221.56: [2024-06-07 22:12:59,674] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56372
172.27.221.56: [2024-06-07 22:12:59,802] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56373
172.27.221.56: [2024-06-07 22:12:59,928] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56374
172.27.221.56: [2024-06-07 22:12:59,929] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56375
172.27.221.56: [2024-06-07 22:13:00,055] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 56376
172.27.221.56: [2024-06-07 22:13:00,183] [ERROR] [launch.py:325:sigkill_handler] ['/home/hy/anaconda3/envs/algmnode1/bin/python', '-u', 'test.py', '--local_rank=7'] exits with return code = 1
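For context on the second error: rank 5 on node 172.27.221.56 fails because DeepSpeed JIT-compiles the `fused_adam` op through `torch.utils.cpp_extension.load`, which raises `RuntimeError: Ninja is required to load C++ extensions` when no runnable `ninja` binary is on `PATH`. A minimal preflight sketch, run on every node before launching, can catch the mismatch early (the `ninja_available` helper below is hypothetical, not part of DeepSpeed or PyTorch):

```python
import shutil
import subprocess


def ninja_available() -> bool:
    """Return True if a runnable `ninja` binary is on PATH.

    Approximates the check torch.utils.cpp_extension performs before
    JIT-building extensions such as DeepSpeed's fused_adam.
    """
    exe = shutil.which("ninja")
    if exe is None:
        return False
    try:
        # Confirm the binary actually executes, not just that it exists.
        subprocess.run([exe, "--version"], capture_output=True, check=True)
        return True
    except (OSError, subprocess.CalledProcessError):
        return False


if __name__ == "__main__":
    print("ninja available:", ninja_available())
```

If the check fails on any node, installing ninja into the same conda environment there (`pip install ninja`) should let the JIT build proceed; prebuilding the op at install time (e.g. `DS_BUILD_FUSED_ADAM=1 pip install deepspeed`, if your DeepSpeed version supports the `DS_BUILD_*` flags) avoids the JIT step entirely.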