microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] DeepSpeed inference error on 2 nodes. #2931

Open lambda7xx opened 1 year ago

lambda7xx commented 1 year ago

torchrun --node_rank=1 --nnodes 2 --nproc_per_node=8 --master_addr xxxxx --master_port 6000 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom > 1.log 2>&1
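For clarity, the command above is the one for the second node. With a static two-node launch like this, torchrun is started once on each node with the same arguments except `--node_rank`. A rough sketch of that in Python is below; the helper name `build_torchrun_cmd` and the `MASTER_ADDR` placeholder are made up for illustration, and only the flags already shown in the command above are used.

```python
import shlex


def build_torchrun_cmd(node_rank, nnodes=2, nproc_per_node=8,
                       master_addr="MASTER_ADDR", master_port=6000,
                       script="bloom-inference-scripts/bloom-ds-inference.py",
                       script_args=("--name", "bigscience/bloom")):
    """Return the torchrun argv for one node (hypothetical helper).

    Only node_rank differs between nodes; every other argument must match.
    """
    return [
        "torchrun",
        f"--node_rank={node_rank}",
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(nproc_per_node),
        "--master_addr", master_addr,   # placeholder, replace with the rank-0 host
        "--master_port", str(master_port),
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Print the command to run on node 0 and on node 1.
    for rank in range(2):
        print(shlex.join(build_torchrun_cmd(rank)))
```

Running the script prints one command per node, which makes it easy to check that both nodes agree on `--nnodes`, `--master_addr` and `--master_port`.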

I am not using the deepspeed command below because I forgot my password.

deepspeed --num_gpus 8 --num_nodes 2 --master_addr XXXX --hostfile hostfile bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom > 1.log 2>&1
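For context, the file passed via `--hostfile` is a plain text file with one `hostname slots=N` line per node, and the deepspeed launcher starts the processes on the other nodes over SSH (the DeepSpeed docs recommend passwordless SSH), which is presumably why a forgotten password gets in the way. The sketch below parses such a file. The host names `node-0` and `node-1` and the helper `parse_hostfile` are placeholders, and treating `#` as a comment marker is an assumption of this sketch, not something taken from the launcher.

```python
def parse_hostfile(text):
    """Parse 'hostname slots=N' lines into a {hostname: slots} dict.

    Hypothetical helper for illustration; it is not DeepSpeed's own parser.
    """
    hosts = {}
    for raw in text.splitlines():
        # Assumption of this sketch: '#' starts a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, slots = line.split()
        key, _, value = slots.partition("=")
        if key != "slots" or not value.isdigit():
            raise ValueError(f"bad hostfile line: {raw!r}")
        hosts[name] = int(value)
    return hosts


# Placeholder host names for a two-node, 8-GPU-per-node setup.
EXAMPLE = """\
node-0 slots=8
node-1 slots=8
"""

hosts = parse_hostfile(EXAMPLE)
print(hosts, "total GPUs:", sum(hosts.values()))
```

The total slot count should match the world size seen in the logs (`nranks 16` for two nodes with 8 GPUs each).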


Below is my error log.
- Node 0

WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

[2023-03-03 02:21:44,523] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

*** Loading the model bigscience/bloom

Fetching 197 files: 0%| | 0/197 [00:00<?, ?it/s] Fetching 197 files: 100%|██████████| 197/197 [00:00<00:00, 4156.77it/s] PHLRR4036:24139:24139 [0] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.706248] [PHLRR4036:24139:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24139:24139 [0] NCCL INFO cudaDriverVersion 11080 NCCL version 2.14.3+cuda11.7 PHLRR4036:24141:24141 [2] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24150:24150 [7] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24148:24148 [6] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24146:24146 [5] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24140:24140 [1] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24142:24142 [3] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24139:24253 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24139:24253 [0] NCCL INFO P2P plugin IBext PHLRR4036:24144:24144 [4] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24141:24141 [2] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24142:24142 [3] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24150:24150 [7] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24140:24140 [1] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24146:24146 [5] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24148:24148 [6] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 
PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.720920] [PHLRR4036:24141:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.723766] [PHLRR4036:24142:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24141:24258 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so [1677810113.723912] [PHLRR4036:24150:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24141:24258 [2] NCCL INFO P2P plugin IBext PHLRR4036:24144:24144 [4] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 
PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.725985] [PHLRR4036:24140:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.726464] [PHLRR4036:24148:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device [1677810113.726557] [PHLRR4036:24146:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24142:24263 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24142:24263 [3] NCCL INFO P2P plugin IBext PHLRR4036:24150:24264 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24150:24264 [7] NCCL INFO P2P plugin IBext PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 
PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.729624] [PHLRR4036:24144:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24139:24253 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24140:24267 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24140:24267 [1] NCCL INFO P2P plugin IBext PHLRR4036:24139:24253 [0] NCCL INFO Using network IBext PHLRR4036:24146:24268 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24146:24268 [5] NCCL INFO P2P plugin IBext PHLRR4036:24148:24269 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24148:24269 [6] NCCL INFO P2P plugin IBext PHLRR4036:24144:24272 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24144:24272 [4] NCCL INFO P2P plugin IBext PHLRR4036:24141:24258 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24141:24258 [2] NCCL INFO Using network IBext PHLRR4036:24142:24263 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24142:24263 [3] NCCL INFO Using network IBext PHLRR4036:24150:24264 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24150:24264 [7] NCCL INFO Using network IBext PHLRR4036:24146:24268 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24146:24268 [5] NCCL INFO Using network IBext PHLRR4036:24148:24269 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24148:24269 [6] NCCL INFO Using network IBext PHLRR4036:24140:24267 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24140:24267 [1] NCCL INFO Using 
network IBext PHLRR4036:24144:24272 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24144:24272 [4] NCCL INFO Using network IBext PHLRR4036:24148:24269 [6] NCCL INFO Setting affinity for GPU 6 to 0fe03f80 PHLRR4036:24141:24258 [2] NCCL INFO Setting affinity for GPU 2 to 1fc07f PHLRR4036:24146:24268 [5] NCCL INFO Setting affinity for GPU 5 to 0fe03f80 PHLRR4036:24139:24253 [0] NCCL INFO Setting affinity for GPU 0 to 1fc07f PHLRR4036:24142:24263 [3] NCCL INFO Setting affinity for GPU 3 to 1fc07f PHLRR4036:24140:24267 [1] NCCL INFO Setting affinity for GPU 1 to 1fc07f PHLRR4036:24150:24264 [7] NCCL INFO Setting affinity for GPU 7 to 0fe03f80 PHLRR4036:24144:24272 [4] NCCL INFO Setting affinity for GPU 4 to 0fe03f80 PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 PHLRR4036:24141:24258 [2] NCCL INFO Trees [0] 3/10/-1->2->-1 [1] 3/-1/-1->2->10 PHLRR4036:24140:24267 [1] NCCL INFO Trees [0] 4/-1/-1->1->0 [1] 4/-1/-1->1->0 PHLRR4036:24150:24264 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 PHLRR4036:24148:24269 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 PHLRR4036:24146:24268 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 PHLRR4036:24142:24263 [3] NCCL INFO Trees [0] 0/-1/-1->3->2 [1] 0/-1/-1->3->2 PHLRR4036:24144:24272 [4] NCCL INFO Trees [0] 5/-1/-1->4->1 [1] 5/-1/-1->4->1 PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 PHLRR4036:24139:24253 [0] NCCL INFO Trees [0] 1/-1/-1->0->3 [1] 1/-1/-1->0->3 PHLRR4036:24144:24272 [4] NCCL INFO Channel 00/0 : 3[13000] -> 4[83000] [receive] via NET/IBext/1/GDRDMA PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/0 : 15[91000] -> 0[5000] [receive] via NET/IBext/0 PHLRR4036:24140:24267 [1] NCCL INFO Channel 00 : 1[8000] -> 2[d000] via SHM/direct/direct PHLRR4036:24146:24268 [5] NCCL INFO Channel 00 : 5[89000] -> 6[8e000] via SHM/direct/direct PHLRR4036:24140:24267 [1] NCCL 
INFO Channel 01 : 1[8000] -> 2[d000] via SHM/direct/direct PHLRR4036:24146:24268 [5] NCCL INFO Channel 01 : 5[89000] -> 6[8e000] via SHM/direct/direct PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 2[d000] -> 3[13000] via P2P/IPC PHLRR4036:24148:24269 [6] NCCL INFO Channel 00/0 : 6[8e000] -> 7[91000] via P2P/IPC PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 2[d000] -> 3[13000] via P2P/IPC PHLRR4036:24148:24269 [6] NCCL INFO Channel 01/0 : 6[8e000] -> 7[91000] via P2P/IPC PHLRR4036:24150:24264 [7] NCCL INFO Channel 00/0 : 7[91000] -> 8[5000] [send] via NET/IBext/1 PHLRR4036:24142:24263 [3] NCCL INFO Channel 00/0 : 3[13000] -> 4[83000] [send] via NET/IBext/1 PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/0 : 15[91000] -> 0[5000] [receive] via NET/IBext/0 PHLRR4036:24144:24272 [4] NCCL INFO Channel 01/0 : 3[13000] -> 4[83000] [receive] via NET/IBext/1/GDRDMA PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/0 : 0[5000] -> 1[8000] via P2P/IPC PHLRR4036:24144:24272 [4] NCCL INFO Channel 00/0 : 4[83000] -> 5[89000] via P2P/IPC PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/0 : 0[5000] -> 1[8000] via P2P/IPC PHLRR4036:24144:24272 [4] NCCL INFO Channel 01/0 : 4[83000] -> 5[89000] via P2P/IPC PHLRR4036:24140:24267 [1] NCCL INFO Connected all rings PHLRR4036:24140:24267 [1] NCCL INFO Channel 00 : 1[8000] -> 4[83000] via SHM/direct/direct PHLRR4036:24140:24267 [1] NCCL INFO Channel 01 : 1[8000] -> 4[83000] via SHM/direct/direct PHLRR4036:24146:24268 [5] NCCL INFO Connected all rings PHLRR4036:24146:24268 [5] NCCL INFO Channel 00/0 : 5[89000] -> 4[83000] via P2P/IPC PHLRR4036:24146:24268 [5] NCCL INFO Channel 01/0 : 5[89000] -> 4[83000] via P2P/IPC PHLRR4036:24150:24264 [7] NCCL INFO Channel 01/0 : 7[91000] -> 8[5000] [send] via NET/IBext/1 PHLRR4036:24139:24287 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300. 
PHLRR4036:24142:24263 [3] NCCL INFO Channel 01/0 : 3[13000] -> 4[83000] [send] via NET/IBext/1 PHLRR4036:24148:24269 [6] NCCL INFO Connected all rings PHLRR4036:24148:24269 [6] NCCL INFO Channel 00 : 6[8e000] -> 5[89000] via SHM/direct/direct PHLRR4036:24148:24269 [6] NCCL INFO Channel 01 : 6[8e000] -> 5[89000] via SHM/direct/direct PHLRR4036:24141:24258 [2] NCCL INFO Connected all rings PHLRR4036:24144:24291 [4] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300. PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 10[d000] -> 2[d000] [receive] via NET/IBext/0/GDRDMA PHLRR4036:24139:24253 [0] NCCL INFO Connected all rings PHLRR4036:24150:24264 [7] NCCL INFO Connected all rings PHLRR4036:24150:24264 [7] NCCL INFO Channel 00/0 : 7[91000] -> 6[8e000] via P2P/IPC PHLRR4036:24139:24253 [0] NCCL INFO Channel 00 : 0[5000] -> 3[13000] via SHM/direct/direct PHLRR4036:24142:24263 [3] NCCL INFO Connected all rings PHLRR4036:24139:24253 [0] NCCL INFO Channel 01 : 0[5000] -> 3[13000] via SHM/direct/direct PHLRR4036:24150:24264 [7] NCCL INFO Channel 01/0 : 7[91000] -> 6[8e000] via P2P/IPC PHLRR4036:24150:24264 [7] NCCL INFO Connected all trees PHLRR4036:24150:24264 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24150:24264 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24148:24269 [6] NCCL INFO Connected all trees PHLRR4036:24148:24269 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24148:24269 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 10[d000] -> 2[d000] [receive] via NET/IBext/0/GDRDMA PHLRR4036:24142:24263 [3] NCCL INFO Channel 00 : 3[13000] -> 0[5000] via SHM/direct/direct PHLRR4036:24142:24263 [3] NCCL INFO Channel 01 : 3[13000] -> 0[5000] via SHM/direct/direct PHLRR4036:24144:24272 [4] NCCL INFO Connected all rings PHLRR4036:24144:24272 [4] NCCL INFO Channel 00 : 4[83000] -> 1[8000] via SHM/direct/direct 
PHLRR4036:24144:24272 [4] NCCL INFO Channel 01 : 4[83000] -> 1[8000] via SHM/direct/direct PHLRR4036:24142:24263 [3] NCCL INFO Channel 00/0 : 3[13000] -> 2[d000] via P2P/IPC PHLRR4036:24142:24263 [3] NCCL INFO Channel 01/0 : 3[13000] -> 2[d000] via P2P/IPC PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 2[d000] -> 10[d000] [send] via NET/IBext/0 PHLRR4036:24140:24267 [1] NCCL INFO Channel 00/0 : 1[8000] -> 0[5000] via P2P/IPC PHLRR4036:24140:24267 [1] NCCL INFO Channel 01/0 : 1[8000] -> 0[5000] via P2P/IPC PHLRR4036:24139:24253 [0] NCCL INFO Connected all trees PHLRR4036:24140:24267 [1] NCCL INFO Connected all trees PHLRR4036:24140:24267 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24139:24253 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24139:24253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24140:24267 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24144:24272 [4] NCCL INFO Connected all trees PHLRR4036:24144:24272 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24144:24272 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24146:24268 [5] NCCL INFO Connected all trees PHLRR4036:24146:24268 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24146:24268 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 2[d000] -> 10[d000] [send] via NET/IBext/0 PHLRR4036:24141:24285 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300. 
PHLRR4036:24141:24258 [2] NCCL INFO Connected all trees
PHLRR4036:24141:24258 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24141:24258 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24142:24263 [3] NCCL INFO Connected all trees
PHLRR4036:24142:24263 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24142:24263 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24141:24258 [2] NCCL INFO comm 0x55ee705f7690 rank 2 nranks 16 cudaDev 2 busId d000 - Init COMPLETE
PHLRR4036:24144:24272 [4] NCCL INFO comm 0x564611e72a30 rank 4 nranks 16 cudaDev 4 busId 83000 - Init COMPLETE
PHLRR4036:24146:24268 [5] NCCL INFO comm 0x55822f760a60 rank 5 nranks 16 cudaDev 5 busId 89000 - Init COMPLETE
PHLRR4036:24150:24264 [7] NCCL INFO comm 0x555644620380 rank 7 nranks 16 cudaDev 7 busId 91000 - Init COMPLETE
PHLRR4036:24140:24267 [1] NCCL INFO comm 0x5612afe990e0 rank 1 nranks 16 cudaDev 1 busId 8000 - Init COMPLETE
PHLRR4036:24142:24263 [3] NCCL INFO comm 0x5562191160f0 rank 3 nranks 16 cudaDev 3 busId 13000 - Init COMPLETE
PHLRR4036:24148:24269 [6] NCCL INFO comm 0x55cede6d3870 rank 6 nranks 16 cudaDev 6 busId 8e000 - Init COMPLETE
PHLRR4036:24139:24253 [0] NCCL INFO comm 0x5634028d2ec0 rank 0 nranks 16 cudaDev 0 busId 5000 - Init COMPLETE

PHLRR4036:24141:24296 [2] ib_plugin.c:978 NCCL WARN NET/IB : Got completion from peer 10.226.98.98<48598> with error 12, opcode 0, len 0, vendor err 129
PHLRR4036:24141:24296 [2] NCCL INFO include/net.h:35 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO transport/net.cc:1034 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO proxy.cc:520 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO proxy.cc:684 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: unhandled system error, NCCL version 2.14.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: NET/IB : Got completion from peer 10.226.98.98<48598> with error 12, opcode 0, len 0, vendor err 129
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24139 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24140 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24142 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24144 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24146 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24148 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24150 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 24141) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.0a0+d0d6b1f', 'console_scripts', 'torchrun')())
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

bloom-inference-scripts/bloom-ds-inference.py FAILED

Failures:

------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time : 2023-03-03_02:22:01
  host : PHLRR4036.corp.microsoft.com
  rank : 2 (local_rank: 2)
  exitcode : -6 (pid: 24141)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 24141
======================================================
``` - Node 1 ```
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
PHLRR3088:25515:25515 [2] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25526:25526 [7] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25517:25517 [3] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25523:25523 [6] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25513:25513 [0] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25521:25521 [5] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25514:25514 [1] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25519:25519 [4] NCCL INFO cudaDriverVersion 11080
PHLRR3088:25513:25513 [0] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0>
PHLRR3088:25521:25521 [5] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0>
PHLRR3088:25513:25513 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR3088:25513:25513 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR3088:25513:25513 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR3088:25513:25513 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
PHLRR3088:25517:25517 [3] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0>
PHLRR3088:25521:25521 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR3088:25521:25521 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR3088:25521:25521 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR3088:25521:25521 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR3088:25513:25585 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25513:25585 [0] NCCL INFO P2P plugin IBext PHLRR3088:25521:25586 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25521:25586 [5] NCCL INFO P2P plugin IBext PHLRR3088:25517:25517 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR3088:25517:25517 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR3088:25517:25517 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR3088:25517:25517 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR3088:25523:25523 [6] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0> PHLRR3088:25517:25590 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25517:25590 [3] NCCL INFO P2P plugin IBext PHLRR3088:25526:25526 [7] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0> PHLRR3088:25515:25515 [2] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0> PHLRR3088:25513:25585 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25513:25585 [0] NCCL INFO Using network IBext PHLRR3088:25523:25523 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR3088:25523:25523 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR3088:25523:25523 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR3088:25523:25523 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR3088:25521:25586 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25521:25586 [5] NCCL INFO Using network IBext PHLRR3088:25515:25515 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 
PHLRR3088:25515:25515 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR3088:25515:25515 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR3088:25515:25515 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR3088:25523:25596 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25523:25596 [6] NCCL INFO P2P plugin IBext PHLRR3088:25526:25526 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR3088:25526:25526 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR3088:25526:25526 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR3088:25526:25526 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR3088:25514:25514 [1] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0> PHLRR3088:25519:25519 [4] NCCL INFO Bootstrap : Using ens9f1:10.226.98.98<0> PHLRR3088:25517:25590 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25517:25590 [3] NCCL INFO Using network IBext PHLRR3088:25515:25600 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25515:25600 [2] NCCL INFO P2P plugin IBext PHLRR3088:25526:25601 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25526:25601 [7] NCCL INFO P2P plugin IBext PHLRR3088:25514:25514 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR3088:25514:25514 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR3088:25514:25514 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR3088:25514:25514 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR3088:25519:25519 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 
PHLRR3088:25519:25519 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR3088:25519:25519 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR3088:25519:25519 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR3088:25523:25596 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25523:25596 [6] NCCL INFO Using network IBext PHLRR3088:25514:25607 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25514:25607 [1] NCCL INFO P2P plugin IBext PHLRR3088:25519:25608 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR3088:25519:25608 [4] NCCL INFO P2P plugin IBext PHLRR3088:25515:25600 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25515:25600 [2] NCCL INFO Using network IBext PHLRR3088:25526:25601 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25526:25601 [7] NCCL INFO Using network IBext PHLRR3088:25514:25607 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25514:25607 [1] NCCL INFO Using network IBext PHLRR3088:25519:25608 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.98.98<0> PHLRR3088:25519:25608 [4] NCCL INFO Using network IBext PHLRR3088:25514:25607 [1] NCCL INFO Setting affinity for GPU 1 to 1fc07f PHLRR3088:25515:25600 [2] NCCL INFO Setting affinity for GPU 2 to 1fc07f PHLRR3088:25513:25585 [0] NCCL INFO Setting affinity for GPU 0 to 1fc07f PHLRR3088:25521:25586 [5] NCCL INFO Setting affinity for GPU 5 to 0fe03f80 PHLRR3088:25517:25590 [3] NCCL INFO Setting affinity for GPU 3 to 1fc07f PHLRR3088:25526:25601 [7] NCCL INFO Setting affinity for GPU 7 to 0fe03f80 PHLRR3088:25519:25608 [4] NCCL INFO Setting affinity for GPU 4 to 0fe03f80 PHLRR3088:25523:25596 [6] NCCL INFO Setting affinity for GPU 6 
to 0fe03f80 PHLRR3088:25526:25601 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] -1/-1/-1->15->14 PHLRR3088:25523:25596 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 PHLRR3088:25521:25586 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 PHLRR3088:25519:25608 [4] NCCL INFO Trees [0] 13/-1/-1->12->9 [1] 13/-1/-1->12->9 PHLRR3088:25515:25600 [2] NCCL INFO Trees [0] 11/-1/-1->10->2 [1] 11/2/-1->10->-1 PHLRR3088:25514:25607 [1] NCCL INFO Trees [0] 12/-1/-1->9->8 [1] 12/-1/-1->9->8 PHLRR3088:25517:25590 [3] NCCL INFO Trees [0] 8/-1/-1->11->10 [1] 8/-1/-1->11->10 PHLRR3088:25513:25585 [0] NCCL INFO Trees [0] 9/-1/-1->8->11 [1] 9/-1/-1->8->11 PHLRR3088:25514:25607 [1] NCCL INFO Channel 00 : 9[8000] -> 10[d000] via SHM/direct/direct PHLRR3088:25514:25607 [1] NCCL INFO Channel 01 : 9[8000] -> 10[d000] via SHM/direct/direct PHLRR3088:25519:25608 [4] NCCL INFO Channel 00/0 : 11[13000] -> 12[83000] [receive] via NET/IBext/1/GDRDMA PHLRR3088:25513:25585 [0] NCCL INFO Channel 00/0 : 7[91000] -> 8[5000] [receive] via NET/IBext/0 PHLRR3088:25521:25586 [5] NCCL INFO Channel 00 : 13[89000] -> 14[8e000] via SHM/direct/direct PHLRR3088:25521:25586 [5] NCCL INFO Channel 01 : 13[89000] -> 14[8e000] via SHM/direct/direct PHLRR3088:25523:25596 [6] NCCL INFO Channel 00/0 : 14[8e000] -> 15[91000] via P2P/IPC PHLRR3088:25515:25600 [2] NCCL INFO Channel 00/0 : 10[d000] -> 11[13000] via P2P/IPC PHLRR3088:25523:25596 [6] NCCL INFO Channel 01/0 : 14[8e000] -> 15[91000] via P2P/IPC PHLRR3088:25515:25600 [2] NCCL INFO Channel 01/0 : 10[d000] -> 11[13000] via P2P/IPC PHLRR3088:25526:25601 [7] NCCL INFO Channel 00/0 : 15[91000] -> 0[5000] [send] via NET/IBext/1 PHLRR3088:25517:25590 [3] NCCL INFO Channel 00/0 : 11[13000] -> 12[83000] [send] via NET/IBext/1 PHLRR3088:25519:25608 [4] NCCL INFO Channel 01/0 : 11[13000] -> 12[83000] [receive] via NET/IBext/1/GDRDMA PHLRR3088:25519:25608 [4] NCCL INFO Channel 00/0 : 12[83000] -> 13[89000] via P2P/IPC PHLRR3088:25519:25608 [4] 
NCCL INFO Channel 01/0 : 12[83000] -> 13[89000] via P2P/IPC PHLRR3088:25513:25585 [0] NCCL INFO Channel 01/0 : 7[91000] -> 8[5000] [receive] via NET/IBext/0 PHLRR3088:25513:25585 [0] NCCL INFO Channel 00/0 : 8[5000] -> 9[8000] via P2P/IPC PHLRR3088:25513:25585 [0] NCCL INFO Channel 01/0 : 8[5000] -> 9[8000] via P2P/IPC PHLRR3088:25521:25586 [5] NCCL INFO Connected all rings PHLRR3088:25514:25607 [1] NCCL INFO Connected all rings PHLRR3088:25514:25607 [1] NCCL INFO Channel 00 : 9[8000] -> 12[83000] via SHM/direct/direct PHLRR3088:25514:25607 [1] NCCL INFO Channel 01 : 9[8000] -> 12[83000] via SHM/direct/direct PHLRR3088:25521:25586 [5] NCCL INFO Channel 00/0 : 13[89000] -> 12[83000] via P2P/IPC PHLRR3088:25521:25586 [5] NCCL INFO Channel 01/0 : 13[89000] -> 12[83000] via P2P/IPC PHLRR3088:25526:25601 [7] NCCL INFO Channel 01/0 : 15[91000] -> 0[5000] [send] via NET/IBext/1 PHLRR3088:25517:25590 [3] NCCL INFO Channel 01/0 : 11[13000] -> 12[83000] [send] via NET/IBext/1 PHLRR3088:25523:25596 [6] NCCL INFO Connected all rings PHLRR3088:25513:25624 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300. PHLRR3088:25515:25600 [2] NCCL INFO Connected all rings PHLRR3088:25519:25626 [4] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300. 
PHLRR3088:25515:25600 [2] NCCL INFO Channel 00/0 : 2[d000] -> 10[d000] [receive] via NET/IBext/0/GDRDMA
PHLRR3088:25526:25601 [7] NCCL INFO Connected all rings
PHLRR3088:25526:25601 [7] NCCL INFO Channel 00/0 : 15[91000] -> 14[8e000] via P2P/IPC
PHLRR3088:25526:25601 [7] NCCL INFO Channel 01/0 : 15[91000] -> 14[8e000] via P2P/IPC
PHLRR3088:25523:25596 [6] NCCL INFO Channel 00 : 14[8e000] -> 13[89000] via SHM/direct/direct
PHLRR3088:25513:25585 [0] NCCL INFO Connected all rings
PHLRR3088:25517:25590 [3] NCCL INFO Connected all rings
PHLRR3088:25523:25596 [6] NCCL INFO Channel 01 : 14[8e000] -> 13[89000] via SHM/direct/direct
PHLRR3088:25513:25585 [0] NCCL INFO Channel 00 : 8[5000] -> 11[13000] via SHM/direct/direct
PHLRR3088:25513:25585 [0] NCCL INFO Channel 01 : 8[5000] -> 11[13000] via SHM/direct/direct
PHLRR3088:25526:25601 [7] NCCL INFO Connected all trees
PHLRR3088:25526:25601 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25526:25601 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25523:25596 [6] NCCL INFO Connected all trees
PHLRR3088:25523:25596 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25523:25596 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25515:25600 [2] NCCL INFO Channel 01/0 : 2[d000] -> 10[d000] [receive] via NET/IBext/0/GDRDMA
PHLRR3088:25519:25608 [4] NCCL INFO Connected all rings
PHLRR3088:25517:25590 [3] NCCL INFO Channel 00 : 11[13000] -> 8[5000] via SHM/direct/direct
PHLRR3088:25517:25590 [3] NCCL INFO Channel 01 : 11[13000] -> 8[5000] via SHM/direct/direct
PHLRR3088:25519:25608 [4] NCCL INFO Channel 00 : 12[83000] -> 9[8000] via SHM/direct/direct
PHLRR3088:25519:25608 [4] NCCL INFO Channel 01 : 12[83000] -> 9[8000] via SHM/direct/direct
PHLRR3088:25517:25590 [3] NCCL INFO Channel 00/0 : 11[13000] -> 10[d000] via P2P/IPC
PHLRR3088:25517:25590 [3] NCCL INFO Channel 01/0 : 11[13000] -> 10[d000] via P2P/IPC
PHLRR3088:25515:25600 [2] NCCL INFO Channel 00/0 : 10[d000] -> 2[d000] [send] via NET/IBext/0
PHLRR3088:25514:25607 [1] NCCL INFO Channel 00/0 : 9[8000] -> 8[5000] via P2P/IPC
PHLRR3088:25514:25607 [1] NCCL INFO Channel 01/0 : 9[8000] -> 8[5000] via P2P/IPC
PHLRR3088:25513:25585 [0] NCCL INFO Connected all trees
PHLRR3088:25513:25585 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25513:25585 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25514:25607 [1] NCCL INFO Connected all trees
PHLRR3088:25514:25607 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25514:25607 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25519:25608 [4] NCCL INFO Connected all trees
PHLRR3088:25519:25608 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25519:25608 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25521:25586 [5] NCCL INFO Connected all trees
PHLRR3088:25521:25586 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25521:25586 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25515:25600 [2] NCCL INFO Channel 01/0 : 10[d000] -> 2[d000] [send] via NET/IBext/0
PHLRR3088:25515:25622 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300.
PHLRR3088:25515:25600 [2] NCCL INFO Connected all trees
PHLRR3088:25515:25600 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25515:25600 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25517:25590 [3] NCCL INFO Connected all trees
PHLRR3088:25517:25590 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR3088:25517:25590 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR3088:25515:25600 [2] NCCL INFO comm 0x5614942e7890 rank 10 nranks 16 cudaDev 2 busId d000 - Init COMPLETE
PHLRR3088:25517:25590 [3] NCCL INFO comm 0x55b9aef1d5c0 rank 11 nranks 16 cudaDev 3 busId 13000 - Init COMPLETE
PHLRR3088:25519:25608 [4] NCCL INFO comm 0x55cdde437fd0 rank 12 nranks 16 cudaDev 4 busId 83000 - Init COMPLETE
PHLRR3088:25526:25601 [7] NCCL INFO comm 0x55c8fb11f310 rank 15 nranks 16 cudaDev 7 busId 91000 - Init COMPLETE
PHLRR3088:25513:25585 [0] NCCL INFO comm 0x561baa3cd5c0 rank 8 nranks 16 cudaDev 0 busId 5000 - Init COMPLETE
PHLRR3088:25514:25607 [1] NCCL INFO comm 0x560b6c952590 rank 9 nranks 16 cudaDev 1 busId 8000 - Init COMPLETE
PHLRR3088:25523:25596 [6] NCCL INFO comm 0x563f29166010 rank 14 nranks 16 cudaDev 6 busId 8e000 - Init COMPLETE
PHLRR3088:25521:25586 [5] NCCL INFO comm 0x5582407e5d50 rank 13 nranks 16 cudaDev 5 busId 89000 - Init COMPLETE
PHLRR3088:25515:25633 [2] socket.c:505 NCCL WARN Net : Call to recv from 10.226.99.46<45256> failed : Connection reset by peer
PHLRR3088:25515:25633 [2] NCCL INFO socket.c:522 -> 2
PHLRR3088:25515:25633 [2] NCCL INFO ib_plugin.c:584 -> 2
PHLRR3088:25515:25633 [2] NCCL INFO ib_plugin.c:883 -> 2
PHLRR3088:25515:25633 [2] NCCL INFO include/net.h:33 -> 2
PHLRR3088:25515:25633 [2] NCCL INFO transport/net.cc:1018 -> 2
PHLRR3088:25515:25633 [2] NCCL INFO proxy.cc:520 -> 2
PHLRR3088:25515:25633 [2] NCCL INFO proxy.cc:684 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: unhandled system error, NCCL version 2.15.1
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25513 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25514 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25517 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25519 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25521 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25523 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 25526 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 25515) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in
    sys.exit(load_entry_point('torch==1.13.0a0+d0d6b1f', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
bloom-inference-scripts/bloom-ds-inference.py FAILED
------------------------------------------------------
Failures:
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-03_02:22:01
  host      : PHLRR3088.corp.microsoft.com
  rank      : 10 (local_rank: 2)
  exitcode  : -6 (pid: 25515)
  error_file:
  traceback : Signal 6 (SIGABRT) received by PID 25515
======================================================
```
ayushmodi-038 commented 1 year ago

Any updates? I have the same issue.

santapo commented 1 year ago

Same here.

HUAFOR commented 11 months ago

Any updates? I have the same issue: "Some NCCL operations have failed or timed out."

mxdlzg commented 5 months ago

Same error, any updates?