Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-03-03 02:21:44,523] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom
Fetching 197 files: 0%| | 0/197 [00:00<?, ?it/s]
Fetching 197 files: 100%|██████████| 197/197 [00:00<00:00, 4156.77it/s]
PHLRR4036:24139:24139 [0] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
[1677810113.706248] [PHLRR4036:24139:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
PHLRR4036:24139:24139 [0] NCCL INFO cudaDriverVersion 11080
NCCL version 2.14.3+cuda11.7
PHLRR4036:24141:24141 [2] NCCL INFO cudaDriverVersion 11080
PHLRR4036:24150:24150 [7] NCCL INFO cudaDriverVersion 11080
PHLRR4036:24148:24148 [6] NCCL INFO cudaDriverVersion 11080
PHLRR4036:24146:24146 [5] NCCL INFO cudaDriverVersion 11080
PHLRR4036:24140:24140 [1] NCCL INFO cudaDriverVersion 11080
PHLRR4036:24142:24142 [3] NCCL INFO cudaDriverVersion 11080
PHLRR4036:24139:24253 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
PHLRR4036:24139:24253 [0] NCCL INFO P2P plugin IBext
PHLRR4036:24144:24144 [4] NCCL INFO cudaDriverVersion 11080
PHLRR4036:24141:24141 [2] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24142:24142 [3] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24150:24150 [7] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24140:24140 [1] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24146:24146 [5] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24148:24148 [6] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
[1677810113.720920] [PHLRR4036:24141:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
[1677810113.723766] [PHLRR4036:24142:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
PHLRR4036:24141:24258 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1677810113.723912] [PHLRR4036:24150:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
PHLRR4036:24141:24258 [2] NCCL INFO P2P plugin IBext
PHLRR4036:24144:24144 [4] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0>
PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
[1677810113.725985] [PHLRR4036:24140:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
[1677810113.726464] [PHLRR4036:24148:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
[1677810113.726557] [PHLRR4036:24146:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
PHLRR4036:24142:24263 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
PHLRR4036:24142:24263 [3] NCCL INFO P2P plugin IBext
PHLRR4036:24150:24264 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
PHLRR4036:24150:24264 [7] NCCL INFO P2P plugin IBext
PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
[1677810113.729624] [PHLRR4036:24144:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device
PHLRR4036:24139:24253 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24140:24267 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
PHLRR4036:24140:24267 [1] NCCL INFO P2P plugin IBext
PHLRR4036:24139:24253 [0] NCCL INFO Using network IBext
PHLRR4036:24146:24268 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
PHLRR4036:24146:24268 [5] NCCL INFO P2P plugin IBext
PHLRR4036:24148:24269 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
PHLRR4036:24148:24269 [6] NCCL INFO P2P plugin IBext
PHLRR4036:24144:24272 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
PHLRR4036:24144:24272 [4] NCCL INFO P2P plugin IBext
PHLRR4036:24141:24258 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24141:24258 [2] NCCL INFO Using network IBext
PHLRR4036:24142:24263 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24142:24263 [3] NCCL INFO Using network IBext
PHLRR4036:24150:24264 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24150:24264 [7] NCCL INFO Using network IBext
PHLRR4036:24146:24268 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24146:24268 [5] NCCL INFO Using network IBext
PHLRR4036:24148:24269 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24148:24269 [6] NCCL INFO Using network IBext
PHLRR4036:24140:24267 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24140:24267 [1] NCCL INFO Using network IBext
PHLRR4036:24144:24272 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0>
PHLRR4036:24144:24272 [4] NCCL INFO Using network IBext
PHLRR4036:24148:24269 [6] NCCL INFO Setting affinity for GPU 6 to 0fe03f80
PHLRR4036:24141:24258 [2] NCCL INFO Setting affinity for GPU 2 to 1fc07f
PHLRR4036:24146:24268 [5] NCCL INFO Setting affinity for GPU 5 to 0fe03f80
PHLRR4036:24139:24253 [0] NCCL INFO Setting affinity for GPU 0 to 1fc07f
PHLRR4036:24142:24263 [3] NCCL INFO Setting affinity for GPU 3 to 1fc07f
PHLRR4036:24140:24267 [1] NCCL INFO Setting affinity for GPU 1 to 1fc07f
PHLRR4036:24150:24264 [7] NCCL INFO Setting affinity for GPU 7 to 0fe03f80
PHLRR4036:24144:24272 [4] NCCL INFO Setting affinity for GPU 4 to 0fe03f80
PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
PHLRR4036:24141:24258 [2] NCCL INFO Trees [0] 3/10/-1->2->-1 [1] 3/-1/-1->2->10
PHLRR4036:24140:24267 [1] NCCL INFO Trees [0] 4/-1/-1->1->0 [1] 4/-1/-1->1->0
PHLRR4036:24150:24264 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6
PHLRR4036:24148:24269 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
PHLRR4036:24146:24268 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4
PHLRR4036:24142:24263 [3] NCCL INFO Trees [0] 0/-1/-1->3->2 [1] 0/-1/-1->3->2
PHLRR4036:24144:24272 [4] NCCL INFO Trees [0] 5/-1/-1->4->1 [1] 5/-1/-1->4->1
PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
PHLRR4036:24139:24253 [0] NCCL INFO Trees [0] 1/-1/-1->0->3 [1] 1/-1/-1->0->3
PHLRR4036:24144:24272 [4] NCCL INFO Channel 00/0 : 3[13000] -> 4[83000] [receive] via NET/IBext/1/GDRDMA
PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/0 : 15[91000] -> 0[5000] [receive] via NET/IBext/0
PHLRR4036:24140:24267 [1] NCCL INFO Channel 00 : 1[8000] -> 2[d000] via SHM/direct/direct
PHLRR4036:24146:24268 [5] NCCL INFO Channel 00 : 5[89000] -> 6[8e000] via SHM/direct/direct
PHLRR4036:24140:24267 [1] NCCL INFO Channel 01 : 1[8000] -> 2[d000] via SHM/direct/direct
PHLRR4036:24146:24268 [5] NCCL INFO Channel 01 : 5[89000] -> 6[8e000] via SHM/direct/direct
PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 2[d000] -> 3[13000] via P2P/IPC
PHLRR4036:24148:24269 [6] NCCL INFO Channel 00/0 : 6[8e000] -> 7[91000] via P2P/IPC
PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 2[d000] -> 3[13000] via P2P/IPC
PHLRR4036:24148:24269 [6] NCCL INFO Channel 01/0 : 6[8e000] -> 7[91000] via P2P/IPC
PHLRR4036:24150:24264 [7] NCCL INFO Channel 00/0 : 7[91000] -> 8[5000] [send] via NET/IBext/1
PHLRR4036:24142:24263 [3] NCCL INFO Channel 00/0 : 3[13000] -> 4[83000] [send] via NET/IBext/1
PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/0 : 15[91000] -> 0[5000] [receive] via NET/IBext/0
PHLRR4036:24144:24272 [4] NCCL INFO Channel 01/0 : 3[13000] -> 4[83000] [receive] via NET/IBext/1/GDRDMA
PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/0 : 0[5000] -> 1[8000] via P2P/IPC
PHLRR4036:24144:24272 [4] NCCL INFO Channel 00/0 : 4[83000] -> 5[89000] via P2P/IPC
PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/0 : 0[5000] -> 1[8000] via P2P/IPC
PHLRR4036:24144:24272 [4] NCCL INFO Channel 01/0 : 4[83000] -> 5[89000] via P2P/IPC
PHLRR4036:24140:24267 [1] NCCL INFO Connected all rings
PHLRR4036:24140:24267 [1] NCCL INFO Channel 00 : 1[8000] -> 4[83000] via SHM/direct/direct
PHLRR4036:24140:24267 [1] NCCL INFO Channel 01 : 1[8000] -> 4[83000] via SHM/direct/direct
PHLRR4036:24146:24268 [5] NCCL INFO Connected all rings
PHLRR4036:24146:24268 [5] NCCL INFO Channel 00/0 : 5[89000] -> 4[83000] via P2P/IPC
PHLRR4036:24146:24268 [5] NCCL INFO Channel 01/0 : 5[89000] -> 4[83000] via P2P/IPC
PHLRR4036:24150:24264 [7] NCCL INFO Channel 01/0 : 7[91000] -> 8[5000] [send] via NET/IBext/1
PHLRR4036:24139:24287 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300.
PHLRR4036:24142:24263 [3] NCCL INFO Channel 01/0 : 3[13000] -> 4[83000] [send] via NET/IBext/1
PHLRR4036:24148:24269 [6] NCCL INFO Connected all rings
PHLRR4036:24148:24269 [6] NCCL INFO Channel 00 : 6[8e000] -> 5[89000] via SHM/direct/direct
PHLRR4036:24148:24269 [6] NCCL INFO Channel 01 : 6[8e000] -> 5[89000] via SHM/direct/direct
PHLRR4036:24141:24258 [2] NCCL INFO Connected all rings
PHLRR4036:24144:24291 [4] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300.
PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 10[d000] -> 2[d000] [receive] via NET/IBext/0/GDRDMA
PHLRR4036:24139:24253 [0] NCCL INFO Connected all rings
PHLRR4036:24150:24264 [7] NCCL INFO Connected all rings
PHLRR4036:24150:24264 [7] NCCL INFO Channel 00/0 : 7[91000] -> 6[8e000] via P2P/IPC
PHLRR4036:24139:24253 [0] NCCL INFO Channel 00 : 0[5000] -> 3[13000] via SHM/direct/direct
PHLRR4036:24142:24263 [3] NCCL INFO Connected all rings
PHLRR4036:24139:24253 [0] NCCL INFO Channel 01 : 0[5000] -> 3[13000] via SHM/direct/direct
PHLRR4036:24150:24264 [7] NCCL INFO Channel 01/0 : 7[91000] -> 6[8e000] via P2P/IPC
PHLRR4036:24150:24264 [7] NCCL INFO Connected all trees
PHLRR4036:24150:24264 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24150:24264 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24148:24269 [6] NCCL INFO Connected all trees
PHLRR4036:24148:24269 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24148:24269 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 10[d000] -> 2[d000] [receive] via NET/IBext/0/GDRDMA
PHLRR4036:24142:24263 [3] NCCL INFO Channel 00 : 3[13000] -> 0[5000] via SHM/direct/direct
PHLRR4036:24142:24263 [3] NCCL INFO Channel 01 : 3[13000] -> 0[5000] via SHM/direct/direct
PHLRR4036:24144:24272 [4] NCCL INFO Connected all rings
PHLRR4036:24144:24272 [4] NCCL INFO Channel 00 : 4[83000] -> 1[8000] via SHM/direct/direct
PHLRR4036:24144:24272 [4] NCCL INFO Channel 01 : 4[83000] -> 1[8000] via SHM/direct/direct
PHLRR4036:24142:24263 [3] NCCL INFO Channel 00/0 : 3[13000] -> 2[d000] via P2P/IPC
PHLRR4036:24142:24263 [3] NCCL INFO Channel 01/0 : 3[13000] -> 2[d000] via P2P/IPC
PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 2[d000] -> 10[d000] [send] via NET/IBext/0
PHLRR4036:24140:24267 [1] NCCL INFO Channel 00/0 : 1[8000] -> 0[5000] via P2P/IPC
PHLRR4036:24140:24267 [1] NCCL INFO Channel 01/0 : 1[8000] -> 0[5000] via P2P/IPC
PHLRR4036:24139:24253 [0] NCCL INFO Connected all trees
PHLRR4036:24140:24267 [1] NCCL INFO Connected all trees
PHLRR4036:24140:24267 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24139:24253 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24139:24253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24140:24267 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24144:24272 [4] NCCL INFO Connected all trees
PHLRR4036:24144:24272 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24144:24272 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24146:24268 [5] NCCL INFO Connected all trees
PHLRR4036:24146:24268 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24146:24268 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 2[d000] -> 10[d000] [send] via NET/IBext/0
PHLRR4036:24141:24285 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300.
PHLRR4036:24141:24258 [2] NCCL INFO Connected all trees
PHLRR4036:24141:24258 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24141:24258 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24142:24263 [3] NCCL INFO Connected all trees
PHLRR4036:24142:24263 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24142:24263 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24141:24258 [2] NCCL INFO comm 0x55ee705f7690 rank 2 nranks 16 cudaDev 2 busId d000 - Init COMPLETE
PHLRR4036:24144:24272 [4] NCCL INFO comm 0x564611e72a30 rank 4 nranks 16 cudaDev 4 busId 83000 - Init COMPLETE
PHLRR4036:24146:24268 [5] NCCL INFO comm 0x55822f760a60 rank 5 nranks 16 cudaDev 5 busId 89000 - Init COMPLETE
PHLRR4036:24150:24264 [7] NCCL INFO comm 0x555644620380 rank 7 nranks 16 cudaDev 7 busId 91000 - Init COMPLETE
PHLRR4036:24140:24267 [1] NCCL INFO comm 0x5612afe990e0 rank 1 nranks 16 cudaDev 1 busId 8000 - Init COMPLETE
PHLRR4036:24142:24263 [3] NCCL INFO comm 0x5562191160f0 rank 3 nranks 16 cudaDev 3 busId 13000 - Init COMPLETE
PHLRR4036:24148:24269 [6] NCCL INFO comm 0x55cede6d3870 rank 6 nranks 16 cudaDev 6 busId 8e000 - Init COMPLETE
PHLRR4036:24139:24253 [0] NCCL INFO comm 0x5634028d2ec0 rank 0 nranks 16 cudaDev 0 busId 5000 - Init COMPLETE
PHLRR4036:24141:24296 [2] ib_plugin.c:978 NCCL WARN NET/IB : Got completion from peer 10.226.98.98<48598> with error 12, opcode 0, len 0, vendor err 129
PHLRR4036:24141:24296 [2] NCCL INFO include/net.h:35 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO transport/net.cc:1034 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO proxy.cc:520 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO proxy.cc:684 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error: unhandled system error, NCCL version 2.14.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
NET/IB : Got completion from peer 10.226.98.98<48598> with error 12, opcode 0, len 0, vendor err 129
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24139 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24140 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24142 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24144 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24146 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24148 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24150 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 24141) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.13.0a0+d0d6b1f', 'console_scripts', 'torchrun')())
File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
torchrun --node_rank=1 --nnodes 2 --nproc_per_node=8 --master_addr xxxxx --master_port 6000 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom > 1.log 2>&1
deepspeed --num_gpus 8 --num_nodes 2 --master_addr XXXX --hostfile hostfile bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom > 1.log 2>&1
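For anyone triaging the `NCCL WARN NET/IB : Got completion from peer ...` line above, here is a minimal sketch that pulls the fields out of that message and names the completion status. It assumes the `error` field is the raw `ibv_wc_status` value from libibverbs, as it is in NCCL's built-in IB transport; the helper name `parse_ib_completion_error` and the small lookup table are hypothetical and only cover a few statuses.

```python
import re

# Partial copy of the ibv_wc_status enum from libibverbs (<infiniband/verbs.h>).
# Only a few values are listed; anything else is reported as "unknown".
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the warning text exactly as it appears in the log above.
WARN_RE = re.compile(
    r"NET/IB : Got completion from peer (?P<peer>[\d.]+)<(?P<port>\d+)> "
    r"with error (?P<status>\d+), opcode (?P<opcode>\d+), "
    r"len (?P<length>\d+), vendor err (?P<vendor_err>\d+)"
)


def parse_ib_completion_error(line):
    """Return the fields of an NCCL IB completion warning, or None if absent."""
    match = WARN_RE.search(line)
    if match is None:
        return None
    status = int(match.group("status"))
    return {
        "peer": match.group("peer"),
        "port": int(match.group("port")),
        "status": status,
        # Assumption: the "error" number is an ibv_wc_status value.
        "status_name": IBV_WC_STATUS.get(status, "unknown"),
        "opcode": int(match.group("opcode")),
        "length": int(match.group("length")),
        "vendor_err": int(match.group("vendor_err")),
    }


if __name__ == "__main__":
    sample = (
        "PHLRR4036:24141:24296 [2] ib_plugin.c:978 NCCL WARN NET/IB : "
        "Got completion from peer 10.226.98.98<48598> with error 12, "
        "opcode 0, len 0, vendor err 129"
    )
    print(parse_ib_completion_error(sample))
```

Under that assumption, status 12 corresponds to `IBV_WC_RETRY_EXC_ERR` (transport retry counter exceeded), which points at the link between this node and peer 10.226.98.98 rather than at the model-loading code.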
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2023-03-03 02:21:44,523] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl *** Loading the model bigscience/bloom
Fetching 197 files: 0%| | 0/197 [00:00<?, ?it/s] Fetching 197 files: 100%|██████████| 197/197 [00:00<00:00, 4156.77it/s] PHLRR4036:24139:24139 [0] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24139:24139 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.706248] [PHLRR4036:24139:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24139:24139 [0] NCCL INFO cudaDriverVersion 11080 NCCL version 2.14.3+cuda11.7 PHLRR4036:24141:24141 [2] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24150:24150 [7] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24148:24148 [6] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24146:24146 [5] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24140:24140 [1] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24142:24142 [3] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24139:24253 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24139:24253 [0] NCCL INFO P2P plugin IBext PHLRR4036:24144:24144 [4] NCCL INFO cudaDriverVersion 11080 PHLRR4036:24141:24141 [2] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24142:24142 [3] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24150:24150 [7] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24140:24140 [1] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24146:24146 [5] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24148:24148 [6] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. 
PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24141:24141 [2] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.720920] [PHLRR4036:24141:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24142:24142 [3] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24150:24150 [7] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.723766] [PHLRR4036:24142:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24141:24258 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so [1677810113.723912] [PHLRR4036:24150:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24141:24258 [2] NCCL INFO P2P plugin IBext PHLRR4036:24144:24144 [4] NCCL INFO Bootstrap : Using ens9f1:10.226.99.46<0> PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 
PHLRR4036:24140:24140 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.725985] [PHLRR4036:24140:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24148:24148 [6] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. PHLRR4036:24146:24146 [5] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.726464] [PHLRR4036:24148:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device [1677810113.726557] [PHLRR4036:24146:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24142:24263 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24142:24263 [3] NCCL INFO P2P plugin IBext PHLRR4036:24150:24264 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24150:24264 [7] NCCL INFO P2P plugin IBext PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol. PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5) PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol. 
PHLRR4036:24144:24144 [4] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5) [1677810113.729624] [PHLRR4036:24144:f] vfs_fuse.c:281 UCX ERROR inotify_add_watch(/tmp) failed: No space left on device PHLRR4036:24139:24253 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24140:24267 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24140:24267 [1] NCCL INFO P2P plugin IBext PHLRR4036:24139:24253 [0] NCCL INFO Using network IBext PHLRR4036:24146:24268 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24146:24268 [5] NCCL INFO P2P plugin IBext PHLRR4036:24148:24269 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24148:24269 [6] NCCL INFO P2P plugin IBext PHLRR4036:24144:24272 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so PHLRR4036:24144:24272 [4] NCCL INFO P2P plugin IBext PHLRR4036:24141:24258 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24141:24258 [2] NCCL INFO Using network IBext PHLRR4036:24142:24263 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24142:24263 [3] NCCL INFO Using network IBext PHLRR4036:24150:24264 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24150:24264 [7] NCCL INFO Using network IBext PHLRR4036:24146:24268 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24146:24268 [5] NCCL INFO Using network IBext PHLRR4036:24148:24269 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24148:24269 [6] NCCL INFO Using network IBext PHLRR4036:24140:24267 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24140:24267 [1] NCCL INFO Using 
network IBext PHLRR4036:24144:24272 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/IB [RO]; OOB ens9f1:10.226.99.46<0> PHLRR4036:24144:24272 [4] NCCL INFO Using network IBext PHLRR4036:24148:24269 [6] NCCL INFO Setting affinity for GPU 6 to 0fe03f80 PHLRR4036:24141:24258 [2] NCCL INFO Setting affinity for GPU 2 to 1fc07f PHLRR4036:24146:24268 [5] NCCL INFO Setting affinity for GPU 5 to 0fe03f80 PHLRR4036:24139:24253 [0] NCCL INFO Setting affinity for GPU 0 to 1fc07f PHLRR4036:24142:24263 [3] NCCL INFO Setting affinity for GPU 3 to 1fc07f PHLRR4036:24140:24267 [1] NCCL INFO Setting affinity for GPU 1 to 1fc07f PHLRR4036:24150:24264 [7] NCCL INFO Setting affinity for GPU 7 to 0fe03f80 PHLRR4036:24144:24272 [4] NCCL INFO Setting affinity for GPU 4 to 0fe03f80 PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 PHLRR4036:24141:24258 [2] NCCL INFO Trees [0] 3/10/-1->2->-1 [1] 3/-1/-1->2->10 PHLRR4036:24140:24267 [1] NCCL INFO Trees [0] 4/-1/-1->1->0 [1] 4/-1/-1->1->0 PHLRR4036:24150:24264 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 PHLRR4036:24148:24269 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 PHLRR4036:24146:24268 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 PHLRR4036:24142:24263 [3] NCCL INFO Trees [0] 0/-1/-1->3->2 [1] 0/-1/-1->3->2 PHLRR4036:24144:24272 [4] NCCL INFO Trees [0] 5/-1/-1->4->1 [1] 5/-1/-1->4->1 PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 PHLRR4036:24139:24253 [0] NCCL INFO Trees [0] 1/-1/-1->0->3 [1] 1/-1/-1->0->3 PHLRR4036:24144:24272 [4] NCCL INFO Channel 00/0 : 3[13000] -> 4[83000] [receive] via NET/IBext/1/GDRDMA PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/0 : 15[91000] -> 0[5000] [receive] via NET/IBext/0 PHLRR4036:24140:24267 [1] NCCL INFO Channel 00 : 1[8000] -> 2[d000] via SHM/direct/direct PHLRR4036:24146:24268 [5] NCCL INFO Channel 00 : 5[89000] -> 6[8e000] via SHM/direct/direct PHLRR4036:24140:24267 [1] NCCL 
INFO Channel 01 : 1[8000] -> 2[d000] via SHM/direct/direct PHLRR4036:24146:24268 [5] NCCL INFO Channel 01 : 5[89000] -> 6[8e000] via SHM/direct/direct PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 2[d000] -> 3[13000] via P2P/IPC PHLRR4036:24148:24269 [6] NCCL INFO Channel 00/0 : 6[8e000] -> 7[91000] via P2P/IPC PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 2[d000] -> 3[13000] via P2P/IPC PHLRR4036:24148:24269 [6] NCCL INFO Channel 01/0 : 6[8e000] -> 7[91000] via P2P/IPC PHLRR4036:24150:24264 [7] NCCL INFO Channel 00/0 : 7[91000] -> 8[5000] [send] via NET/IBext/1 PHLRR4036:24142:24263 [3] NCCL INFO Channel 00/0 : 3[13000] -> 4[83000] [send] via NET/IBext/1 PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/0 : 15[91000] -> 0[5000] [receive] via NET/IBext/0 PHLRR4036:24144:24272 [4] NCCL INFO Channel 01/0 : 3[13000] -> 4[83000] [receive] via NET/IBext/1/GDRDMA PHLRR4036:24139:24253 [0] NCCL INFO Channel 00/0 : 0[5000] -> 1[8000] via P2P/IPC PHLRR4036:24144:24272 [4] NCCL INFO Channel 00/0 : 4[83000] -> 5[89000] via P2P/IPC PHLRR4036:24139:24253 [0] NCCL INFO Channel 01/0 : 0[5000] -> 1[8000] via P2P/IPC PHLRR4036:24144:24272 [4] NCCL INFO Channel 01/0 : 4[83000] -> 5[89000] via P2P/IPC PHLRR4036:24140:24267 [1] NCCL INFO Connected all rings PHLRR4036:24140:24267 [1] NCCL INFO Channel 00 : 1[8000] -> 4[83000] via SHM/direct/direct PHLRR4036:24140:24267 [1] NCCL INFO Channel 01 : 1[8000] -> 4[83000] via SHM/direct/direct PHLRR4036:24146:24268 [5] NCCL INFO Connected all rings PHLRR4036:24146:24268 [5] NCCL INFO Channel 00/0 : 5[89000] -> 4[83000] via P2P/IPC PHLRR4036:24146:24268 [5] NCCL INFO Channel 01/0 : 5[89000] -> 4[83000] via P2P/IPC PHLRR4036:24150:24264 [7] NCCL INFO Channel 01/0 : 7[91000] -> 8[5000] [send] via NET/IBext/1 PHLRR4036:24139:24287 [0] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300. 
PHLRR4036:24142:24263 [3] NCCL INFO Channel 01/0 : 3[13000] -> 4[83000] [send] via NET/IBext/1 PHLRR4036:24148:24269 [6] NCCL INFO Connected all rings PHLRR4036:24148:24269 [6] NCCL INFO Channel 00 : 6[8e000] -> 5[89000] via SHM/direct/direct PHLRR4036:24148:24269 [6] NCCL INFO Channel 01 : 6[8e000] -> 5[89000] via SHM/direct/direct PHLRR4036:24141:24258 [2] NCCL INFO Connected all rings PHLRR4036:24144:24291 [4] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300. PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 10[d000] -> 2[d000] [receive] via NET/IBext/0/GDRDMA PHLRR4036:24139:24253 [0] NCCL INFO Connected all rings PHLRR4036:24150:24264 [7] NCCL INFO Connected all rings PHLRR4036:24150:24264 [7] NCCL INFO Channel 00/0 : 7[91000] -> 6[8e000] via P2P/IPC PHLRR4036:24139:24253 [0] NCCL INFO Channel 00 : 0[5000] -> 3[13000] via SHM/direct/direct PHLRR4036:24142:24263 [3] NCCL INFO Connected all rings PHLRR4036:24139:24253 [0] NCCL INFO Channel 01 : 0[5000] -> 3[13000] via SHM/direct/direct PHLRR4036:24150:24264 [7] NCCL INFO Channel 01/0 : 7[91000] -> 6[8e000] via P2P/IPC PHLRR4036:24150:24264 [7] NCCL INFO Connected all trees PHLRR4036:24150:24264 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24150:24264 [7] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24148:24269 [6] NCCL INFO Connected all trees PHLRR4036:24148:24269 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 PHLRR4036:24148:24269 [6] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 10[d000] -> 2[d000] [receive] via NET/IBext/0/GDRDMA PHLRR4036:24142:24263 [3] NCCL INFO Channel 00 : 3[13000] -> 0[5000] via SHM/direct/direct PHLRR4036:24142:24263 [3] NCCL INFO Channel 01 : 3[13000] -> 0[5000] via SHM/direct/direct PHLRR4036:24144:24272 [4] NCCL INFO Connected all rings PHLRR4036:24144:24272 [4] NCCL INFO Channel 00 : 4[83000] -> 1[8000] via SHM/direct/direct 
PHLRR4036:24144:24272 [4] NCCL INFO Channel 01 : 4[83000] -> 1[8000] via SHM/direct/direct
PHLRR4036:24142:24263 [3] NCCL INFO Channel 00/0 : 3[13000] -> 2[d000] via P2P/IPC
PHLRR4036:24142:24263 [3] NCCL INFO Channel 01/0 : 3[13000] -> 2[d000] via P2P/IPC
PHLRR4036:24141:24258 [2] NCCL INFO Channel 00/0 : 2[d000] -> 10[d000] [send] via NET/IBext/0
PHLRR4036:24140:24267 [1] NCCL INFO Channel 00/0 : 1[8000] -> 0[5000] via P2P/IPC
PHLRR4036:24140:24267 [1] NCCL INFO Channel 01/0 : 1[8000] -> 0[5000] via P2P/IPC
PHLRR4036:24139:24253 [0] NCCL INFO Connected all trees
PHLRR4036:24140:24267 [1] NCCL INFO Connected all trees
PHLRR4036:24140:24267 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24139:24253 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24139:24253 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24140:24267 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24144:24272 [4] NCCL INFO Connected all trees
PHLRR4036:24144:24272 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24144:24272 [4] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24146:24268 [5] NCCL INFO Connected all trees
PHLRR4036:24146:24268 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24146:24268 [5] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24141:24258 [2] NCCL INFO Channel 01/0 : 2[d000] -> 10[d000] [send] via NET/IBext/0
PHLRR4036:24141:24285 [2] NCCL INFO NCCL_IB_TIMEOUT set by environment to 300.
PHLRR4036:24141:24258 [2] NCCL INFO Connected all trees
PHLRR4036:24141:24258 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24141:24258 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24142:24263 [3] NCCL INFO Connected all trees
PHLRR4036:24142:24263 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
PHLRR4036:24142:24263 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
PHLRR4036:24141:24258 [2] NCCL INFO comm 0x55ee705f7690 rank 2 nranks 16 cudaDev 2 busId d000 - Init COMPLETE
PHLRR4036:24144:24272 [4] NCCL INFO comm 0x564611e72a30 rank 4 nranks 16 cudaDev 4 busId 83000 - Init COMPLETE
PHLRR4036:24146:24268 [5] NCCL INFO comm 0x55822f760a60 rank 5 nranks 16 cudaDev 5 busId 89000 - Init COMPLETE
PHLRR4036:24150:24264 [7] NCCL INFO comm 0x555644620380 rank 7 nranks 16 cudaDev 7 busId 91000 - Init COMPLETE
PHLRR4036:24140:24267 [1] NCCL INFO comm 0x5612afe990e0 rank 1 nranks 16 cudaDev 1 busId 8000 - Init COMPLETE
PHLRR4036:24142:24263 [3] NCCL INFO comm 0x5562191160f0 rank 3 nranks 16 cudaDev 3 busId 13000 - Init COMPLETE
PHLRR4036:24148:24269 [6] NCCL INFO comm 0x55cede6d3870 rank 6 nranks 16 cudaDev 6 busId 8e000 - Init COMPLETE
PHLRR4036:24139:24253 [0] NCCL INFO comm 0x5634028d2ec0 rank 0 nranks 16 cudaDev 0 busId 5000 - Init COMPLETE
PHLRR4036:24141:24296 [2] ib_plugin.c:978 NCCL WARN NET/IB : Got completion from peer 10.226.98.98<48598> with error 12, opcode 0, len 0, vendor err 129
PHLRR4036:24141:24296 [2] NCCL INFO include/net.h:35 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO transport/net.cc:1034 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO proxy.cc:520 -> 2
PHLRR4036:24141:24296 [2] NCCL INFO proxy.cc:684 -> 2 [Proxy Thread]
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error: unhandled system error, NCCL version 2.14.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error: NET/IB : Got completion from peer 10.226.98.98<48598> with error 12, opcode 0, len 0, vendor err 129
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24139 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24140 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24142 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24144 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24146 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24148 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 24150 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 2 (pid: 24141) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.0a0+d0d6b1f', 'console_scripts', 'torchrun')())
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/lambda7xx/.local/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
bloom-inference-scripts/bloom-ds-inference.py FAILED
Failures: