msr-fiddle / pipedream


None of the ranks make training progress; they block indefinitely #39

Open ADAM-CT opened 4 years ago

ADAM-CT commented 4 years ago

My environment:

server1: 4 GPUs
server2: 4 GPUs

Initialization completes, but none of the ranks make training progress; they all block indefinitely.

Here is the output of each rank:

in rank0:

```
Finished initializing process group; backend: gloo, rank: 0, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
```

in rank1:

```
Finished initializing process group; backend: gloo, rank: 1, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
```

in rank2:

```
Finished initializing process group; backend: gloo, rank: 2, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
```

in rank3:

```
Finished initializing process group; backend: gloo, rank: 3, world_size: 8
Replicating stage: ranks=4, module_size=775424.000
Send ranks: {'out0': [4, 5, 6], 'target': [4, 5, 6]}
Receive ranks: {}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 5004 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 51380736.000 bytes
Epoch: 0 Step 0 Learning rate: 0.040000
Epoch: [0][0/5004] Memory: 8.499 (9.475)
```

in rank4:

```
Finished initializing process group; backend: gloo, rank: 4, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
```

in rank5:

```
Finished initializing process group; backend: gloo, rank: 5, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.209 (1.856)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
```

in rank6:

```
Finished initializing process group; backend: gloo, rank: 6, world_size: 8
Replicating stage: ranks=3, module_size=3678208.000
Send ranks: {'out1': [7], 'target': [7]}
Receive ranks: {'out0': [0, 1, 2, 3], 'target': [0, 1, 2, 3]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 6672 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 51380736.000 bytes send_tensors 0.000 seconds send_tensors_size 25690624.000 bytes
Epoch: 0 Step 0 Learning rate: 0.030000
Epoch: [0][0/6672] Memory: 1.156 (1.804)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690112.000 bytes send_tensors 0.000 seconds send_tensors_size 51380224.000 bytes
Optimizer step took: 0.002
```

in rank7:

```
Finished initializing process group; backend: gloo, rank: 7, world_size: 8
Send ranks: {}
Receive ranks: {'out1': [4, 5, 6], 'target': [4, 5, 6]}
Setting up process groups for broadcasts...
Letting in 0 warm-up minibatches
Running training for 20016 minibatches
Forward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 25690624.000 bytes send_tensors 0.000 seconds send_tensors_size 0.000 bytes
Epoch: 0 Step 0 Learning rate: 0.010000
Epoch: [0][0/20016] Time: 8.293 (8.293) Epoch time [hr]: 0.002 (46.107) Memory: 1.284 (1.636) Loss: 6.9063 (6.9063) Prec@1: 0.000 (0.000) Prec@5: 0.000 (0.000)
Backward Stats: compute_time 0.000 seconds receive_tensors 0.000 seconds receive_tensors_size 0.000 bytes send_tensors 0.000 seconds send_tensors_size 25690112.000 bytes
Optimizer step took: 0.005
```
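From these logs, ranks 0-3 on server1 have empty receive ranks and only send to ranks 4-6, while ranks 4-6 receive from 0-3, which suggests the processes are stuck on the cross-machine stage-to-stage communication rather than in the pipeline schedule itself. A minimal cross-node Gloo smoke test, run outside of PipeDream, can help rule out network or interface-selection problems (for example the wrong NIC being picked up, which `GLOO_SOCKET_IFNAME` controls). This is only a diagnostic sketch, not part of the original report; all names are standard `torch.distributed` APIs, and it assumes `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` are set by your launcher:

```python
# Diagnostic sketch (independent of PipeDream): minimal cross-node Gloo check.
# If this also hangs between server1 and server2, the problem is connectivity
# (firewall, NIC selection via GLOO_SOCKET_IFNAME, hostname resolution),
# not the PipeDream runtime.
import torch
import torch.distributed as dist

def main():
    # Assumes MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are set in the environment.
    dist.init_process_group(backend="gloo", init_method="env://")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Each rank contributes a tensor filled with its own rank id; after the
    # all_reduce every rank should hold sum(range(world_size)) in each element.
    t = torch.full((1024,), float(rank))
    dist.all_reduce(t)
    expected = float(sum(range(world_size)))
    print(f"rank {rank}/{world_size}: all_reduce ok = {bool((t == expected).all())}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this completes on all 8 processes across both servers but PipeDream still blocks, connectivity is fine and the problem is more likely in the stage configuration or the send/receive schedule.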

ADAM-CT commented 4 years ago

Eventually it throws this runtime error:

```
Exception in thread Thread-9:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 619, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 654, in _recv
    group=sub_process_group)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 755, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:72] Timed out waiting 1800000ms for recv operation to complete
```
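The 1800000 ms in the message matches `torch.distributed`'s default 30-minute process-group timeout, so the receive on this rank was never matched by the corresponding send: the peer is either extremely slow or genuinely stuck. As a rough sketch (the argument names follow the standard `torch.distributed.init_process_group` signature, not necessarily the exact call in PipeDream's runtime), the timeout can be raised to tell the two cases apart:

```python
# Sketch: raise the process-group timeout from the default 30 minutes.
# This does not fix a hang, it only delays the RuntimeError, but it helps
# distinguish a very slow cross-node transfer from a send that never happens.
from datetime import timedelta
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",
    init_method="env://",        # assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set
    timeout=timedelta(hours=2),  # default is timedelta(minutes=30), i.e. the 1800000 ms above
)
```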

letianzhao commented 4 years ago

I have the same issue. Have you solved this problem? Thank you.

Q1Shane commented 3 years ago

I have the same issue! Have you found a solution?