msr-fiddle / pipedream


Communication error when training with Pipedream #56

Closed gudiandian closed 4 years ago

gudiandian commented 4 years ago

I used 4 GPUs on 1 server to train the vgg16 model with the configs in the repo. However, the processes always got stuck at the first communication between ranks during training. After the first stage finished running forward, the second stage never started running forward because it could not receive data from the first stage. The commands and outputs are as follows. On rank 0:

jxt@graphics-gpu:~/test/pipedream/runtime/image_classification$ python main_with_runtime.py --module models.vgg16.gpus=4 -b 512 --data_dir /my/path/to data --rank 0 --local_rank 0 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf_2-2.json --distributed_backend nccl --epochs 1 --no_input_pipelining --num_ranks_in_server 4
Finished initializing process group; backend: nccl, rank: 0, world_size: 4
Replicating stage: ranks=2, module_size=775424.000
Send ranks:  {'out0': [2, 3], 'target': [2, 3]}
Receive ranks:  {}
Setting up process groups for broadcasts...

Epoch: 0 Step 0     Learning rate: 0.100000
Epoch: [0][0/48]    Memory: 0.713 (2.716)

The output on rank 1 is almost the same.

On rank 2:

jxt@graphics-gpu:~/test/pipedream/runtime/image_classification$ python main_with_runtime.py --module models.vgg16.gpus=4 -b 512 --data_dir  /my/path/to data --rank 2 --local_rank 2 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf_2-2.json --distributed_backend nccl --epochs 1 --no_input_pipelining --num_ranks_in_server 4
Finished initializing process group; backend: nccl, rank: 2, world_size: 4
Replicating stage: ranks=2, module_size=6578216.000
Send ranks:  {}
Receive ranks:  {'out0': [0, 1], 'target': [0, 1]}
Setting up process groups for broadcasts...
Exception in thread Thread-6:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 609, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 644, in _recv
    group=sub_process_group)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 808, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Stop_waiting response is expected

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 609, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 644, in _recv
    group=sub_process_group)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 808, in broadcast
    work = group.broadcast([tensor], opts)
RuntimeError: Stop_waiting response is expected

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 609, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 644, in _recv
    group=sub_process_group)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 808, in broadcast
    work = group.broadcast([tensor], opts)
MemoryError: std::bad_alloc

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "../communication.py", line 609, in recv_helper_thread
    sub_process_group=sub_process_group)
  File "../communication.py", line 644, in _recv
    group=sub_process_group)
  File "/home/jxt/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 808, in broadcast
    work = group.broadcast([tensor], opts)
MemoryError: std::bad_alloc

On rank 3:

jxt@graphics-gpu:~/test/pipedream/runtime/image_classification$ python main_with_runtime.py --module models.vgg16.gpus=4 -b 512 --data_dir /my/path/to data --rank 3 --local_rank 3 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf_2-2.json --distributed_backend nccl --epochs 1 --no_input_pipelining --num_ranks_in_server 4
Finished initializing process group; backend: nccl, rank: 3, world_size: 4
Replicating stage: ranks=2, module_size=6578216.000
Send ranks:  {}
Receive ranks:  {'out0': [0, 1], 'target': [0, 1]}
Setting up process groups for broadcasts...

I have tried other configs as well. When I used the data-parallel config, there was no such problem; the problem only occurred when I used the model-parallel config or the hybrid ones. So this seems to be a problem with model parallelism.

Besides, when broadcasting tensor shapes, I got the error RuntimeError: Tensors must be CUDA and dense, so I added ".cuda()" myself in the lines that create the "tensor_shape" tensor. I don't think this is related to the problem above. I did not change anything else in PipeDream except the dataset and a small part of the model structure.
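
For reference, the workaround looked roughly like this. It is a minimal sketch rather than the actual PipeDream code in communication.py, and the helper name below is made up for illustration; the point is just that the shape tensor has to live on the GPU before dist.broadcast is called with the NCCL backend:

import torch
import torch.distributed as dist

def broadcast_tensor_shape(shape, src_rank, group=None):
    # NCCL only handles dense CUDA tensors, so move the shape tensor to the
    # GPU before broadcasting it (this is where the ".cuda()" was added).
    tensor_shape = torch.tensor(shape, dtype=torch.int64).cuda()
    dist.broadcast(tensor_shape, src=src_rank, group=group)
    return tensor_shape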

I also ran /runtime/tests/communication/point_to_point.py with the NCCL backend and there seemed to be no problem:

python point_to_point.py --backend nccl --master_addr localhost --rank 1 --master_port 8888 --broadcast
Local rank: 1
Time to receive 4e-05 MB: 0.003 seconds
Throughput: 0.000 GB/s
Time to receive 0.0004 MB: 0.004 seconds
Throughput: 0.000 GB/s
Time to receive 0.004 MB: 0.003 seconds
Throughput: 0.001 GB/s
Time to receive 0.04 MB: 0.004 seconds
Throughput: 0.011 GB/s
Time to receive 0.4 MB: 0.004 seconds
Throughput: 0.113 GB/s
Time to receive 4.0 MB: 0.003 seconds
Throughput: 1.166 GB/s
Time to receive 40.0 MB: 0.003 seconds
Throughput: 11.646 GB/s
Time to receive 400.0 MB: 0.004 seconds
Throughput: 113.415 GB/s
Time to receive 3200.0 MB: 0.004 seconds
Throughput: 878.420 GB/s

I don't know whether I used the wrong command line or whether there is some bug in your code. I would really appreciate your help!

gudiandian commented 4 years ago

Besides, I tried printing some logs for more information. Before training started, there were two calls to _recv() on ranks 0 and 1:

receive from 2 tag:  2
receive from 3 tag:  2

After the training started, there were two calls to _send on rank 0:

Epoch: 0 Step 0     Learning rate: 0.100000
send from 0 to  2 tag:  2
Epoch: [0][0/48]    Memory: 0.713 (2.716)
send from 0 to  2 tag:  5

And two calls to _send on rank 1:

Epoch: 0 Step 0     Learning rate: 0.100000
send from 1 to  3 tag 2
send from 1 to  3 tag 5
Epoch: [0][0/48]    Memory: 0.713 (2.716)

But on rank 2 and rank 3, there were four calls to _recv each:

receive from 0 tag:  2
receive from 1 tag:  2
receive from 0 tag:  5
receive from 1 tag:  5

There seemed to be an inconsistency between the NCCL communications on these ranks. I am not sure whether this is the cause of the problem, or why it is happening. Thank you.

deepakn94 commented 4 years ago

Hi @MonicaGu, are you using this commit: https://github.com/msr-fiddle/pipedream/commit/cad624f79a71f44ba79099f0c38321347b13e5c2?

gudiandian commented 4 years ago

Hi @MonicaGu, are you using this commit: cad624f?

Yes, I am using the latest commit. Thanks for checking this out!

gudiandian commented 4 years ago

I just used NCCL environment variables to look at the NCCL logs. I found that no NCCL broadcast collective was issued at all, which confuses me, and none of the calls to dist.broadcast returned.
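
(For context, NCCL's own logging can be enabled with the standard environment variables NCCL_DEBUG and NCCL_DEBUG_SUBSYS, e.g. by prepending them to the commands above:

NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL python main_with_runtime.py --module models.vgg16.gpus=4 -b 512 ... --distributed_backend nccl ...

A broadcast collective would show up in this output if it were actually issued.)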

deepakn94 commented 4 years ago

Oh! You're using NCCL for inter-stage communication. That won't work, since NCCL isn't really thread-safe (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently).
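
(Assuming the README's recommendation here is the Gloo backend for inter-stage communication, the fix would be to run the same commands as above but with a different --distributed_backend flag, for example:

python main_with_runtime.py --module models.vgg16.gpus=4 -b 512 --data_dir /my/path/to data --rank 0 --local_rank 0 --master_addr 127.0.0.1 --config_path models/vgg16/gpus=4/hybrid_conf_2-2.json --distributed_backend gloo --epochs 1 --no_input_pipelining --num_ranks_in_server 4

That keeps the multi-threaded send/receive helper threads off NCCL.)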

gudiandian commented 4 years ago

Thank you so much! I should have read the README more carefully.