Besides, I printed some logs to get more information. Before training started, there were two calls to _recv() on ranks 0 and 1:
receive from 2 tag: 2
receive from 3 tag: 2
After the training started, there were two calls to _send on rank 0:
Epoch: 0 Step 0 Learning rate: 0.100000
send from 0 to 2 tag: 2
Epoch: [0][0/48] Memory: 0.713 (2.716)
send from 0 to 2 tag: 5
and two calls to _send on rank 1:
Epoch: 0 Step 0 Learning rate: 0.100000
send from 1 to 3 tag 2
send from 1 to 3 tag 5
Epoch: [0][0/48] Memory: 0.713 (2.716)
But on rank 2 and rank 3, there were four calls to _recv each:
receive from 0 tag: 2
receive from 1 tag: 2
receive from 0 tag: 5
receive from 1 tag: 5
There seemed to be an inconsistency between the NCCL communications on these ranks. I am not sure whether this is the cause of the problem, or why it is happening. Thank you.
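(As a minimal, illustrative sketch of that symptom, not PipeDream code: under torch.distributed point-to-point, a recv that never gets a matching send from the expected peer with the expected tag simply blocks forever. The script name and launch command below are placeholders.)

```python
# Minimal illustrative sketch, not PipeDream code: every recv posted on one rank
# must be matched by a send with the same peer and tag on the other rank,
# otherwise the unmatched recv blocks forever.
# Launch with e.g.: python -m torch.distributed.launch --nproc_per_node=2 demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")  # gloo send/recv honors tags
    rank = dist.get_rank()
    tensor = torch.zeros(4)
    if rank == 0:
        dist.send(tensor, dst=1, tag=2)  # matched by rank 1's first recv
        # If this tag-5 send were missing, rank 1's second recv below would
        # block forever -- the same behavior as the stuck stages above.
        dist.send(tensor, dst=1, tag=5)
    else:
        dist.recv(tensor, src=0, tag=2)
        dist.recv(tensor, src=0, tag=5)
    print("rank", rank, "done")

if __name__ == "__main__":
    main()
```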
Hi @MonicaGu, are you using this commit: https://github.com/msr-fiddle/pipedream/commit/cad624f79a71f44ba79099f0c38321347b13e5c2?
Yes I am using the latest commit. Thanks for checking this out!
I used NCCL environment variables to look at the NCCL logs and found that there was no NCCL broadcast collective at all, which confused me. None of the calls to dist.broadcast returned, either.
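(For reference, the NCCL logging mentioned here can be turned on with NCCL's standard environment variables, set before the process group is created; a minimal sketch:)

```python
# Minimal sketch: NCCL's own logging is controlled by standard environment
# variables and must be set before the NCCL communicator is created.
import os
os.environ["NCCL_DEBUG"] = "INFO"         # per-rank init/collective logging
os.environ["NCCL_DEBUG_SUBSYS"] = "COLL"  # optionally restrict to collective calls

import torch.distributed as dist
# ... then dist.init_process_group(backend="nccl", ...) as before
```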
Oh! You're using NCCL for inter-stage communication. That won't work, since NCCL isn't really thread-safe (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#using-multiple-nccl-communicators-concurrently).
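(A minimal, illustrative sketch of the split this implies; it is not PipeDream's actual runtime code. The idea is to keep any NCCL usage on a single thread per process and route the threaded point-to-point transfers over a gloo group:)

```python
# Illustrative sketch only, not PipeDream's runtime code. Assumes one GPU per
# rank and a launcher (e.g. torchrun) that sets RANK/WORLD_SIZE/LOCAL_RANK and
# MASTER_ADDR/MASTER_PORT.
import os
import torch
import torch.distributed as dist

torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

dist.init_process_group(backend="nccl")     # default group: collectives only,
                                            # and only ever from a single thread
p2p_group = dist.new_group(backend="gloo")  # gloo group for the point-to-point
                                            # transfers driven by helper threads

rank = dist.get_rank()
tensor = torch.ones(4)                      # CPU tensor is fine for gloo send/recv
if rank == 0:
    dist.send(tensor, dst=1, group=p2p_group, tag=2)
elif rank == 1:
    dist.recv(tensor, src=0, group=p2p_group, tag=2)
```

With PipeDream itself, the practical takeaway is simply to use gloo rather than NCCL for inter-stage communication, as the README says.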
Thank you so much! I should have read the README more carefully.
I used 4 GPUs on 1 server to train the vgg16 model with the configs in the repo. However, the processes always got stuck at the first communication between ranks during training. After the first stage finished running forward, the second stage never started running forward because it could not receive data from the first stage. The commands and outputs are like this: On rank 0:
The output on rank 1 is almost the same.
On rank 2:
On rank 3:
I have tried other configs as well. When I used the data parallel config, there was no such problem. The problem only occurred when I used the model parallel config or the hybrid ones. So this seems to be a problem with model parallelism.
Besides, when broadcasting tensor shapes, I hit the error
RuntimeError: Tensors must be CUDA and dense
so I added ".cuda()" to the lines that create the "tensor_shape" tensor myself. But I think that is not related to this problem. I did not change anything else in PipeDream except the dataset and a little bit of the model structure. I also ran /runtime/tests/communication/point_to_point.py with the nccl backend and there seemed to be no problem:
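(For concreteness, a sketch of the kind of change I mean; the function and variable names are illustrative, not necessarily the ones used in PipeDream's runtime:)

```python
# Sketch of the local change described above; names are illustrative only.
import torch
import torch.distributed as dist

def broadcast_tensor_shape(shape, src_rank, backend):
    # Non-source ranks pass a placeholder shape of the same length.
    tensor_shape = torch.tensor(shape, dtype=torch.int)
    if backend == "nccl":
        # NCCL collectives only operate on dense CUDA tensors, which is what
        # the "Tensors must be CUDA and dense" error complains about.
        tensor_shape = tensor_shape.cuda()
    dist.broadcast(tensor_shape, src=src_rank)
    return tuple(tensor_shape.tolist())
```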
I don't know whether I used the wrong command line or whether there is some bug in your code. I would really appreciate your help!