Shuai-Xie opened this issue 3 years ago
Interesting. It should be the same since you are using the same dataset and the same random seed.
Can you please post the code here?
/cc @zw0610 @kubeflow/wg-training-leads
Sure @gaocegege. I've posted the code here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.
Thanks for your kind reply.
Hi @gaocegege, I've got more results, and they show that the random state isn't the problem.
The function below is used to validate that BM and PJ generate the same random values when using the same seed.
Each process launched by the DDP training invokes this function and prints the random values.
```python
import random
import numpy as np
import torch

def print_random_vals(rank, num=1):
    for n in range(num):
        print('rank {} group {}, random: {}, np: {}, torch: {}'.format(
            rank, n, random.random(), np.random.rand(1), torch.rand(1)))
```
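For completeness, the seeding I assume on both BM and PJ is just the usual three calls (the helper name `set_seed` is illustrative, not copied from the repo):

```python
import random
import numpy as np
import torch

def set_seed(seed):
    # Seed the three RNGs sampled by print_random_vals; the same value is
    # passed on every rank on both BM and PJ.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # also seeds the CUDA generators in recent PyTorch
```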
Here are the results.
```
# seed = 1
# 48
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 2 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 3 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
# 49
rank 5 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 7 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 4 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 6 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
# seed = 10
# 48
rank 2 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 3 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 1 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 0 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
# 49
rank 7 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 5 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 6 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 4 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
# seed = 100
# 48
rank 2 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 3 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 1 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 0 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
# 49
rank 6 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 7 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 4 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 5 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
```
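For reference, each worker runs this check right after the process group is initialized and the seed is set; a sketch (the backend, env vars, and seed value here are illustrative):

```python
import os
import torch.distributed as dist

def check_randomness():
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are set by launch.py or by
    # the PyTorchJob Pod environment; the mp.spawn variants pass rank and
    # world_size explicitly instead.
    rank = int(os.environ['RANK'])
    dist.init_process_group(backend='nccl', init_method='env://')
    set_seed(1)              # same seed everywhere, see the helper above
    print_random_vals(rank)  # every rank should print identical values
```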
Still, we turned on the host network and printed the random values generated by each process of the DDP training with different numbers of Pods (2/4/8, matching the experiments in 2. PJ DDP training).
Clearly, they are the same as on BM.
```
# seed = 1
# 8 Pods
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
...
# 4 Pods
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
...
# 2 Pods
rank 0 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122, np: [0.417022], torch: tensor([0.7576])
...
# seed = 10
# 4 Pods
rank 0 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
rank 1 group 0, random: 0.5714025946899135, np: [0.77132064], torch: tensor([0.4581])
...
# seed = 100
# 8 Pods
rank 0 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
rank 1 group 0, random: 0.1456692551041303, np: [0.54340494], torch: tensor([0.1117])
...
```
With the host network still turned on, the training results are below.
```
# launch.py
# 8 Pod * 1 card
training seconds: 19.675018310546875
best_acc: 55.5
# 4 Pod * 2 card
training seconds: 19.006036043167114
best_acc: 55.5
# 2 Pod * 4 card
training seconds: 19.31661581993103 # same as BM
best_acc: 65.69
# mp tcp
# 8 Pod * 1 card
training seconds: 22.436177015304565
best_acc: 55.5
# 4 Pod * 2 card
training seconds: 21.974145889282227
best_acc: 59.18
# 2 Pod * 4 card
training seconds: 22.730929851531982 # same as BM
best_acc: 65.69
# launch.py
# 8 Pod * 1 card
training seconds: 19.475943565368652
best_acc: 61.46
# 4 Pod * 2 card
training seconds: 19.309614658355713
best_acc: 59.18
# 2 Pod * 4 card
training seconds: 20.99683380126953 # same as BM
best_acc: 69.3
# mp tcp
# 8 Pod * 1 card
training seconds: 22.409621953964233
best_acc: 59.18
# 4 Pod * 2 card
training seconds: 22.37164807319641
best_acc: 57.45
# 2 Pod * 4 card
training seconds: 22.644285917282104 # same as BM
best_acc: 69.3
```
**As before, only `2 Pod * 4 card` gets the same results as BM.**
The others give fairly random results; for example, the results of `8 Pod * 1 card` and `4 Pod * 2 card` are sometimes the same and sometimes different.
What confuses me most is that I still get different results even when I use the officially recommended way to launch 8 Pods with the same seed. For example, when seed = 1, PJ gives a different result from BM whether the host network is true or false.
BM
```
Train Epoch: 0 [20/30] loss=0.9048
training seconds: 22.74862265586853
best_acc: 64.12
```
PJ
```
# host network = false
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 20.179935455322266
best_acc: 67.97

# host network = true
Train Epoch: 0 [20/30] loss=0.6989
training seconds: 18.878486394882202
best_acc: 67.97
```
For now, I can only suspect that the PJ `allreduce` operation gathers gradients differently from BM.
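For example, one direct way to test that suspicion would be to all-reduce a deterministic per-rank tensor and compare the printed sums between BM and PJ (a sketch, not code from the linked repo; backend, shapes, and seeds are illustrative):

```python
import torch
import torch.distributed as dist

def check_allreduce(rank, local_rank):
    # Assumes dist.init_process_group(...) has already been called.
    # Each rank contributes a tensor derived only from its global rank, so the
    # reduced sum is fully determined and must be identical on BM and PJ.
    torch.cuda.set_device(local_rank)
    gen = torch.Generator().manual_seed(1234 + rank)
    t = torch.rand(4, generator=gen).cuda()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print('rank {} allreduce: {}'.format(rank, t.tolist()))
```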
Thanks a lot.
Things are getting weirder @gaocegege.
PytorchJob version may have an effect on the training reproduction https://github.com/kubeflow/pytorch-operator/issues/355#issue-1001742213.
Please let me know if I wrote anything wrong in the code. Thanks a lot.
Thanks for your detailed reply!
> PytorchJob version may have an effect on the training reproduction #355 (comment).
Do you mean PyTorch version?
😂 Yes @gaocegege. I'm sorry for making this mistake. I'll change it right away.
By the way, are there any clues now? Many thanks.
Dear developers, I got a new problem.
I've compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results.
Experiment settings
Experiment results
1. BM DDP training
I recorded the training process for three ways of launching DDP training (a sketch of the `mp.spawn()` + `tcp://` variant follows the list):

- `torch.distributed.launch` with the default `init_method` `env://`
- `mp.spawn()` with `tcp://`
- `mp.spawn()` with `file://`
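For concreteness, the `mp.spawn()` + `tcp://` variant roughly looks like the sketch below; `launch.py` does the same thing but passes the rendezvous information through environment variables (`env://`). The address, port, and function names here are illustrative:

```python
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, node_rank, nproc_per_node, world_size):
    rank = node_rank * nproc_per_node + local_rank
    dist.init_process_group(
        backend='nccl',
        init_method='tcp://10.0.0.1:23456',  # master machine address/port, illustrative
        rank=rank,
        world_size=world_size)
    # ... build model / wrap with DDP / train ...

if __name__ == '__main__':
    # e.g. experiment (1.1): nproc_per_node=4, nnodes=2
    nproc_per_node, nnodes, node_rank = 4, 2, 0  # node_rank is 1 on the second machine
    mp.spawn(worker,
             args=(node_rank, nproc_per_node, nproc_per_node * nnodes),
             nprocs=nproc_per_node)
```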
And the results are below.
(1.1) 2 machines, `nproc_per_node=4`, `nnodes=2`
(1.2) 2 machines, `nproc_per_node=2`, `nnodes=4`
(1.3) 2 machines, `nproc_per_node=1`, `nnodes=8`
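For concreteness, the three settings differ only in how the 8 processes are spread over nodes; the world size is identical:

```python
# (nproc_per_node, nnodes) for experiments (1.1), (1.2), (1.3)
configs = [(4, 2), (2, 4), (1, 8)]
print([p * n for p, n in configs])  # -> [8, 8, 8], the same world size in every setting
```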
The 3 experiments above show that **as long as the total number of processes (`nproc_per_node * nnodes`) is the same (e.g. 8 in this setting), the training process has no relation to the number of distributed nodes `nnodes`**, because the training loss is reproduced and the test accuracies are equal.

2. PJ DDP training
When using PJ DDP training, I expect to see the same results as on BM.
However, the experiment results confuse me.
Before running the same experiment groups as on BM, I used the recommended way to launch DDP training.
The YAML file is below.
It launches 8 Pods, which is similar to experiment (1.3).
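(For reference, inside each Pod the training script only relies on the rendezvous variables injected by the operator; a minimal sketch of what I assume each replica sees, with the actual values depending on the job:)

```python
import os

# Rendezvous variables injected into every PyTorchJob replica (values are job-specific).
for key in ('MASTER_ADDR', 'MASTER_PORT', 'RANK', 'WORLD_SIZE'):
    print(key, '=', os.environ.get(key))
```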
However, I get the results below, which are quite different from the BM results.
At first, I suspected that the BM OS and the PytorchJob Pod OS generate different random states.
However, the following experiments show that this is not the cause.
We set `hostNetwork=true` in all the experiments below.

(2.1) 2 Pod * 4 cards
(2.2) 4 Pod * 2 cards
(2.3) 8 Pod * 1 card
Only exp (2.1) gets the same results as BM. This really confuses me.
Dear developers, please let me know if I made some mistakes.
Thanks a lot.
Happy Mid-Autumn Festival!