kubeflow / pytorch-operator

PyTorch on Kubernetes

Different DDP training results of PytorchJob and Bare Metal #354

Open Shuai-Xie opened 3 years ago

Shuai-Xie commented 3 years ago

Dear developers, I've run into a new problem.

I've compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results.

Experiment settings

Experiment results

1. BM DDP training

I recorded the training process for three ways of launching DDP training.

The results are below.

(1.1) 2 machines, nproc_per_node=4, nnodes=2

# launch
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="10.252.192.48" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 37.32453942298889     # 18.465958833694458
best_acc: 64.12

# mp tcp
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 41.56801748275757
best_acc: 64.12

# mp file
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=0 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile"  --epochs=1 --batch-size=256
$ python mnist_ddp_mp.py --nproc_per_node=4 --nnodes=2 --node_rank=1 --dist-url="file:///export/nfs/xs/codes/pytorch_operator_example/sharedfile"  --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 41.899426221847534
best_acc: 64.12
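
For context, a minimal sketch of what mnist_ddp_mp.py is assumed to do (function names and defaults here are my assumptions; the real script may differ): spawn nproc_per_node processes per node, derive each process's global rank from node_rank, and initialize the process group from --dist-url.

# Minimal sketch (assumption) of the mp-style entry point used above;
# the real mnist_ddp_mp.py may differ.
import argparse
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, args):
    # Global rank = node_rank * nproc_per_node + local_rank.
    rank = args.node_rank * args.nproc_per_node + local_rank
    world_size = args.nnodes * args.nproc_per_node
    dist.init_process_group(backend='nccl', init_method=args.dist_url,
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, train ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--nproc_per_node', type=int, default=1)
    parser.add_argument('--nnodes', type=int, default=1)
    parser.add_argument('--node_rank', type=int, default=0)
    parser.add_argument('--dist-url', dest='dist_url', type=str, default='env://')
    args, _ = parser.parse_known_args()
    mp.spawn(worker, nprocs=args.nproc_per_node, args=(args,))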

(1.2) 2 machines, nproc_per_node=2, nnodes=4

# launch
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=0 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=1 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=2 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 --nnodes=4 --node_rank=3 --master_addr="10.252.192.49" --master_port=22222 mnist_ddp_launch.py --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 38.14672040939331
best_acc: 64.12

# mp tcp
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0,1 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2,3 python mnist_ddp_mp.py --nproc_per_node=2 --nnodes=4 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 34.46080470085144
best_acc: 64.12

(1.3) 2 machines, nproc_per_node=1, nnodes=8

# mp tcp
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=0 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=1 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=2 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=3 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=0 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=4 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=1 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=5 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=2 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=6 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256
CUDA_VISIBLE_DEVICES=3 python mnist_ddp_mp.py --nproc_per_node=1 --nnodes=8 --node_rank=7 --dist-url="tcp://10.252.192.49:22222" --epochs=1 --batch-size=256

Train Epoch: 0 [20/30]  loss=0.9048
training seconds: 42.66786456108093
best_acc: 64.12

The 3 experiments above show that BM DDP training gives identical results (loss = 0.9048, best_acc = 64.12) regardless of the launch method and the process topology; only the training time differs.

2. PJ DDP training

When using PJ DDP training, I expected to see the same results as BM.

However, the experiment results confuse me.

Before running the same experiment group as on BM, I used the recommended way to launch DDP training.

The YAML file is below.

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "mnist-ddp"
  namespace: "default"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 1
          hostIPC: true
          hostNetwork: true
          dnsPolicy: "ClusterFirstWithHostNet"
    Worker:
      replicas: 7
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: shuaix/pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              command:
                [
                  "python", "mnist_ddp_launch.py", "--epochs=1", "--batch-size=256",
                ]
              resources:
                limits:
                  nvidia.com/gpu: 1
          hostIPC: true
          hostNetwork: true
          dnsPolicy: "ClusterFirstWithHostNet"

It launches 8 Pods, which is similar to experiment (1.3).

However, I get the results below, which are quite different from the BM results.

# pod network
Train Epoch: 0 [20/30]  loss=0.6989
training seconds: 28.12745976448059
best_acc: 67.97

# host network
Train Epoch: 0 [20/30]  loss=0.6989
training seconds: 27.12745976448059
best_acc: 67.97
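
For reference, with this recommended launch the script runs one process per Pod without torch.distributed.launch, so it is assumed to initialize DDP from the environment variables that the operator injects into each Pod (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK), roughly like the sketch below (the function name and details are assumptions).

# Sketch (assumption): single-process-per-Pod initialization from the
# environment variables injected by the operator; the real script may differ.
import os
import torch
import torch.distributed as dist

def init_from_operator_env():
    rank = int(os.environ['RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    # 'env://' itself reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE.
    dist.init_process_group(backend='nccl', init_method='env://')
    torch.cuda.set_device(0)  # each Pod requests exactly one GPU
    return rank, world_size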

At first, I suspected that the BM OS and the PytorchJob Pod OS generate different random states.

However, the following experiments show that this is not the cause.

We set hostNetwork=true in all the experiments below.

(2.1) 2 Pod * 4 cards

# launch.py
# container command
[
  "sh",
  "-c",
  "python -m torch.distributed.launch --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.9048         # same as all BM results
training seconds: 48.71152639389038
best_acc: 64.12

# mp tcp
# container command
[
  "sh",
  "-c",
  "python mnist_ddp_mp.py --nnodes=2 --nproc_per_node=4 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.9048         # same as all BM results
training seconds: 51.17721652984619
best_acc: 64.12

(2.2) 4 Pod * 2 cards

# launch.py
# container command
[
  "sh",
  "-c",
  "python -m torch.distributed.launch --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.8723
training seconds: 48.09228801727295
best_acc: 39.76

# mp tcp
# container command
[
  "sh",
  "-c",
  "python mnist_ddp_mp.py --nnodes=4 --nproc_per_node=2 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.6989         
training seconds: 52.30190896987915
best_acc: 67.97

(2.3) 8 Pod * 1 cards

# launch.py
# container command
[
  "sh",
  "-c",
  "python -m torch.distributed.launch --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --master_addr=10.252.192.48 --master_port=33333 mnist_ddp_launch.py --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.6989
training seconds: 26.12745976448059
best_acc: 67.97

# mp tcp
# container command
[
  "sh",
  "-c",
  "python mnist_ddp_mp.py --nnodes=8 --nproc_per_node=1 --node_rank=${RANK} --dist-url=tcp://10.252.192.48:33333 --epochs=1 --batch-size=256",
]

Train Epoch: 0 [20/30]  loss=0.8723
training seconds: 52.18285155296326
best_acc: 39.76

Only experiment (2.1) gets the same results as BM. This really confuses me.

Dear developers, please let me know if I made some mistakes.

Thanks a lot.

Happy Mid-Autumn Festival!

gaocegege commented 3 years ago

Interesting. It should be the same since you are using the same dataset and the same random seed.

Can you please post the code here?

/cc @zw0610 @kubeflow/wg-training-leads

Shuai-Xie commented 3 years ago

Sure @gaocegege. I've posted the code here: https://github.com/Shuai-Xie/mnist-pytorchjob-example.

Thanks for your kind reply.

Shuai-Xie commented 3 years ago

Hi @gaocegege, I've got more results and proved that random state isn't the problem.

Check the random values generated on BM and PJ

The function below is used to validate that BM and PJ generate the same random values when using the same seed.

Each process launched by DDP training will invoke this function and print the random values.

import random
import numpy as np
import torch

def print_random_vals(rank, num=1):
    for n in range(num):
        print('rank {} group {}, random: {},  np: {}, torch: {}'.format(
            rank, n, random.random(), np.random.rand(1), torch.rand(1)))
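
The seeding that runs before this check is assumed to look roughly like the sketch below (set_seed is a hypothetical helper; the actual code is in the repository linked above).

# Hypothetical seeding helper (assumption); the actual code is in the repo linked above.
import random
import numpy as np
import torch

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# In each DDP process, after init_process_group:
#   set_seed(args.seed)
#   print_random_vals(dist.get_rank())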

Here are the results.

(1) Different seeds on BM (the reference)

# seed = 1
# 48
rank 0 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 2 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 3 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
# 49
rank 5 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 7 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 4 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 6 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])

# seed = 10
# 48
rank 2 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
rank 3 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
rank 1 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
rank 0 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
# 49
rank 7 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
rank 5 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
rank 6 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
rank 4 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])

# seed = 100
# 48
rank 2 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
rank 3 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
rank 1 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
rank 0 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
# 49
rank 6 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
rank 7 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
rank 4 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
rank 5 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])

(2) Different seeds on PJ (the target)

Again, we turn on the host network and print the random values generated by each process of DDP training with different numbers of Pods (2/4/8, aligned with the experiments in 2. PJ DDP training).

Clearly, they are the same as on BM.

# seed = 1
# 8 Pods
rank 0 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
...

# 4 Pods
rank 0 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
...

# 2 Pods
rank 0 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
rank 1 group 0, random: 0.13436424411240122,  np: [0.417022], torch: tensor([0.7576])
...

# seed = 10
# 4 Pods
rank 0 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
rank 1 group 0, random: 0.5714025946899135,  np: [0.77132064], torch: tensor([0.4581])
...

# seed = 100
# 8 Pods
rank 0 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
rank 1 group 0, random: 0.1456692551041303,  np: [0.54340494], torch: tensor([0.1117])
...

More PJ training results with different seeds

The host network is still turned on.

(1) seed = 10

# launch.py
# 8 Pod * 1 card
training seconds: 19.675018310546875
best_acc: 55.5

# 4 Pod * 2 card
training seconds: 19.006036043167114
best_acc: 55.5

# 2 Pod * 4 card
training seconds: 19.31661581993103     # same as BM
best_acc: 65.69

# mp tcp
# 8 Pod * 1 card
training seconds: 22.436177015304565
best_acc: 55.5

# 4 Pod * 2 card
training seconds: 21.974145889282227
best_acc: 59.18

# 2 Pod * 4 card
training seconds: 22.730929851531982        # same as BM
best_acc: 65.69

(2) seed=100

# launch.py
# 8 Pod * 1 card
training seconds: 19.475943565368652
best_acc: 61.46

# 4 Pod * 2 card
training seconds: 19.309614658355713
best_acc: 59.18

# 2 Pod * 4 card
training seconds: 20.99683380126953     # same as BM
best_acc: 69.3

# mp tcp
# 8 Pod * 1 card
training seconds: 22.409621953964233
best_acc: 59.18

# 4 Pod * 2 card
training seconds: 22.37164807319641
best_acc: 57.45

# 2 Pod * 4 card
training seconds: 22.644285917282104        # same as BM
best_acc: 69.3

**As before, only `2 Pod * 4 card` gets the same results as BM.**

The others give fairly random results; for example, the results of 8 Pod * 1 card and 4 Pod * 2 card are sometimes the same and sometimes different.

What confuses me most is that I still get different results even when I use the officially recommended way to launch 8 Pods with the same seed. For example, when seed = 1, PJ gives a different result from BM no matter whether hostNetwork is true or false.

hostNetwork = true

Train Epoch: 0 [20/30]  loss=0.6989
training seconds: 18.878486394882202
best_acc: 67.97



For now, I can only suspect that the PJ `allreduce` operation gathers gradients differently from BM.
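
To narrow this down, a minimal allreduce sanity check could be run in each process (a sketch under my assumptions, not the scripts above): every rank contributes a known value and the reduced sum is compared with the expected total on every rank.

# Sketch (assumption): check allreduce itself, independent of the model.
# Assumes init_process_group has been called and each process owns one GPU.
import torch
import torch.distributed as dist

def check_allreduce(local_rank=0):
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    t = torch.tensor([rank + 1.0], device='cuda:{}'.format(local_rank))
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size + 1) / 2
    print('rank {}: allreduce sum = {}, expected {}'.format(rank, t.item(), expected))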

Thanks a lot.

Shuai-Xie commented 3 years ago

Things are getting weirder @gaocegege.

The PytorchJob version may have an effect on training reproducibility: https://github.com/kubeflow/pytorch-operator/issues/355#issue-1001742213.

Please let me know if I've made a mistake in the code. Thanks a lot.

gaocegege commented 3 years ago

Thanks for your detailed reply!

> The PytorchJob version may have an effect on training reproducibility. #355 (comment)

Do you mean PyTorch version?

Shuai-Xie commented 3 years ago

😂 Yes @gaocegege. I'm sorry for making this mistake. I'll fix it right away.

By the way, are there any clues now? Many thanks.