kubeflow / pytorch-operator


PytorchJob replicas have different node affinity behavior compared with Deployment #344

Open · Shuai-Xie opened this issue 3 years ago

Shuai-Xie commented 3 years ago

Hello.

Dear developers, I found a problem when using PyTorchJob.

Problem

I noticed that PyTorchJob replica pods don't obey the scheduling rules set in the node affinity: all the pods of a PyTorchJob replica tend to be scheduled onto the same node, and the weights set under preferredDuringSchedulingIgnoredDuringExecution seem to have no effect.

Example: PyTorchJob vs. Deployment

For example, with the node affinity below, both the PyTorchJob and the Deployment replica pods are expected to land only on nodes A1 and A2, with 1 Pod on A1 and 2 Pods on A2 (matching the 1:2 preference weights).

The YAML files are below.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: nginx
        image: nginx
        imagePullPolicy: IfNotPresent
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: machine
                operator: In
                values:
                - A1
                - A2
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                - key: machine
                  operator: In
                  values:
                  - A1
            - weight: 2
              preference:
                matchExpressions:
                - key: machine
                  operator: In
                  values:
                  - A2
apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "pytorch-dist-mnist-gloo"
  namespace: "default"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent       
              resources: 
                limits:
                  nvidia.com/gpu: 1       
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers: 
            - name: pytorch
              image: pytorch-dist-mnist:1.0
              imagePullPolicy: IfNotPresent
              resources: 
                limits:
                  nvidia.com/gpu: 1
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: machine
                    operator: In
                    values:
                    - A1
                    - A2
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 1
                  preference:
                    matchExpressions:
                    - key: machine
                      operator: In
                      values:
                      - A1
                - weight: 2
                  preference:
                    matchExpressions:
                    - key: machine
                      operator: In
                      values:
                      - A2

As the result below shows, the preferred weights in the node affinity work as expected for the Deployment but fail for the PyTorchJob.

My guess is that the default scheduling strategy for PyTorchJob replica pods tends to pack them onto as few nodes as possible, so that distributed training benefits from faster GPU communication.

$ k get pods -o wide
NAME                               READY   STATUS         RESTARTS   AGE   IP               NODE
# deployment replicas ok
myapp-798878df64-7r4wp             1/1     Running        0          3s    10.100.103.143   A1
myapp-798878df64-qc6nt             1/1     Running        0          3s    10.100.103.148   A2
myapp-798878df64-wr7l6             1/1     Running        0          3s    10.100.254.60    A2
# pytorchjob replicas all on A2
pytorch-dist-mnist-gloo-worker-0   1/1     Running        0          61s   10.100.103.172   A2
pytorch-dist-mnist-gloo-worker-1   1/1     Running        0          61s   10.100.103.144   A2
pytorch-dist-mnist-gloo-worker-2   1/1     Running        0          61s   10.100.103.146   A2
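
For reference, my understanding is that the preferred node-affinity weights are only one scoring signal among the default scheduler's plugins, and that the default spreading logic (SelectorSpread) only considers pods that share a Service, ReplicaSet, ReplicationController, or StatefulSet. Deployment replicas share a ReplicaSet and so receive a spread score, while PyTorchJob replicas do not, which may be why they all pack onto one node. A quick way to confirm which scheduler placed a replica pod (a minimal check, assuming kubectl access; in my cluster it returns the default scheduler, consistent with the pod YAML further down):

$ kubectl get pod pytorch-dist-mnist-gloo-worker-0 -o jsonpath='{.spec.schedulerName}'
default-scheduler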

Thanks a lot.

gaocegege commented 3 years ago

Can you please show us the output of kubectl get pod pytorch-dist-mnist-gloo-worker-0 -o yaml?

Shuai-Xie commented 3 years ago

Sure. The YAML output is below. Thanks a lot.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    cni.projectcalico.org/podIP: 10.100.103.167/32
    cni.projectcalico.org/podIPs: 10.100.103.167/32
    sidecar.istio.io/inject: "false"
  creationTimestamp: "2021-07-22T02:16:05Z"
  labels:
    controller-name: pytorch-operator
    group-name: kubeflow.org
    job-name: pytorch-dist-mnist-gloo
    pytorch-job-name: pytorch-dist-mnist-gloo
    pytorch-replica-index: "0"
    pytorch-replica-type: worker
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:sidecar.istio.io/inject: {}
        f:labels:
          .: {}
          f:controller-name: {}
          f:group-name: {}
          f:job-name: {}
          f:pytorch-job-name: {}
          f:pytorch-replica-index: {}
          f:pytorch-replica-type: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"773627d5-b463-45c9-9a17-134aec4c2b80"}:
            .: {}
            f:apiVersion: {}
            f:blockOwnerDeletion: {}
            f:controller: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:spec:
        f:affinity:
          .: {}
          f:nodeAffinity:
            .: {}
            f:preferredDuringSchedulingIgnoredDuringExecution: {}
            f:requiredDuringSchedulingIgnoredDuringExecution:
              .: {}
              f:nodeSelectorTerms: {}
        f:containers:
          k:{"name":"pytorch"}:
            .: {}
            f:args: {}
            f:env:
              .: {}
              k:{"name":"MASTER_ADDR"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"MASTER_PORT"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"PYTHONUNBUFFERED"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"RANK"}:
                .: {}
                f:name: {}
                f:value: {}
              k:{"name":"WORLD_SIZE"}:
                .: {}
                f:name: {}
                f:value: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:nvidia.com/gpu: {}
              f:requests:
                .: {}
                f:nvidia.com/gpu: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:initContainers:
          .: {}
          k:{"name":"init-pytorch"}:
            .: {}
            f:command: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:name: {}
            f:resources:
              .: {}
              f:limits:
                .: {}
                f:cpu: {}
                f:memory: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:terminationGracePeriodSeconds: {}
    manager: pytorch-operator.v1
    operation: Update
    time: "2021-07-22T02:16:05Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:cni.projectcalico.org/podIP: {}
          f:cni.projectcalico.org/podIPs: {}
    manager: calico
    operation: Update
    time: "2021-07-22T02:16:06Z"
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:initContainerStatuses: {}
        f:phase: {}
        f:podIP: {}
        f:podIPs:
          .: {}
          k:{"ip":"10.100.103.167"}:
            .: {}
            f:ip: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    time: "2021-07-22T02:16:38Z"
  name: pytorch-dist-mnist-gloo-worker-0
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: PyTorchJob
    name: pytorch-dist-mnist-gloo
    uid: 773627d5-b463-45c9-9a17-134aec4c2b80
  resourceVersion: "21438615"
  selfLink: /api/v1/namespaces/default/pods/pytorch-dist-mnist-gloo-worker-0
  uid: c7c73a69-a9c2-415d-aa14-56c93a24bc1b
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: machine
            operator: In
            values:
            - A1
        weight: 1
      - preference:
          matchExpressions:
          - key: machine
            operator: In
            values:
            - A2
        weight: 2
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: machine
            operator: In
            values:
            - A1
            - A2
  containers:
  - args:
    - --backend
    - gloo
    - --epochs
    - "2"
    env:
    - name: MASTER_PORT
      value: "23456"
    - name: MASTER_ADDR
      value: pytorch-dist-mnist-gloo-master-0
    - name: WORLD_SIZE
      value: "4"
    - name: RANK
      value: "1"
    - name: PYTHONUNBUFFERED
      value: "0"
    image: shuaix/pytorch-dist-mnist:1.0
    imagePullPolicy: IfNotPresent
    name: pytorch
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-p2txv
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  initContainers:
  - command:
    - sh
    - -c
    - until nslookup pytorch-dist-mnist-gloo-master-0; do echo waiting for master;
      sleep 2; done;
    image: alpine:3.10
    imagePullPolicy: IfNotPresent
    name: init-pytorch
    resources:
      limits:
        cpu: 100m
        memory: 20Mi
      requests:
        cpu: 50m
        memory: 10Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-p2txv
      readOnly: true
  nodeName: A2
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: default-token-p2txv
    secret:
      defaultMode: 420
      secretName: default-token-p2txv
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-07-22T02:16:59Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-07-22T02:17:00Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-07-22T02:17:00Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-07-22T02:16:05Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://8126bb20e0426584ca420352cc9684b25a555700dde4cba8cb242f6d3bb875c5
    image: shuaix/pytorch-dist-mnist:1.0
    imageID: docker-pullable://shuaix/pytorch-dist-mnist@sha256:e2b5a55c6a2c372620f951584e888e0f933b5a6c14f918f38ede10bd6de3f47c
    lastState: {}
    name: pytorch
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2021-07-22T02:16:59Z"
  hostIP: 10.252.192.43
  initContainerStatuses:
  - containerID: docker://d51564e9ee09fa847e245ead062b40db9764b5df776b5d819a1f4542744dfa89
    image: alpine:3.10
    imageID: docker-pullable://alpine@sha256:451eee8bedcb2f029756dc3e9d73bab0e7943c1ac55cff3a4861c52a0fdd3e98
    lastState: {}
    name: init-pytorch
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://d51564e9ee09fa847e245ead062b40db9764b5df776b5d819a1f4542744dfa89
        exitCode: 0
        finishedAt: "2021-07-22T02:16:59Z"
        reason: Completed
        startedAt: "2021-07-22T02:16:28Z"
  phase: Running
  podIP: 10.100.103.167
  podIPs:
  - ip: 10.100.103.167
  qosClass: Burstable
  startTime: "2021-07-22T02:16:26Z"
gaocegege commented 3 years ago

As shown in the pod spec, the node affinity is set, so I think it should be honored by the scheduler. Are there enough resources on A1?
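
For example, something like the following should list the allocatable GPUs per node (a minimal sketch; the nvidia.com/gpu resource name is taken from your pod spec, and the escaped-dot custom-columns path is an assumption about your kubectl version):

$ kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'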

Shuai-Xie commented 3 years ago

Yes. Both A1 and A2 have 4 unallocated GPUs.

$ k describe nodes

Name:               A1
...
Capacity:
  cpu:                48
  ephemeral-storage:  22888456Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263722000Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  21600257039028
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263619600Ki
  nvidia.com/gpu:     4
  pods:               110
Non-terminated Pods:          (11 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  default                     dcgm-exporter-1625661096-p8qwm                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  ingress-nginx               nginx1                                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         43h
  ingress-nginx               nginx2                                                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         43h
  kube-system                 calico-node-7x6t4                                                  250m (0%)     0 (0%)      0 (0%)           0 (0%)         14d
  kube-system                 kube-proxy-7ldzj                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  kube-system                 nvidia-device-plugin-daemonset-2f4tq                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  logging                     elasticsearch-logging-1                                            100m (0%)     1 (2%)      3Gi (1%)         3Gi (1%)       44h
  logging                     fluentd-v2.8.0-9pg47                                               100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     147m
  prometheus                  alertmanager-kube-prometheus-stack-1625-alertmanager-0             100m (0%)     100m (0%)   250Mi (0%)       50Mi (0%)      2d1h
  prometheus                  kube-prometheus-stack-1625-operator-764ddc77-2hk4p                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d1h
  prometheus                  kube-prometheus-stack-1625714272-prometheus-node-exporter-xps7v    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                550m (1%)    1100m (2%)
  memory             3522Mi (1%)  3622Mi (1%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     0            0
Events:              <none>

Name:               A2
...
Capacity:
  cpu:                48
  ephemeral-storage:  22888456Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263722000Ki
  nvidia.com/gpu:     4
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  21600257039028
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263619600Ki
  nvidia.com/gpu:     4
  pods:               110
Non-terminated Pods:          (13 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  default                     dcgm-exporter-1625661096-bxggr                                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  elastic-job                 etcd                                                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         9d
  istio-system                cluster-local-gateway-6b6cb58745-fzqbr                             100m (0%)     2 (4%)      128Mi (0%)       1Gi (0%)       7d16h
  knative-serving             autoscaler-5888bf7697-gj989                                        30m (0%)      300m (0%)   40Mi (0%)        400Mi (0%)     3d1h
  knative-serving             istio-webhook-7db84bf7bf-d5jc5                                     20m (0%)      200m (0%)   20Mi (0%)        200Mi (0%)     7d16h
  knative-serving             networking-istio-55d86868c6-wzh6h                                  30m (0%)      300m (0%)   40Mi (0%)        400Mi (0%)     7d16h
  kube-system                 calico-node-rgmfx                                                  250m (0%)     0 (0%)      0 (0%)           0 (0%)         14d
  kube-system                 kube-proxy-8rv4g                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  kube-system                 nvidia-device-plugin-daemonset-wdkbt                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
  kubeflow                    kfserving-controller-manager-0                                     100m (0%)     100m (0%)   200Mi (0%)       300Mi (0%)     7d16h
  logging                     fluentd-v2.8.0-z65nz                                               100m (0%)     0 (0%)      200Mi (0%)       500Mi (0%)     147m
  logging                     kibana-7d5cc86845-ntz9t                                            100m (0%)     1 (2%)      0 (0%)           0 (0%)         44h
  prometheus                  kube-prometheus-stack-1625714272-prometheus-node-exporter-wsqv4    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                730m (1%)   3900m (8%)
  memory             628Mi (0%)  2824Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:              <none>