kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

master pod not getting started for pytorch job #2034

Open bharathappali opened 6 months ago

bharathappali commented 6 months ago

I'm trying to run the training-operator standalone on an OpenShift cluster with Katib. When I apply a PyTorchJob, the worker pods get created, but for some reason the master pod is never started.

Here is the event log of the worker pod:

Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       9m35s                  default-scheduler  Successfully assigned sampler/random-exp-jw6qxmrm-worker-0 to acorvin-hpo-poc-jfrlm-worker-0-twvtz
  Normal   AddedInterface  9m33s                  multus             Add eth0 [10.131.5.61/23] from openshift-sdn
  Normal   Pulling         9m33s                  kubelet            Pulling image "quay.io/bharathappali/alpine:3.10"
  Normal   Pulled          9m32s                  kubelet            Successfully pulled image "quay.io/bharathappali/alpine:3.10" in 1.065165424s (1.065174057s including waiting)
  Warning  BackOff         2m49s                  kubelet            Back-off restarting failed container init-pytorch in pod random-exp-jw6qxmrm-worker-0_sampler(8d6860a7-204d-45c8-bb57-8d84a6cf8e66)
  Normal   Created         2m34s (x3 over 9m31s)  kubelet            Created container init-pytorch
  Normal   Started         2m34s (x3 over 9m31s)  kubelet            Started container init-pytorch
  Normal   Pulled          2m34s (x2 over 6m11s)  kubelet            Container image "quay.io/bharathappali/alpine:3.10" already present on machine

I have changed the init container image because of the Docker Hub pull rate limit.

Here is the pod log:

nslookup: can't resolve 'random-exp-jw6qxmrm-master-0': Name does not resolve
waiting for master
nslookup: can't resolve '(null)': Name does not resolve
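
For context, the init container that the training-operator injects into worker pods just loops until the master's hostname resolves. A minimal sketch of that fragment, reconstructed from the log above (the exact command template is generated by the operator, so treat this as illustrative):

initContainers:
  - name: init-pytorch
    image: quay.io/bharathappali/alpine:3.10  # swapped in to avoid Docker Hub pull limits
    command:
      - sh
      - -c
      - until nslookup random-exp-jw6qxmrm-master-0; do echo waiting for master; sleep 2; done

Because the master pod and the Service that would make that name resolvable are never created, the lookup can never succeed, so the worker stays stuck in its init container.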

Here is the Katib Experiment (with a PyTorchJob trial template) I'm deploying:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-exp
  namespace: sampler
spec:
  maxTrialCount: 25
  parallelTrialCount: 3
  maxFailedTrialCount: 3
  resumePolicy: Never
  objective:
    type: maximize
    goal: 0.9
    objectiveMetricName: accuracy
    additionalMetricNames: []
  algorithm:
    algorithmName: bayesianoptimization
    algorithmSettings:
      - name: base_estimator
        value: GP
      - name: n_initial_points
        value: '10'
      - name: acq_func
        value: gp_hedge
      - name: acq_optimizer
        value: auto
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: '0.01'
        max: '0.03'
        step: '0.01'
  metricsCollectorSpec:
    collector:
      kind: StdOut
  trialTemplate:
    primaryContainerName: pytorch
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
    retain: false
    trialParameters:
      - name: learningRate
        reference: lr
        description: ''
    trialSpec:
      apiVersion: kubeflow.org/v1
      kind: PyTorchJob
      spec:
        pytorchReplicaSpecs:
          Master:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
          Worker:
            replicas: 1
            restartPolicy: OnFailure
            template:
              spec:
                containers:
                  - name: pytorch
                    image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
                    resources:
                      limits:
                        cpu: "1"
                        memory: "2Gi"
                      requests:
                        cpu: "1"
                        memory: "1Gi"
                    command:
                      - python3
                      - /opt/pytorch-mnist/mnist.py
                      - '--epochs=1'
                      - '--lr=${trialParameters.learningRate}'
                      - '--momentum=0.5'
johnugeorge commented 6 months ago

Is the master not up? random-exp-jw6qxmrm-master-0 doesn't resolve.

bharathappali commented 6 months ago

Yes, the master pod is not getting scheduled at all. The workers' init container keeps failing and the pods go into CrashLoopBackOff.

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

andreyvelich commented 3 months ago

@bharathappali Sorry for the late reply. Can you try to create your PyTorchJob without a Katib Experiment?
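
For reference, a standalone job can be derived from the trialSpec above by dropping the Experiment wrapper and hard-coding the trial parameter (the metadata name and the 0.01 learning rate below are illustrative substitutes; resources are omitted for brevity):

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: random-exp-standalone  # illustrative name
  namespace: sampler
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
              command:
                - python3
                - /opt/pytorch-mnist/mnist.py
                - '--epochs=1'
                - '--lr=0.01'  # hard-coded in place of ${trialParameters.learningRate}
                - '--momentum=0.5'
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: quay.io/bharathappali/pytorch-mnist-cpu:v0.16.0
              command:
                - python3
                - /opt/pytorch-mnist/mnist.py
                - '--epochs=1'
                - '--lr=0.01'
                - '--momentum=0.5'

If the master pod is still not created with this plain manifest, the problem is in the training-operator itself rather than in Katib.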

github-actions[bot] commented 5 days ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.