kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

PyTorchJob does not run #1856

Closed hongbo-miao closed 1 year ago

hongbo-miao commented 1 year ago

I deployed Kubeflow (including the Kubeflow Training Operator) in a local Kubernetes cluster with

export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"

Then I deployed a training job with

kubectl create --filename=https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml

It is stuck there forever:

➜ kubectl get pytorchjobs --namespace=kubeflow
NAME             STATE   AGE
pytorch-simple           30m

Any ideas? Thanks! 😃

johnugeorge commented 1 year ago

Can you check the events? Is there anything in the controller logs? Can you run kubectl describe pytorchjobs --namespace=kubeflow?
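
For reference, the events in the namespace can be listed with something like:

kubectl get events --namespace=kubeflow --sort-by=.lastTimestamp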

hongbo-miao commented 1 year ago

Thanks @johnugeorge !

I assume you mean the workflow controller pod log? I recreated the training job, and there is nothing helpful in this controller pod's log:

time="2023-07-10T16:51:03.999Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:04.003Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:09.011Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:09.018Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:12.781Z" level=info msg="List workflows 200"
time="2023-07-10T16:51:12.781Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=kubeflow
time="2023-07-10T16:51:14.023Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:14.026Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:19.033Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:19.038Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:24.044Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:24.050Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:29.056Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:29.061Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:30.401Z" level=info msg="Watch configmaps 200"
time="2023-07-10T16:51:31.425Z" level=info msg="Watch workflowtemplates 200"
time="2023-07-10T16:51:34.066Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:34.072Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:39.077Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:39.081Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:44.086Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:44.091Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:49.096Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:49.100Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:54.104Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:54.108Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:58.424Z" level=info msg="Watch workflowtaskresults 200"
time="2023-07-10T16:51:59.114Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:59.119Z" level=info msg="Update leases 200"
time="2023-07-10T16:52:04.125Z" level=info msg="Get leases 200"
time="2023-07-10T16:52:04.130Z" level=info msg="Update leases 200"

And here is the result of kubectl describe pytorchjobs --namespace=kubeflow:

Name:         pytorch-simple
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2023-07-10T05:54:56Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:pytorchReplicaSpecs:
          .:
          f:Master:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2023-07-10T05:54:56Z
  Resource Version:  6951698
  UID:               12dc5c33-f248-4b0a-81b6-aaa640f331f9
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
    Worker:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
Events:                         <none>
johnugeorge commented 1 year ago

I meant the training operator pod logs.
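
With the default deployment name, something like this should show them:

kubectl logs --namespace=kubeflow deployment/training-operator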

hongbo-miao commented 1 year ago

Hi @johnugeorge

I am using the default one that comes with Kubeflow, based on this doc:

By default, PyTorch Operator will be deployed as a controller in training operator.

I verified I have pytorchjobs.kubeflow.org with the command below. (I don't actually have the operator installed; the CRD only exists because I installed the standalone version before and then deleted the pod.)

➜ kubectl get crd
NAME                                                   CREATED AT
...
pytorchjobs.kubeflow.org                               2023-07-10T03:55:47Z
tfjobs.kubeflow.org                                    2023-07-10T03:55:47Z
xgboostjobs.kubeflow.org                               2023-07-10T03:55:48Z

I know that if I used the standalone training operator, I would have a pod called something like training-operator. Hmm, given that I am not using the standalone training operator, I just wonder which pod's log I should print out, thanks!

➜ kubectl get pods -n kubeflow
NAME                                               READY   STATUS    RESTARTS      AGE
metadata-writer-79d569c46f-km7nh                   1/1     Running   0             17h
metadata-envoy-deployment-59687d9798-f2bxl         1/1     Running   0             17h
ml-pipeline-persistenceagent-84f946b944-zcs5d      1/1     Running   0             17h
ml-pipeline-scheduledworkflow-54d88874b-mcd49      1/1     Running   0             17h
ml-pipeline-viewer-crd-75c6d588df-pwd4c            1/1     Running   0             17h
cache-deployer-deployment-779655b9f7-gr9z5         1/1     Running   0             17h
workflow-controller-5f6fdf89d7-pcg2z               1/1     Running   0             17h
ml-pipeline-ui-679784dfd6-c4r4h                    1/1     Running   0             17h
minio-549846c488-pb6q6                             1/1     Running   0             17h
ml-pipeline-visualizationserver-7f8f7fdbdc-w6w6k   1/1     Running   0             17h
mysql-5f968d4688-mtqr4                             1/1     Running   0             17h
cache-server-55c88c76c5-p9hpx                      1/1     Running   0             17h
metadata-grpc-deployment-6d744c66bb-k9w92          1/1     Running   2 (17h ago)   17h
ml-pipeline-867f66dc54-sfc2f                       1/1     Running   1 (17h ago)   17h
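
For completeness, a quick way to check whether the operator's controller is deployed at all (assuming the default deployment name) is:

kubectl get deployment training-operator --namespace=kubeflow
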
johnugeorge commented 1 year ago

There should be a training operator pod when you install Kubeflow. I see that Pipelines is the only component that is installed.

hongbo-miao commented 1 year ago

Sorry, I guess

export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=${PIPELINE_VERSION}"

does not install the training operator, right?

I was originally confused by this sentence at https://www.kubeflow.org/docs/components/training/pytorch/#installing-pytorch-operator

I thought that installing Kubeflow Pipelines would also install the training operator, which it does not:

(screenshot of the documentation)

I guess that after installing Kubeflow Pipelines, I have to install the training operator separately. Please correct me if I am wrong. I have another question at https://github.com/kubeflow/training-operator/issues/1855 regarding how the versions match.

Anyway, I will try Kubeflow Pipelines 2.0 and Kubeflow Training Operator 1.6 to see if they work together, and report the results. Thanks!

hongbo-miao commented 1 year ago

Thanks @johnugeorge !

I finally succeeded in deploying Kubeflow Training Operator 1.6 based on https://github.com/kubeflow/training-operator/issues/1841#issuecomment-1635334868

Here are my scripts:

# Install Kubeflow Pipelines
export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"

# Install Kubeflow Training Operator
# Steps are at https://github.com/kubeflow/training-operator/issues/1841#issuecomment-1635334868

# Create a PyTorch training job
kubectl create --filename=https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
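
For reference, a typical standalone install of Training Operator 1.6 (the exact steps in the linked comment may differ) would be something like:

kubectl apply --kustomize="github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"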

This demo PyTorch training job finished successfully:

(screenshot)

However, the job is not listed in my Kubeflow Pipelines UI:

(screenshot)

I feel that this Kubeflow Training Operator is not connected to my Kubeflow Pipelines correctly. Any ideas? Thanks!

Also, I just want to confirm: "Kubeflow Pipelines" does not include the "Kubeflow Training Operator", right? And they are supposed to be deployed individually?

johnugeorge commented 1 year ago

No. Kubeflow Pipelines is an ML workflow orchestrator. It is up to you to decide the workflow graph. If you want to see the training job inside the Pipelines UI, you have to trigger the job within a pipeline experiment.

hongbo-miao commented 1 year ago

I see, thank you so much, @johnugeorge !

Demo machine learning code is at https://github.com/Hongbo-Miao/hongbomiao.com/pull/9807/files, and I can see it start to train and show up in the UI 😃 (screenshot)