Can you check events? Anything in the controller logs? Can you do `kubectl describe pytorchjobs --namespace=kubeflow`?
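For example, a rough sketch (assuming everything runs in the kubeflow namespace):

```shell
# Recent events in the namespace, newest last
kubectl get events --namespace=kubeflow --sort-by=.lastTimestamp

# Status and events recorded on the PyTorchJob itself
kubectl describe pytorchjobs --namespace=kubeflow
```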
Thanks @johnugeorge!
I assume you mean the workflow controller pod log? I recreated the training job, and there is nothing helpful in this controller pod's log:
time="2023-07-10T16:51:03.999Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:04.003Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:09.011Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:09.018Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:12.781Z" level=info msg="List workflows 200"
time="2023-07-10T16:51:12.781Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=kubeflow
time="2023-07-10T16:51:14.023Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:14.026Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:19.033Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:19.038Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:24.044Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:24.050Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:29.056Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:29.061Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:30.401Z" level=info msg="Watch configmaps 200"
time="2023-07-10T16:51:31.425Z" level=info msg="Watch workflowtemplates 200"
time="2023-07-10T16:51:34.066Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:34.072Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:39.077Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:39.081Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:44.086Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:44.091Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:49.096Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:49.100Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:54.104Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:54.108Z" level=info msg="Update leases 200"
time="2023-07-10T16:51:58.424Z" level=info msg="Watch workflowtaskresults 200"
time="2023-07-10T16:51:59.114Z" level=info msg="Get leases 200"
time="2023-07-10T16:51:59.119Z" level=info msg="Update leases 200"
time="2023-07-10T16:52:04.125Z" level=info msg="Get leases 200"
time="2023-07-10T16:52:04.130Z" level=info msg="Update leases 200"
And here is the result of `kubectl describe pytorchjobs --namespace=kubeflow`:
```
Name:         pytorch-simple
Namespace:    kubeflow
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2023-07-10T05:54:56Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:pytorchReplicaSpecs:
          .:
          f:Master:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:restartPolicy:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2023-07-10T05:54:56Z
  Resource Version:  6951698
  UID:               12dc5c33-f248-4b0a-81b6-aaa640f331f9
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
    Worker:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Spec:
          Containers:
            Command:
              python3
              /opt/pytorch-mnist/mnist.py
              --epochs=1
            Image:              docker.io/kubeflowkatib/pytorch-mnist:v1beta1-45c5727
            Image Pull Policy:  Always
            Name:               pytorch
Events:  <none>
```
I meant the training operator pod logs.
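i.e. something like this (training-operator is the default deployment name; adjust if yours differs):

```shell
# Stream the controller logs from the training operator deployment
kubectl logs --namespace=kubeflow deployment/training-operator --follow
```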
Hi @johnugeorge,
I am using the default one from Kubeflow, based on this doc:
> By default, PyTorch Operator will be deployed as a controller in training operator.
I verified that I have the `pytorchjobs.kubeflow.org` CRD (the training operator itself is not installed; the CRD is left over because I installed the standalone version before and then deleted the pod) by:
```
➜ kubectl get crd
NAME                       CREATED AT
...
pytorchjobs.kubeflow.org   2023-07-10T03:55:47Z
tfjobs.kubeflow.org        2023-07-10T03:55:47Z
xgboostjobs.kubeflow.org   2023-07-10T03:55:48Z
```
I know that if I used the standalone training operator, I would have a pod named something like training-operator. Given I am not using the standalone training operator, I just wonder which pod's log I should print out. Thanks!
```
➜ kubectl get pods -n kubeflow
NAME                                               READY   STATUS    RESTARTS      AGE
metadata-writer-79d569c46f-km7nh                   1/1     Running   0             17h
metadata-envoy-deployment-59687d9798-f2bxl         1/1     Running   0             17h
ml-pipeline-persistenceagent-84f946b944-zcs5d      1/1     Running   0             17h
ml-pipeline-scheduledworkflow-54d88874b-mcd49      1/1     Running   0             17h
ml-pipeline-viewer-crd-75c6d588df-pwd4c            1/1     Running   0             17h
cache-deployer-deployment-779655b9f7-gr9z5         1/1     Running   0             17h
workflow-controller-5f6fdf89d7-pcg2z               1/1     Running   0             17h
ml-pipeline-ui-679784dfd6-c4r4h                    1/1     Running   0             17h
minio-549846c488-pb6q6                             1/1     Running   0             17h
ml-pipeline-visualizationserver-7f8f7fdbdc-w6w6k   1/1     Running   0             17h
mysql-5f968d4688-mtqr4                             1/1     Running   0             17h
cache-server-55c88c76c5-p9hpx                      1/1     Running   0             17h
metadata-grpc-deployment-6d744c66bb-k9w92          1/1     Running   2 (17h ago)   17h
ml-pipeline-867f66dc54-sfc2f                       1/1     Running   1 (17h ago)   17h
```
There should be a training operator pod when you install Kubeflow. I see that Pipelines is the only component that is installed.
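You can check with something like:

```shell
# If the training operator were installed, this should list its pod
kubectl get pods --namespace=kubeflow | grep training-operator
```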
Sorry, I guess
```shell
export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=${PIPELINE_VERSION}"
does not install training operator, right?
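One way I can check is to render the manifests locally and search for the operator (a sketch using kubectl kustomize):

```shell
# Build the remote kustomization and look for any training-operator resources
kubectl kustomize "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=2.0.0" \
  | grep --ignore-case training
```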
I was originally confused by this sentence at https://www.kubeflow.org/docs/components/training/pytorch/#installing-pytorch-operator
I thought that installing Kubeflow Pipelines also came with the training operator, which is not the case.
I guess that after installing Kubeflow Pipelines, I have to install the training operator separately. Please correct me if I am wrong. I have another question at https://github.com/kubeflow/training-operator/issues/1855 regarding how the versions should match.
Anyway, I will try Kubeflow Pipelines 2.0 and Kubeflow Training Operator 1.6 to see if they work together, and will report the results. Thanks!
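For reference, the separate install I plan to try looks like this (assuming the standalone overlay at tag v1.6.0):

```shell
# Standalone Kubeflow Training Operator manifests (the tag is an assumption; match it to your setup)
kubectl apply --kustomize="github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.6.0"
```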
Thanks @johnugeorge!
I finally succeeded in deploying Kubeflow Training Operator 1.6 based on https://github.com/kubeflow/training-operator/issues/1841#issuecomment-1635334868
Here is my script:
```shell
# Install Kubeflow Pipelines
export PIPELINE_VERSION=2.0.0
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
kubectl wait crd/applications.app.k8s.io --for=condition=established --timeout=60s
kubectl apply --kustomize="github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=${PIPELINE_VERSION}"

# Install Kubeflow Training Operator
# Steps are at https://github.com/kubeflow/training-operator/issues/1841#issuecomment-1635334868

# Create a PyTorch training job
kubectl create --filename=https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml
```
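To watch it run, something like this works (training.kubeflow.org/job-name is the label the operator puts on the pods it creates, if I understand the defaults correctly):

```shell
# Watch the job's overall status
kubectl get pytorchjob pytorch-simple --namespace=kubeflow --watch

# List the master/worker pods the operator created for this job
kubectl get pods --namespace=kubeflow --selector=training.kubeflow.org/job-name=pytorch-simple
```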
This demo PyTorch training job finished successfully.
However, the job is not listed in my Kubeflow Pipelines UI.
I feel this Kubeflow Training Operator is not connected with my Kubeflow Pipelines correctly. Any ideas? Thanks!
Also, I just want to confirm: "Kubeflow Pipelines" does not include "Kubeflow Training Operator", right? And they are supposed to be deployed individually?
No. Kubeflow Pipelines is an ML workflow orchestrator. It is up to you to decide the workflow graph. If you want to see a training job inside the Pipelines UI, you have to trigger the job within a pipeline experiment.
I see, thank you so much, @johnugeorge!
Demo machine learning code is at https://github.com/Hongbo-Miao/hongbomiao.com/pull/9807/files, and I can see it start to train and show up in the UI 😃
I deployed Kubeflow (including the Kubeflow Training Operator) in a local Kubernetes cluster by:
Then I deployed a training job by:
It gets stuck there forever.
Any ideas? Thanks! 😃