Closed xiaqunfeng closed 5 years ago
What is the API version of the CRD?
kubectl get crd pytorchjobs.kubeflow.org -o yaml
apiVersion: apiextensions.k8s.io/v1beta1
I think this is the problem. The CRD version looks old. Did you install a previous version before?
Can you try deleting the operator again? Make sure the CRD is deleted as well, then redeploy.
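For anyone following along, the delete-and-redeploy suggestion could look roughly like the sketch below. It assumes the ksonnet app and the default environment used in this thread. The run helper, the DRY_RUN variable, and the function name are hypothetical and only there so the commands are printed by default instead of executed.

```shell
#!/bin/sh
# Sketch of the delete-and-redeploy sequence; run from the ksonnet app directory.
# With DRY_RUN=1 (the default here) each command is only printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

redeploy_pytorch_operator() {
  # Remove the operator resources that `ks apply` created.
  run ks delete default -c pytorch-operator
  # Remove the CRD explicitly in case it is still present.
  # Deleting a CRD also deletes every PyTorchJob object in the cluster.
  run kubectl delete crd pytorchjobs.kubeflow.org --ignore-not-found
  # Deploy again from the current component definition.
  run ks apply default -c pytorch-operator
}

redeploy_pytorch_operator
```

Setting DRY_RUN=0 before running the script would execute the commands against the current kubectl context.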
ks init kubeflow-pytorch
cd kubeflow-pytorch
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply default -c pytorch-operator
This is strange. The CRD does not match the version in master: https://github.com/kubeflow/kubeflow/blob/master/kubeflow/pytorch-job/pytorch-operator.libsonnet#L48
ks show default -c pytorch-operator
By the way, we no longer use ksonnet. As of 0.6, we moved from ksonnet to kustomize: https://github.com/kubeflow/manifests/tree/master/pytorch-job
The output of ks show default -c pytorch-operator is as follows:
$ ks show default -c pytorch-operator
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  labels:
    ksonnet.io/component: pytorch-operator
  name: pytorchjobs.kubeflow.org
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: State
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kubeflow.org
  names:
    kind: PyTorchJob
    plural: pytorchjobs
    singular: pytorchjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            pytorchReplicaSpecs:
              properties:
                Master:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  version: v1
  versions:
  - name: v1
    served: true
    storage: true
  - name: v1beta2
    served: true
    storage: false
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/status
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - events
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pytorch-operator
subjects:
- kind: ServiceAccount
  name: pytorch-operator
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    name: pytorch-operator
  type: ClusterIP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: pytorch-operator
    spec:
      containers:
      - command:
        - /pytorch-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: local-image-rep/k8s/pytorch-operator:v1.0
        name: pytorch-operator
      serviceAccountName: pytorch-operator
It shows:
apiVersion: apiextensions.k8s.io/v1beta1, spec.version: v1
What is the difference between the CRD's apiVersion and its spec.versions (which you pointed to in pytorch-operator.libsonnet#L48)? Do they have to be the same?
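For readers with the same question: the top-level apiVersion names the API of the CustomResourceDefinition object itself (here apiextensions.k8s.io/v1beta1), while spec.version and spec.versions name the versions of the PyTorchJob resource that the CRD defines, so the two do not have to match. Below is a rough sketch that prints both from a CRD manifest on stdin. It assumes two-space-indented YAML with list items at the same indent as their parent key, as kubectl and ks emit. The function name summarize_crd is made up for illustration, and this is not a general YAML parser.

```shell
#!/bin/sh
# Sketch: print the two different "versions" found in a CRD manifest on stdin.
# Assumes two-space indentation with list items at the parent key's indent.
summarize_crd() {
  awk '
    # Top-level apiVersion: the API of the CustomResourceDefinition object.
    /^apiVersion:/              { print "CRD object API version: " $2 }
    # Entering spec.versions: versions of the custom resource being defined.
    /^  versions:/              { in_versions = 1; next }
    in_versions && /^  - name:/ { print "custom resource version: " $3; next }
    # Any other key at spec level ends the versions list.
    in_versions && /^  [^ -]/   { in_versions = 0 }
  '
}
```

On a live cluster the input could come from kubectl get crd pytorchjobs.kubeflow.org -o yaml piped into summarize_crd.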
I tried to deploy the pytorch operator with kustomize, but creating the CRD failed. Is there something wrong with the way I am using it?
kustomize/manifests/pytorch-job/pytorch-job-crds/base$ kubectl create -f crd.yaml
The CustomResourceDefinition "pytorchjobs.kubeflow.org" is invalid: spec.version: Required value
kustomize/manifests/pytorch-job/pytorch-operator/base$ kustomize build | tee outputfile.yaml
kustomize/manifests/pytorch-job/pytorch-operator/base$ kubectl create -f outputfile.yaml
serviceaccount "pytorch-operator" created
clusterrole.rbac.authorization.k8s.io "pytorch-operator" created
clusterrolebinding.rbac.authorization.k8s.io "pytorch-operator" created
configmap "pytorch-operator-config" created
service "pytorch-operator" created
deployment.extensions "pytorch-operator" created
$ kubectl get pod
NAME                                READY   STATUS             RESTARTS   AGE
pytorch-operator-847fbbc96b-rv87q   0/1     CrashLoopBackOff   5          3m
$ kubectl logs pytorch-operator-847fbbc96b-rv87q
...
{"filename":"app/server.go:200","level":"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)","time":"2019-08-16T09:58:56Z"}
It is the same error as before.
Which k8s cluster version are you using?
$ kubelet --version
Kubernetes v1.10.2
Support for multiple CRD versions requires k8s v1.11.0 or higher. Can you upgrade k8s and try again?
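As a quick illustration of that version requirement, here is a small sketch that compares a version string against the v1.11.0 minimum mentioned above. It assumes a sort that supports -V (version sort), as GNU coreutils does. The function names are hypothetical, and the two sample versions are the ones reported in this thread.

```shell
#!/bin/sh
# Succeeds when version $1 is at least version $2.
# Relies on `sort -V` (version sort), which GNU coreutils provides.
version_at_least() {
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
  [ "$lowest" = "$2" ]
}

# v1.11.0 is the minimum quoted in this thread for multiple CRD versions.
check_cluster() {
  if version_at_least "$1" "v1.11.0"; then
    echo "$1: multiple CRD versions supported"
  else
    echo "$1: too old for multiple CRD versions"
  fi
}

check_cluster "v1.10.2"
check_cluster "v1.14.4"
```

The version string passed in would normally come from the cluster itself, for example the value printed by kubelet --version.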
I redeployed it under k8s v1.14.4, and it worked. Thank you for your patient reply.
/close
Thanks @johnugeorge @xiaqunfeng
@gaocegege: Closing this issue.
I encountered the same problem on Kubernetes 1.16.6. I installed only pytorch-operator, but the operator fails with:
"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)"
Well, I figured it out. The image tag didn't match the CRD. The image tag should be vmaster-g047cf0f rather than v0.6.0.
The pytorch-operator pod goes into CrashLoopBackOff when I try to deploy it. The output is as follows:
The log shows this error:
I don't know what to do next to solve this problem. Any help would be appreciated.