Failed to deploy pytorch operator

xiaqunfeng commented 5 years ago

The pytorch-operator pod goes into CrashLoopBackOff when I try to deploy it. The output is as follows:

crd info

$ kubectl get crd
NAME                           AGE
pytorchjobs.kubeflow.org       8s

create pytorch-operator

$ ks apply default -c pytorch-operator
INFO Applying customresourcedefinitions pytorchjobs.kubeflow.org
INFO Creating non-existent customresourcedefinitions pytorchjobs.kubeflow.org
INFO Applying serviceaccounts default.pytorch-operator
INFO Creating non-existent serviceaccounts default.pytorch-operator
INFO Applying clusterroles pytorch-operator
INFO Creating non-existent clusterroles pytorch-operator
INFO Applying clusterrolebindings pytorch-operator
INFO Creating non-existent clusterrolebindings pytorch-operator
INFO Applying services default.pytorch-operator
INFO Creating non-existent services default.pytorch-operator
INFO Applying deployments default.pytorch-operator
INFO Creating non-existent deployments default.pytorch-operator

get pod

$ kubectl get pod | grep pytorch
pytorch-operator-db5d78f97-b4tmx   0/1       CrashLoopBackOff   5          3m

check pod describe

$ kubectl describe pod pytorch-operator-db5d78f97-b4tmx
...
Events:
Type     Reason                 Age              From                                                Message
----     ------                 ----             ----                                                -------
Normal   Scheduled              4m               default-scheduler                                   Successfully assigned pytorch-operator-db5d78f97-b4tmx to xxx
Normal   SuccessfulMountVolume  4m               kubelet, xxx  MountVolume.SetUp succeeded for volume "pytorch-operator-token-7kzq6"
Normal   Pulled                 3m (x5 over 4m)  kubelet, xxx  Container image "local-image-rep/pytorch-operator:vv" already present on machine
Normal   Created                3m (x5 over 4m)  kubelet, xxx  Created container
Normal   Started                3m (x5 over 4m)  kubelet, xxx
Warning  BackOff                2m (x9 over 4m)  kubelet, xxx  Back-off restarting failed container

check log

$ kubectl logs pytorch-operator-db5d78f97-b4tmx
{"filename":"app/server.go:73","level":"info","msg":"EnvKubeflowNamespace not set, use default namespace","time":"2019-08-15T13:16:38Z"}
{"filename":"app/server.go:78","level":"info","msg":"[API Version: v1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.12 Go OS/Arch: linux/amd64]","time":"2019-08-15T13:16:38Z"}
W0815 13:16:38.983656       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"filename":"pytorch-operator.v1/main.go:33","level":"info","msg":"Setting up client for monitoring on port: 8443","time":"2019-08-15T13:16:38Z"}
{"filename":"app/server.go:200","level":"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)","time":"2019-08-15T13:16:38Z"}
{"filename":"app/server.go:102","level":"info","msg":"CRD doesn't exist. Exiting","time":"2019-08-15T13:16:38Z"}

The log contains this error:

"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)"

I don't know what to do next to solve this problem. I'm looking forward to any help.
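When the operator log is long, the relevant entry can be pulled out mechanically. The following is a minimal Python sketch, assuming the log format shown above (one JSON object per line, mixed with klog-style lines such as the "W0815 ..." warning). The helper name error_messages and the SAMPLE constant are hypothetical and are not part of pytorch-operator.

```python
import json


def error_messages(log_text):
    """Return the "msg" field of every JSON log line whose level is "error"."""
    messages = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip klog-style lines such as "W0815 ..."
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed JSON lines
        if entry.get("level") == "error":
            messages.append(entry.get("msg", ""))
    return messages


# Three lines copied from the log above (hypothetical constant name).
SAMPLE = """
W0815 13:16:38.983656       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"filename":"app/server.go:200","level":"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)","time":"2019-08-15T13:16:38Z"}
{"filename":"app/server.go:102","level":"info","msg":"CRD doesn't exist. Exiting","time":"2019-08-15T13:16:38Z"}
"""

if __name__ == "__main__":
    for message in error_messages(SAMPLE):
        print(message)
```

In practice the text would come from kubectl logs for the failing pod rather than from an embedded sample.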

johnugeorge commented 5 years ago

What is the API version?

kubectl get crd pytorchjobs.kubeflow.org -o yaml

xiaqunfeng commented 5 years ago

> What is the API version?
>
> kubectl get crd pytorchjobs.kubeflow.org -o yaml

apiVersion: apiextensions.k8s.io/v1beta1

johnugeorge commented 5 years ago

I think this is the problem. The CRD version looks old. Did you install a previous version before?

Can you try deleting the operator again? Ensure that the CRD is also deleted, then redeploy.

xiaqunfeng commented 5 years ago

> I think this is the problem. The CRD version looks old. Did you install a previous version before?
>
> Can you try deleting the operator again? Ensure that the CRD is also deleted, then redeploy.

I haven't installed a previous version.

I deleted the operator, and the CRD was deleted too. When I redeployed, I got the same error. My deploy command pipeline:

ks init kubeflow-pytorch
cd kubeflow-pytorch
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
ks apply default -c pytorch-operator

johnugeorge commented 5 years ago

This is strange. The CRD does not match the version in master. https://github.com/kubeflow/kubeflow/blob/master/kubeflow/pytorch-job/pytorch-operator.libsonnet#L48

ks show default -c pytorch-operator

Btw, we are not using ksonnet anymore. From 0.6, we moved from ksonnet to kustomize. https://github.com/kubeflow/manifests/tree/master/pytorch-job

xiaqunfeng commented 5 years ago

> This is strange. The CRD does not match the version in master. kubeflow/kubeflow:kubeflow/pytorch-job/pytorch-operator.libsonnet@master#L48
>
> ks show default -c pytorch-operator

The output of ks show default -c pytorch-operator is as follows:

$ ks show default -c pytorch-operator
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  labels:
    ksonnet.io/component: pytorch-operator
  name: pytorchjobs.kubeflow.org
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: State
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kubeflow.org
  names:
    kind: PyTorchJob
    plural: pytorchjobs
    singular: pytorchjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            pytorchReplicaSpecs:
              properties:
                Master:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  version: v1
  versions:
  - name: v1
    served: true
    storage: true
  - name: v1beta2
    served: true
    storage: false
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/status
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - events
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pytorch-operator
subjects:
- kind: ServiceAccount
  name: pytorch-operator
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    name: pytorch-operator
  type: ClusterIP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: pytorch-operator
    spec:
      containers:
      - command:
        - /pytorch-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: local-image-rep/k8s/pytorch-operator:v1.0
        name: pytorch-operator
      serviceAccountName: pytorch-operator

It shows that:

apiVersion: apiextensions.k8s.io/v1beta1
spec->version: v1

What is the difference between crd->apiVersion and crd->spec->versions (as you mentioned in pytorch-operator.libsonnet#L48)? Do they have to be the same?
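For reference on this question: the top-level apiVersion names the API that defines the CustomResourceDefinition object itself, while spec.version and spec.versions name the versions of the PyTorchJob resource that the CRD declares, so the two do not have to match. A minimal Python sketch of that distinction follows, using the manifest values from the ks show output above. The helper describe_crd and the CRD dictionary are hypothetical illustrations, not part of any Kubeflow tooling.

```python
def describe_crd(manifest):
    """Separate the CRD object's own API version from the versions it declares."""
    spec = manifest["spec"]
    group = spec["group"]
    # Newer manifests list spec.versions; older ones carry a single spec.version.
    names = [v["name"] for v in spec.get("versions", []) if v.get("served", True)]
    if not names and "version" in spec:
        names = [spec["version"]]
    return {
        # API used to create the CustomResourceDefinition object itself.
        "crd_object_api": manifest["apiVersion"],
        # apiVersion values that PyTorchJob objects would use.
        "custom_resource_apis": [f"{group}/{name}" for name in names],
    }


# Trimmed copy of the CRD from the output above.
CRD = {
    "apiVersion": "apiextensions.k8s.io/v1beta1",
    "kind": "CustomResourceDefinition",
    "spec": {
        "group": "kubeflow.org",
        "version": "v1",
        "versions": [
            {"name": "v1", "served": True, "storage": True},
            {"name": "v1beta2", "served": True, "storage": False},
        ],
    },
}

if __name__ == "__main__":
    print(describe_crd(CRD))
```

Here a PyTorchJob would be written with apiVersion kubeflow.org/v1 (or kubeflow.org/v1beta2), independent of the apiextensions.k8s.io/v1beta1 used for the CRD object.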

xiaqunfeng commented 5 years ago

> Btw, we are not using ksonnet anymore. From 0.6, we moved from ksonnet to kustomize. kubeflow/manifests:pytorch-job@master

I tried to deploy the pytorch operator with kustomize, but creating the CRD failed. Is there something wrong with the way I am using it?

create crd

kustomize/manifests/pytorch-job/pytorch-job-crds/base$ kubectl create -f crd.yaml
The CustomResourceDefinition "pytorchjobs.kubeflow.org" is invalid: spec.version: Required value

create pytorch operator


kustomize/manifests/pytorch-job/pytorch-operator/base$ kustomize build | tee outputfile.yaml

kustomize/manifests/pytorch-job/pytorch-operator/base$ kubectl create -f outputfile.yaml
serviceaccount "pytorch-operator" created
clusterrole.rbac.authorization.k8s.io "pytorch-operator" created
clusterrolebinding.rbac.authorization.k8s.io "pytorch-operator" created
configmap "pytorch-operator-config" created
service "pytorch-operator" created
deployment.extensions "pytorch-operator" created

$ kubectl get pod
NAME                                READY     STATUS             RESTARTS   AGE
pytorch-operator-847fbbc96b-rv87q   0/1       CrashLoopBackOff   5          3m

$ kubectl logs pytorch-operator-847fbbc96b-rv87q
...
{"filename":"app/server.go:200","level":"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)","time":"2019-08-16T09:58:56Z"}


the same error.

johnugeorge commented 5 years ago

Which k8s cluster version are you using?

xiaqunfeng commented 5 years ago

> Which k8s cluster version are you using?

$ kubelet --version
Kubernetes v1.10.2

johnugeorge commented 5 years ago

Support for multiple CRD versions requires k8s v1.11.0 or higher. Can you upgrade k8s and try?
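The version requirement can be checked before deploying. The following is a minimal Python sketch that parses a version string such as the kubelet --version output and compares it with the v1.11 minimum stated above. The function names and the MIN_MULTI_VERSION constant are hypothetical, and the check assumes the reported version reflects the cluster's API server (kubelet reports the node's version, which may differ).

```python
import re

# Minimum (major, minor) for multiple CRD versions, per the comment above.
MIN_MULTI_VERSION = (1, 11)


def parse_k8s_version(text):
    """Extract (major, minor) from strings like "Kubernetes v1.10.2" or "v1.14.4"."""
    match = re.search(r"v?(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no Kubernetes version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def supports_multi_version_crds(version_text):
    """Return True if the version is at least the stated v1.11 minimum."""
    return parse_k8s_version(version_text) >= MIN_MULTI_VERSION


if __name__ == "__main__":
    for reported in ("Kubernetes v1.10.2", "Kubernetes v1.14.4"):
        print(reported, supports_multi_version_crds(reported))
```

Comparing (major, minor) tuples rather than raw strings avoids the lexicographic trap where "1.9" would sort after "1.11".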

xiaqunfeng commented 5 years ago

> Support for multiple CRD versions requires k8s v1.11.0 or higher. Can you upgrade k8s and try?

I redeployed it under k8s v1.14.4, and it worked. Thank you for your patient reply.

gaocegege commented 5 years ago

/close

Thanks

@johnugeorge @xiaqunfeng

k8s-ci-robot commented 5 years ago

@gaocegege: Closing this issue.

In response to [this](https://github.com/kubeflow/pytorch-operator/issues/206#issuecomment-522501836):

>/close
>
>Thanks
>
>@johnugeorge @xiaqunfeng

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

lmxia commented 4 years ago

I encountered the same problem on Kubernetes 1.16.6. I installed only pytorch-operator, but the operator fails with:

"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)"

lmxia commented 4 years ago

Well, I figured it out. The image tag didn't match the CRD. The image tag should be vmaster-g047cf0f rather than v0.6.0.
