kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Failed to deploy pytorch operator #206

Closed xiaqunfeng closed 5 years ago

xiaqunfeng commented 5 years ago

The pytorch-operator pod goes into CrashLoopBackOff when I try to deploy it. The output is as follows:

  1. crd info
    $ kubectl get crd
    NAME                           AGE
    pytorchjobs.kubeflow.org       8s
  2. create pytorch-operator
    $ ks apply default -c pytorch-operator
    INFO Applying customresourcedefinitions pytorchjobs.kubeflow.org
    INFO Creating non-existent customresourcedefinitions pytorchjobs.kubeflow.org
    INFO Applying serviceaccounts default.pytorch-operator
    INFO Creating non-existent serviceaccounts default.pytorch-operator
    INFO Applying clusterroles pytorch-operator
    INFO Creating non-existent clusterroles pytorch-operator
    INFO Applying clusterrolebindings pytorch-operator
    INFO Creating non-existent clusterrolebindings pytorch-operator
    INFO Applying services default.pytorch-operator
    INFO Creating non-existent services default.pytorch-operator
    INFO Applying deployments default.pytorch-operator
    INFO Creating non-existent deployments default.pytorch-operator
  3. get pod
    $ kubectl get pod | grep pytorch
    pytorch-operator-db5d78f97-b4tmx   0/1       CrashLoopBackOff   5          3m
  4. describe the pod
    $ kubectl describe pod pytorch-operator-db5d78f97-b4tmx
    ...
    Events:
    Type     Reason                 Age              From                                                Message
    ----     ------                 ----             ----                                                -------
    Normal   Scheduled              4m               default-scheduler                                   Successfully assigned pytorch-operator-db5d78f97-b4tmx to xxx
    Normal   SuccessfulMountVolume  4m               kubelet, xxx  MountVolume.SetUp succeeded for volume "pytorch-operator-token-7kzq6"
    Normal   Pulled                 3m (x5 over 4m)  kubelet, xxx  Container image "local-image-rep/pytorch-operator:vv" already present on machine
    Normal   Created                3m (x5 over 4m)  kubelet, xxx  Created container
    Normal   Started                3m (x5 over 4m)  kubelet, xxx
    Warning  BackOff                2m (x9 over 4m)  kubelet, xxx  Back-off restarting failed container
  5. check log
    $ kubectl logs pytorch-operator-db5d78f97-b4tmx
    {"filename":"app/server.go:73","level":"info","msg":"EnvKubeflowNamespace not set, use default namespace","time":"2019-08-15T13:16:38Z"}
    {"filename":"app/server.go:78","level":"info","msg":"[API Version: v1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.12 Go OS/Arch: linux/amd64]","time":"2019-08-15T13:16:38Z"}
    W0815 13:16:38.983656       1 client_config.go:552] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
    {"filename":"pytorch-operator.v1/main.go:33","level":"info","msg":"Setting up client for monitoring on port: 8443","time":"2019-08-15T13:16:38Z"}
    {"filename":"app/server.go:200","level":"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)","time":"2019-08-15T13:16:38Z"}
    {"filename":"app/server.go:102","level":"info","msg":"CRD doesn't exist. Exiting","time":"2019-08-15T13:16:38Z"}

The log shows this error:

"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)"

I don't know what I can do next to solve this problem. Any help would be appreciated.
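
For longer logs, a small filter can pull out only the error-level entries. This is a minimal sketch, assuming the operator keeps emitting compact one-line JSON records like the ones above; operator_errors is a hypothetical helper name, not part of kubectl.

```shell
# operator_errors: hypothetical helper that keeps only error-level log records.
# Assumes compact JSON with no spaces around the colon, as in the log above.
# Non-JSON lines (such as the W0815 client_config warning) are dropped too.
operator_errors() {
    grep -F '"level":"error"'
}

# Intended use (pod name taken from the output above):
#   kubectl logs pytorch-operator-db5d78f97-b4tmx | operator_errors
```

The fixed-string match (-F) avoids treating the quotes and colon as a regular expression; a JSON-aware tool would be more robust if the log format changes.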

johnugeorge commented 5 years ago

What is the API version?

kubectl get crd pytorchjobs.kubeflow.org -o yaml

xiaqunfeng commented 5 years ago

> What is the API version?
>
> kubectl get crd pytorchjobs.kubeflow.org -o yaml

apiVersion: apiextensions.k8s.io/v1beta1

johnugeorge commented 5 years ago

I think this is the problem. The CRD version looks old. Did you install a previous version before?

Can you try deleting the operator again? Make sure the CRD is deleted as well, then redeploy.

xiaqunfeng commented 5 years ago

> I think this is the problem. The CRD version looks old. Did you install a previous version before?
>
> Can you try deleting the operator again? Make sure the CRD is deleted as well, then redeploy.

  1. I haven't installed a previous version before.
  2. I deleted the operator, and the CRD was deleted too. When I redeployed, I got the same error. My deploy command pipeline:
    ks init kubeflow-pytorch
    cd kubeflow-pytorch
    ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
    ks pkg install kubeflow/pytorch-job
    ks generate pytorch-operator pytorch-operator
    ks apply default -c pytorch-operator

johnugeorge commented 5 years ago

This is strange. The CRD does not match the version in master. https://github.com/kubeflow/kubeflow/blob/master/kubeflow/pytorch-job/pytorch-operator.libsonnet#L48

ks show default -c pytorch-operator

Btw, we are not using ksonnet anymore. From 0.6, we moved from ksonnet to kustomize. https://github.com/kubeflow/manifests/tree/master/pytorch-job

xiaqunfeng commented 5 years ago

> This is strange. The CRD does not match the version in master. kubeflow/kubeflow:kubeflow/pytorch-job/pytorch-operator.libsonnet@master#L48
>
> ks show default -c pytorch-operator

The output of ks show default -c pytorch-operator is as follows:

$ ks show default -c pytorch-operator
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  labels:
    ksonnet.io/component: pytorch-operator
  name: pytorchjobs.kubeflow.org
spec:
  additionalPrinterColumns:
  - JSONPath: .status.conditions[-1:].type
    name: State
    type: string
  - JSONPath: .metadata.creationTimestamp
    name: Age
    type: date
  group: kubeflow.org
  names:
    kind: PyTorchJob
    plural: pytorchjobs
    singular: pytorchjob
  scope: Namespaced
  subresources:
    status: {}
  validation:
    openAPIV3Schema:
      properties:
        spec:
          properties:
            pytorchReplicaSpecs:
              properties:
                Master:
                  properties:
                    replicas:
                      maximum: 1
                      minimum: 1
                      type: integer
                Worker:
                  properties:
                    replicas:
                      minimum: 1
                      type: integer
  version: v1
  versions:
  - name: v1
    served: true
    storage: true
  - name: v1beta2
    served: true
    storage: false
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
rules:
- apiGroups:
  - kubeflow.org
  resources:
  - pytorchjobs
  - pytorchjobs/status
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - events
  verbs:
  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pytorch-operator
subjects:
- kind: ServiceAccount
  name: pytorch-operator
  namespace: default
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8443"
    prometheus.io/scrape: "true"
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
spec:
  ports:
  - name: monitoring-port
    port: 8443
    targetPort: 8443
  selector:
    name: pytorch-operator
  type: ClusterIP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: pytorch-operator
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    ksonnet.io/component: pytorch-operator
  name: pytorch-operator
  namespace: default
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: pytorch-operator
    spec:
      containers:
      - command:
        - /pytorch-operator.v1
        - --alsologtostderr
        - -v=1
        - --monitoring-port=8443
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: local-image-rep/k8s/pytorch-operator:v1.0
        name: pytorch-operator
      serviceAccountName: pytorch-operator

It shows that:

apiVersion: apiextensions.k8s.io/v1beta1
spec->version: v1

What's the difference between crd->apiVersion and crd->spec->versions (as you mentioned in pytorch-operator.libsonnet#L48)? Do they have to be the same?

xiaqunfeng commented 5 years ago

> Btw, we are not using ksonnet anymore. From 0.6, we moved from ksonnet to kustomize. kubeflow/manifests:pytorch-job@master

I tried to deploy the pytorch operator with kustomize, but creating the CRD failed. Is there something wrong with the way I am using it?

  1. create crd
    kustomize/manifests/pytorch-job/pytorch-job-crds/base$ kubectl create -f crd.yaml
    The CustomResourceDefinition "pytorchjobs.kubeflow.org" is invalid: spec.version: Required value
  2. create pytorch operator
    kustomize/manifests/pytorch-job/pytorch-operator/base$ kustomize build | tee outputfile.yaml
    kustomize/manifests/pytorch-job/pytorch-operator/base$ kubectl create -f outputfile.yaml
    serviceaccount "pytorch-operator" created
    clusterrole.rbac.authorization.k8s.io "pytorch-operator" created
    clusterrolebinding.rbac.authorization.k8s.io "pytorch-operator" created
    configmap "pytorch-operator-config" created
    service "pytorch-operator" created
    deployment.extensions "pytorch-operator" created
    $ kubectl get pod
    NAME                                READY     STATUS             RESTARTS   AGE
    pytorch-operator-847fbbc96b-rv87q   0/1       CrashLoopBackOff   5          3m
    $ kubectl logs pytorch-operator-847fbbc96b-rv87q
    ...
    {"filename":"app/server.go:200","level":"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)","time":"2019-08-16T09:58:56Z"}

The same error.

johnugeorge commented 5 years ago

Which k8s cluster version are you using?

xiaqunfeng commented 5 years ago

> Which k8s cluster version are you using?

$ kubelet --version
Kubernetes v1.10.2

johnugeorge commented 5 years ago

Support for multiple CRD versions requires k8s v1.11.0 or higher. Can you upgrade k8s and try again?
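
A quick check against that threshold could look like the sketch below. It assumes the version string has the form printed by kubelet --version (for example v1.10.2); supports_multi_version_crd is a hypothetical helper name, not a kubectl or kubelet command.

```shell
# supports_multi_version_crd: hypothetical helper.
# Succeeds (exit status 0) when the given Kubernetes version is v1.11.0 or
# newer, the minimum stated above for multiple CRD versions.
supports_multi_version_crd() {
    version="${1#v}"            # strip the leading "v"
    major="${version%%.*}"      # text before the first dot
    rest="${version#*.}"        # text after the first dot
    minor="${rest%%.*}"         # text before the second dot
    minor="${minor%%[!0-9]*}"   # drop a non-numeric suffix such as "+"
    [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 11 ]; }
}

# Intended use on a node ("kubelet --version" prints "Kubernetes v1.10.2"):
#   supports_multi_version_crd "$(kubelet --version | awk '{print $2}')" \
#       || echo "cluster too old for multiple CRD versions"
```

Only the major and minor components are compared, since the threshold is a .0 patch release.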

xiaqunfeng commented 5 years ago

> Support for multiple CRD versions requires k8s v1.11.0 or higher. Can you upgrade k8s and try again?

I redeployed it under k8s v1.14.4, and it worked. Thank you for your patient reply.

gaocegege commented 5 years ago

/close

Thanks

@johnugeorge @xiaqunfeng

k8s-ci-robot commented 5 years ago

@gaocegege: Closing this issue.

lmxia commented 4 years ago

I encountered the same problem on Kubernetes 1.16.6. I installed only pytorch-operator, but the operator fails with:

"error","msg":"the server could not find the requested resource (get pytorchjobs.kubeflow.org)"

lmxia commented 4 years ago

Well, I figured it out. The image tag didn't match the CRD. The image tag should be vmaster-g047cf0f rather than v0.6.0.
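
To see which tag a running operator actually uses, the image reference can be read from the deployment and its tag split off. This is a minimal sketch, assuming the deployment is named pytorch-operator as in the manifests earlier in this thread; image_tag is a hypothetical helper name, and it does not handle digest references (name@sha256:...).

```shell
# image_tag: hypothetical helper that prints the tag of an image reference.
# An untagged reference is reported as "latest", the default tag.
image_tag() {
    ref="$1"
    last="${ref##*/}"    # final path component, so a registry port is ignored
    case "$last" in
        *:*) printf '%s\n' "${last##*:}" ;;
        *)   printf 'latest\n' ;;
    esac
}

# Intended use:
#   image_tag "$(kubectl get deployment pytorch-operator \
#       -o jsonpath='{.spec.template.spec.containers[0].image}')"
```

The printed tag can then be compared by hand with the one the CRD manifests were written for.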