redhat-developer / odo

odo - Developer-focused CLI for fast & iterative container-based application development on Podman and Kubernetes. Implementation of the open Devfile standard.
https://odo.dev
Apache License 2.0

odo push fails for storage provision on IBM Cloud openshift cluster #4944

Closed anandrkskd closed 3 years ago

anandrkskd commented 3 years ago

/kind failing-test

What versions of software are you using?

Operating System: All supported

Output of odo version: odo v2.2.3 (1579dd5be)

How did you run odo exactly?

odo create nodejs tbwdcd 
cp odoexampledir/devfile-with-volume-components.yaml ./devfile.yaml
odo storage list 
odo push  -v4
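The devfile-with-volume-components.yaml fixture is not reproduced here, but a devfile with two volume components along the following lines would produce the same setup (component names, mount paths, image, environment, and run command are taken from the odo output and pod description below; the volume sizes are illustrative):

```
schemaVersion: 2.0.0
metadata:
  name: test-devfile
components:
  - name: runtime
    container:
      image: quay.io/eclipse/che-nodejs10-ubi:nightly
      memoryLimit: 1Gi
      mountSources: true
      env:
        - name: FOO
          value: bar
      volumeMounts:
        - name: firstvol
          path: /data
        - name: secondvol
          path: /secondvol
  - name: runtime2
    container:
      image: quay.io/eclipse/che-nodejs10-ubi:nightly
      memoryLimit: 1Gi
      volumeMounts:
        - name: firstvol
          path: /data
        - name: secondvol
          path: /data2
  - name: firstvol
    volume:
      size: 3Gi
  - name: secondvol
    volume:
      size: 3Gi
commands:
  - id: run
    exec:
      component: runtime2
      commandLine: cat myfile.log
      workingDir: /data
      group:
        kind: run
        isDefault: true
```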

Actual behavior

```
odo push --context /tmp/035026500 -v4]
[odo] Validation
[odo]  •  Validating the devfile  ...
 ✓  Validating the devfile [172561ns]
[odo]
[odo] Updating services
[odo]  ✓  Services and Links are in sync with the cluster, no changes are required
[odo]
[odo] Creating Kubernetes resources for component tbwdcd
[odo]  ✓  Added storage secondvol to tbwdcd
[odo]  ✓  Added storage firstvol to tbwdcd
[odo]  •  Waiting for component to start  ...
...
[odo] I0726 16:14:08.566417  244796 deployments.go:188] Deployment Condition: {"type":"Available","status":"False","lastUpdateTime":"2021-07-26T10:44:08Z","lastTransitionTime":"2021-07-26T10:44:08Z","reason":"MinimumReplicasUnavailable","message":"Deployment does not have minimum availability."}
[odo] I0726 16:14:08.566471  244796 deployments.go:188] Deployment Condition: {"type":"Progressing","status":"True","lastUpdateTime":"2021-07-26T10:44:08Z","lastTransitionTime":"2021-07-26T10:44:08Z","reason":"ReplicaSetUpdated","message":"ReplicaSet \"tbwdcd-app-7cbdfb99d\" is progressing."}
[odo] I0726 16:14:08.566489  244796 deployments.go:199] Waiting for deployment "tbwdcd-app" rollout to finish: 0 of 1 updated replicas are available...
[odo] I0726 16:14:08.566504  244796 deployments.go:206] Waiting for deployment spec update to be observed...
[odo]  ✗  Waiting for component to start [5m]
[odo]  ✗  Failed to start component with name "tbwdcd". Error: Failed to create the component: error while waiting for deployment rollout: timeout while waiting for tbwdcd-app deployment roll out
```

Expected behavior

odo push should pass

Any logs, error output, etc?

```
oc describe pods -n storagetest
Name:           tbwdcd-app-54f9848585-9qcmr
Namespace:      storagetest
Priority:       0
Node:
Labels:         app=app
                app.kubernetes.io/instance=tbwdcd
                app.kubernetes.io/managed-by=odo
                app.kubernetes.io/managed-by-version=v2.2.3
                app.kubernetes.io/name=test-devfile
                app.kubernetes.io/part-of=app
                component=tbwdcd
                pod-template-hash=54f9848585
Annotations:    openshift.io/scc: restricted
Status:         Pending
IP:
IPs:
Controlled By:  ReplicaSet/tbwdcd-app-54f9848585
Init Containers:
  copy-supervisord:
    Image:      registry.access.redhat.com/ocp-tools-4/odo-init-container-rhel8:1.1.10
    Port:
    Host Port:
    Command:
      /usr/bin/cp
    Args:
      -r
      /opt/odo-init/.
      /opt/odo/
    Environment:
    Mounts:
      /opt/odo/ from odo-supervisord-shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z9chv (ro)
Containers:
  runtime:
    Image:      quay.io/eclipse/che-nodejs10-ubi:nightly
    Port:       3000/TCP
    Host Port:  0/TCP
    Limits:
      memory:  1Gi
    Requests:
      memory:  1Gi
    Environment:
      FOO:             bar
      PROJECTS_ROOT:   /projects
      PROJECT_SOURCE:  /projects
    Mounts:
      /data from firstvol-tbwdcd-app-vol (rw)
      /projects from odo-projects (rw)
      /secondvol from secondvol-tbwdcd-app-vol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z9chv (ro)
  runtime2:
    Image:      quay.io/eclipse/che-nodejs10-ubi:nightly
    Port:
    Host Port:
    Command:
      /opt/odo/bin/supervisord
    Args:
      -c
      /opt/odo/conf/devfile-supervisor.conf
    Limits:
      memory:  1Gi
    Requests:
      memory:  1Gi
    Environment:
      ODO_COMMAND_RUN:              cat myfile.log
      ODO_COMMAND_RUN_WORKING_DIR:  /data
    Mounts:
      /data from firstvol-tbwdcd-app-vol (rw)
      /data2 from secondvol-tbwdcd-app-vol (rw)
      /opt/odo/ from odo-supervisord-shared-data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-z9chv (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  firstvol-tbwdcd-app-vol:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  firstvol-tbwdcd-app
    ReadOnly:   false
  secondvol-tbwdcd-app-vol:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  secondvol-tbwdcd-app
    ReadOnly:   false
  odo-projects:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  odo-supervisord-shared-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:
  default-token-z9chv:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-z9chv
    Optional:    false
QoS Class:       Burstable
Node-Selectors:
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Warning  FailedScheduling   22m                   default-scheduler   0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling   22m                   default-scheduler   0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling   20m                   default-scheduler   0/9 nodes are available: 9 node(s) had volume node affinity conflict.
  Normal   NotTriggerScaleUp  2m40s (x21 over 22m)  cluster-autoscaler  pod didn't trigger scale-up:
```

Acceptance Criteria

kadel commented 3 years ago
```
Warning  FailedScheduling   22m   default-scheduler   0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
Warning  FailedScheduling   22m   default-scheduler   0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
Warning  FailedScheduling   20m   default-scheduler   0/9 nodes are available: 9 node(s) had volume node affinity conflict.
```

This suggests that something is wrong with the storage setup on the cluster.

What storage classes are configured on the cluster?
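The configured classes, and which one is the default, can be listed with:

```
oc get storageclass
```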

What happens if you try to create a simple deployment with a PVC like this?

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: test
  name: test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: test-pvc
      containers:
        - image: busybox
          name: busybox
          command:
            - "sleep"
            - "infinity"
          resources: {}
          volumeMounts:
            - name: data
              mountPath: /data
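(Assuming the two manifests above are saved to a single file, for example test-pvc.yaml, they can be applied with:)

```
kubectl apply -f test-pvc.yaml
```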

and then getting info about it:

kubectl describe deployments.apps test 

kubectl describe pvc test-pvc 
anandrkskd commented 3 years ago

What storage classes are configured on the cluster?

By default it's ibmc-vpc-block-10iops-tier; I think for storage on IBM Cloud we are using Block Storage for VPC.

What happens if you try to create a simple deployment with a PVC like this?

The simple deployment with the PVC you shared works fine; the pods were able to spin up.

kubectl describe deployments.apps test
Name:                   test
Namespace:              default
CreationTimestamp:      Wed, 28 Jul 2021 09:58:47 +0530
Labels:                 app=test
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               app=test
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  25% max unavailable, 25% max surge
Pod Template:
  Labels:  app=test
  Containers:
   busybox:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      infinity
    Environment:  <none>
    Mounts:
      /data from data (rw)
  Volumes:
   data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  test-pvc
    ReadOnly:   false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   test-9d46557cd (1/1 replicas created)
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  2m50s  deployment-controller  Scaled up replica set test-9d46557cd to 1
kubectl describe pvc test-pvc
Name:          test-pvc
Namespace:     default
StorageClass:  ibmc-vpc-block-10iops-tier
Status:        Bound
Volume:        pvc-74e4a5e0-1ab2-4c22-80f0-a07bc199c727
Labels:        <none>
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: vpc.block.csi.ibm.io
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      10Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       test-9d46557cd-kp7wn
Events:
  Type    Reason                 Age                From                                                                                      Message
  ----    ------                 ----               ----                                                                                      -------
  Normal  Provisioning           92s                vpc.block.csi.ibm.io_ibm-vpc-block-csi-controller-0_cf26be28-fd68-4e3c-9ca6-2bd0e898563b  External provisioner is provisioning volume for claim "default/test-pvc"
  Normal  ExternalProvisioning   19s (x6 over 92s)  persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "vpc.block.csi.ibm.io" or manually created by system administrator
  Normal  ProvisioningSucceeded  13s                vpc.block.csi.ibm.io_ibm-vpc-block-csi-controller-0_cf26be28-fd68-4e3c-9ca6-2bd0e898563b  Successfully provisioned volume pvc-74e4a5e0-1ab2-4c22-80f0-a07bc199c727

Whenever we use two PVs that are created in two different zones, the pods fail with the error volume node affinity conflict and do not spin up. For example:

-> oc get pv                                    
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                 STORAGECLASS                 REASON   AGE
pvc-0f9b8ffb-4661-42f7-a372-3a41e148669e   10Gi       RWO            Delete           Bound    cmd-devfile-storage-test228ubx/secondvol-awhmll-app   ibmc-vpc-block-10iops-tier            14h
pvc-22e0a37c-4f2b-4abc-a4cb-617b1f1eca14   10Gi       RWO            Delete           Bound    storage-test/firstvol-test-devfile-app                ibmc-vpc-block-10iops-tier            19h
pvc-3092082f-2b76-4e01-b433-af74815363ea   10Gi       RWO            Delete           Bound    cmd-devfile-storage-test228pxs/secondvol-xadwvi-app   ibmc-vpc-block-10iops-tier            14h
pvc-3c236748-4c64-45e2-8fe8-ebdff6431b54   10Gi       RWO            Delete           Bound    storage-test/secondvol-test-devfile-app               ibmc-vpc-block-10iops-tier            19h
pvc-74e4a5e0-1ab2-4c22-80f0-a07bc199c727   10Gi       RWO            Delete           Bound    default/test-pvc                                      ibmc-vpc-block-10iops-tier            15m
pvc-94beff15-e0f9-40b3-b52a-cc17e8a2c6c1   10Gi       RWO            Delete           Bound    cmd-devfile-storage-test228ubx/firstvol-awhmll-app    ibmc-vpc-block-10iops-tier            14h
pvc-a63a6c91-35c5-43fe-9e70-c4166ded7ac6   10Gi       RWO            Delete           Bound    cmd-devfile-storage-test228pxs/firstvol-xadwvi-app    ibmc-vpc-block-10iops-tier            14h

-> oc get pv pvc-0f9b8ffb-4661-42f7-a372-3a41e148669e -o yaml    
apiVersion: v1
kind: PersistentVolume
...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - eu-de
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - eu-de-1
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ibmc-vpc-block-10iops-tier
  volumeMode: Filesystem
status:
  phase: Bound

->  oc get pv pvc-94beff15-e0f9-40b3-b52a-cc17e8a2c6c1 -o yaml
apiVersion: v1
kind: PersistentVolume
...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - eu-de
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - eu-de-1
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ibmc-vpc-block-10iops-tier
  volumeMode: Filesystem
status:
  phase: Bound

and for the failing case it's:

oc get pv pvc-3092082f-2b76-4e01-b433-af74815363ea -o yaml
apiVersion: v1
kind: PersistentVolume
...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - eu-de
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - eu-de-3
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ibmc-vpc-block-10iops-tier
  volumeMode: Filesystem
status:
  phase: Bound

-> oc get pv pvc-a63a6c91-35c5-43fe-9e70-c4166ded7ac6 -o yaml
apiVersion: v1
kind: PersistentVolume
...
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - eu-de
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - eu-de-1
  persistentVolumeReclaimPolicy: Delete
  storageClassName: ibmc-vpc-block-10iops-tier
  volumeMode: Filesystem
status:
  phase: Bound
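Rather than dumping each PV one by one, the zones can be compared at a glance with something like this (a rough sketch; it assumes the zone key is always the second matchExpression, as in the outputs above):

```
for pv in $(oc get pv -o name); do
  # print each PV together with the zone recorded in its nodeAffinity
  zone=$(oc get "$pv" -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[1].values[0]}')
  echo "$pv -> $zone"
done
```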
kadel commented 3 years ago

Whenever we are using two pv that are created in two different zones then the pods are failing with error volume node affinity conflict and pods fail to spin up.

This looks like an infra issue.

anandrkskd commented 3 years ago

I was thinking that if we can create the cluster in one zone only, we can avoid this problem. I guess there should be an option to select only one zone when creating a cluster.

@rnapoles-rh does IBM Cloud give us such an option while creating a cluster?

dharmit commented 3 years ago

I was thinking that if we can create the cluster in one zone only, we can avoid this problem

This could be the solution that helps us fix it; I'm saying that based on this answer on Stack Overflow. We don't need to spread our cluster among zones for CI, do we? It's not a long running cluster anyway, right?
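For reference, another mitigation that often comes up for the volume node affinity conflict error, and not what we are doing here, is a StorageClass with volumeBindingMode: WaitForFirstConsumer, so PVCs are bound only after the pod is scheduled and all volumes land in the pod's zone. This is only a sketch, reusing the vpc.block.csi.ibm.io provisioner seen in the PVC events above; the class name is hypothetical and any extra parameters the IBM VPC Block CSI driver may need are not shown:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  # hypothetical name for a delayed-binding variant of the default class
  name: ibmc-vpc-block-10iops-tier-wffc
provisioner: vpc.block.csi.ibm.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
```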

anandrkskd commented 3 years ago

We don't need to spread our cluster among zones for CI, do we?

No, but I am currently not sure if IBM Cloud has any option to select a zone before creating a cluster.

It's not a long running cluster anyway, right?

No, it will be a long running cluster, just like the cluster we have on PSI.

rnapoles-rh commented 3 years ago

I was thinking that if we can create the cluster in one zone only, we can avoid this problem. I guess there should be an option to select only one zone when creating a cluster.

@rnapoles-rh does IBM Cloud give us such an option while creating a cluster?

@anandrkskd yes, we can have clusters using only one zone instead of three. The cluster named devtools-1-4vcpu-16gb-3w was provisioned using only one zone (eu-de-1). You should be able to see it in the IBM Cloud web UI under OpenShift clusters.
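For reference, a single-zone VPC cluster can be created from the IBM Cloud CLI roughly as follows; the flag names are from memory and the values are placeholders, so check `ibmcloud oc cluster create vpc-gen2 --help` for the exact syntax:

```
# all names and IDs below are placeholders
ibmcloud oc cluster create vpc-gen2 \
  --name <cluster-name> \
  --zone eu-de-1 \
  --flavor <worker-flavor> \
  --workers 3 \
  --vpc-id <vpc-id> \
  --subnet-id <subnet-id> \
  --cos-instance <cos-crn> \
  --version <openshift-version>
```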

anandrkskd commented 3 years ago

@rnapoles-rh Running 10 consecutive tests for the above scenario passed successfully on the cluster with only one zone.

rnapoles-rh commented 3 years ago

@rnapoles-rh Running 10 consecutive tests for the above scenario passed successfully on the cluster with only one zone.

@anandrkskd This is perfect, can we close this issue now?

anandrkskd commented 3 years ago

@rnapoles-rh yes, we can close this issue.

dharmit commented 3 years ago

@rnapoles-rh yes, we can close this issue.

Maybe close it as well?

/remove-triage ready
/triage support
/close

openshift-ci[bot] commented 3 years ago

@dharmit: Closing this issue.

In response to [this](https://github.com/openshift/odo/issues/4944#issuecomment-891851000):

> > @rnapoles-rh yes, we can close this issue.
>
> Maybe close it as well?
>
> /remove-triage ready
> /triage support
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.