kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/
Apache License 2.0

aws-ebs-csi driver creates pvc in unexpected aws region us-west-2b (expected: us-west-2a) #1443

Closed dmitry-mightydevops closed 1 year ago

dmitry-mightydevops commented 1 year ago

/kind bug

What happened?

I have a node labeled etcd=true in us-west-2a, and I am running etcd as a StatefulSet in EKS 1.21. I provisioned the AWS EBS CSI driver via its Helm chart (values are in the "helm chart" section below).

However, I always get the PVC/PV in the us-west-2b zone: AccessibleTopology:[segments:<key:"topology.ebs.csi.aws.com/zone" value:"us-west-2b" >

As a result, my etcd-0 pod is always stuck in the Pending state.

➜ kdno -l etcd                                  
Name:               ip-10-110-2-122.us-west-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=t3.medium
                    beta.kubernetes.io/os=linux
                    databases=true
                    efs=true
                    etcd=true
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2a
                    k8s.io/cloud-provider-aws=a5c5e390134d813e05190dc61b3f53b6
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-110-2-122.us-west-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=t3.medium
                    node.kubernetes.io/lifecycle=on-demand
                    topology.ebs.csi.aws.com/zone=us-west-2a
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2a
                    workload=databases
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0b24bb6849f331fd7"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 02 Nov 2022 16:45:36 -0500
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-110-2-122.us-west-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Sun, 06 Nov 2022 16:54:56 -0600
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Sun, 06 Nov 2022 16:50:34 -0600   Wed, 02 Nov 2022 16:45:36 -0500   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Sun, 06 Nov 2022 16:50:34 -0600   Wed, 02 Nov 2022 16:45:36 -0500   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Sun, 06 Nov 2022 16:50:34 -0600   Wed, 02 Nov 2022 16:45:36 -0500   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Sun, 06 Nov 2022 16:50:34 -0600   Wed, 02 Nov 2022 16:46:16 -0500   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.110.2.122
  Hostname:     ip-10-110-2-122.us-west-2.compute.internal
  InternalDNS:  ip-10-110-2-122.us-west-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         2
  ephemeral-storage:           157274092Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3965424Ki
  pods:                        17
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         1930m
  ephemeral-storage:           143870061124
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      3410416Ki
  pods:                        17
System Info:
  Machine ID:                 ec20c0dbb3d85aae2b130c3e87f7cd60
  System UUID:                ec20c0db-b3d8-5aae-2b13-0c3e87f7cd60
  Boot ID:                    0ea8d0b4-6483-41bc-aafe-51cf057a8ce3
  Kernel Version:             5.4.217-126.408.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.17
  Kubelet Version:            v1.21.14-eks-ba74326
  Kube-Proxy Version:         v1.21.14-eks-ba74326
ProviderID:                   aws:///us-west-2a/i-0b24bb6849f331fd7
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                  ------------  ----------  ---------------  -------------  ---
  influxdb                    influxdb-influxdb2-0                  500m (25%)    1 (51%)     500Mi (15%)      1Gi (30%)      3d23h
  kube-system                 aws-node-nxp7g                        25m (1%)      0 (0%)      0 (0%)           0 (0%)         3d2h
  kube-system                 coredns-85d5b4454c-ltv5p              100m (5%)     0 (0%)      70Mi (2%)        170Mi (5%)     4d1h
  kube-system                 ebs-csi-node-xfswb                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13m
  kube-system                 kube-proxy-mw4fh                      100m (5%)     0 (0%)      0 (0%)           0 (0%)         4d1h
  monitoring                  promtail-hk4jj                        100m (5%)     512m (26%)  128Mi (3%)       512Mi (15%)    2d20h
  prometheus                  prometheus-node-exporter-dfczh        0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d20h
  staging                     staging-jupyter-v1-dd84f5bb8-xxmvm    0 (0%)        1 (51%)     0 (0%)           2Gi (61%)      2d
  teleport-cluster            teleport-775d4574d7-mtdh7             0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d1h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         825m (42%)   2512m (130%)
  memory                      698Mi (20%)  3754Mi (112%)
  ephemeral-storage           0 (0%)       0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:                       <none>

kdpo etcd-0

Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  4m26s (x2 over 4m32s)  default-scheduler  0/9 nodes are available: 9 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling  7s (x5 over 4m23s)     default-scheduler  0/9 nodes are available: 1 node(s) had volume node affinity conflict, 8 node(s) didn't match Pod's node affinity/selector.

kd pvc data-etcd-0                            
Name:          data-etcd-0
Namespace:     etcd
StorageClass:  etcd
Status:        Bound
Volume:        pvc-99f699f7-83bd-474e-be72-402b2a2dc77c
Labels:        app.kubernetes.io/instance=etcd
               app.kubernetes.io/name=etcd
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: ebs.csi.aws.com
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      100Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Used By:       etcd-0
Events:
  Type    Reason                 Age                  From                                                                                      Message
  ----    ------                 ----                 ----                                                                                      -------
  Normal  ExternalProvisioning   5m4s (x2 over 5m4s)  persistentvolume-controller                                                               waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  Normal  Provisioning           5m4s                 ebs.csi.aws.com_ebs-csi-controller-7b59d4c568-bx7rz_994a65f4-6811-41bd-839b-437511b6ee50  External provisioner is provisioning volume for claim "etcd/data-etcd-0"
  Normal  ProvisioningSucceeded  4m58s                ebs.csi.aws.com_ebs-csi-controller-7b59d4c568-bx7rz_994a65f4-6811-41bd-839b-437511b6ee50  Successfully provisioned volume pvc-99f699f7-83bd-474e-be72-402b2a2dc77c

➜ kd pv pvc-99f699f7-83bd-474e-be72-402b2a2dc77c
Name:              pvc-99f699f7-83bd-474e-be72-402b2a2dc77c
Labels:            <none>
Annotations:       pv.kubernetes.io/provisioned-by: ebs.csi.aws.com
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      etcd
Status:            Bound
Claim:             etcd/data-etcd-0
Reclaim Policy:    Retain
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          100Gi
Node Affinity:     
  Required Terms:  
    Term 0:        topology.ebs.csi.aws.com/zone in [us-west-2b]
Message:           
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            ebs.csi.aws.com
    FSType:            ext4
    VolumeHandle:      vol-0ea10f98f0d754053
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1667774479582-8081-ebs.csi.aws.com
Events:                <none>

➜ kd sc etcd                                   
Name:            etcd
IsDefaultClass:  No
Annotations:     kubectl.kubernetes.io/last-applied-configuration={"allowVolumeExpansion":true,"allowedTopologies":[{"matchLabelExpressions":[{"key":"topology.ebs.csi.aws.com/zone","values":["us-west-2a","us-west-2b","us-west-2c"]}]}],"apiVersion":"storage.k8s.io/v1","kind":"StorageClass","metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"false"},"labels":{"argocd.argoproj.io/instance":"aws-ebs-csi-driver"},"name":"etcd"},"parameters":{"csi.storage.k8s.io/fstype":"ext4","iopsPerGB":"50","tagSpecification_1":"environment=staging","type":"io1"},"provisioner":"ebs.csi.aws.com","reclaimPolicy":"Retain","volumeBindingMode":"Immediate"}
,storageclass.kubernetes.io/is-default-class=false
Provisioner:           ebs.csi.aws.com
Parameters:            csi.storage.k8s.io/fstype=ext4,iopsPerGB=50,tagSpecification_1=environment=staging,type=io1
AllowVolumeExpansion:  True
MountOptions:          <none>
ReclaimPolicy:         Retain
VolumeBindingMode:     Immediate
AllowedTopologies:     
  Term 0:              topology.ebs.csi.aws.com/zone in [us-west-2a, us-west-2b, us-west-2c]
Events:                <none>

Logs from the EBS CSI controller and node pods:

kube-system/ebs-csi-node-xfswb[ebs-plugin]: I1106 22:46:46.171364       1 node.go:454] NodeGetVolumeStats: called with args {VolumeId:vol-043af2d4d87d50674 VolumePath:/var/lib/kubelet/pods/8b4de6ba-3d9d-4da0-a732-27ed069a1534/volumes/kubernetes.io~csi/pvc-547b32a4-653c-4c6a-875e-60a47a0665c7/mount StagingTargetPath: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-resizer]: I1106 22:46:58.037938       1 controller.go:295] Started PVC processing "etcd/data-etcd-0"
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-resizer]: I1106 22:46:58.037962       1 controller.go:318] PV bound to PVC "etcd/data-etcd-0" is not created yet
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:46:58.068985       1 controller.go:1337] provision "etcd/data-etcd-0" class "etcd": started
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:46:58.069592       1 controller.go:528] skip translation of storage class for plugin: ebs.csi.aws.com
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:46:58.070454       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"etcd", Name:"data-etcd-0", UID:"99f699f7-83bd-474e-be72-402b2a2dc77c", APIVersion:"v1", ResourceVersion:"407986343", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "etcd/data-etcd-0"
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:04.428929       1 controller.go:774] create volume rep: {CapacityBytes:107374182400 VolumeId:vol-0ea10f98f0d754053 VolumeContext:map[] ContentSource:<nil> AccessibleTopology:[segments:<key:"topology.ebs.csi.aws.com/zone" value:"us-west-2b" > ] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:04.429014       1 controller.go:858] successfully created PV pvc-99f699f7-83bd-474e-be72-402b2a2dc77c for PVC data-etcd-0 and csi volume name vol-0ea10f98f0d754053
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:04.429210       1 controller.go:1442] provision "etcd/data-etcd-0" class "etcd": volume "pvc-99f699f7-83bd-474e-be72-402b2a2dc77c" provisioned
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:04.429225       1 controller.go:1455] provision "etcd/data-etcd-0" class "etcd": succeeded
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:04.445959       1 event.go:285] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"etcd", Name:"data-etcd-0", UID:"99f699f7-83bd-474e-be72-402b2a2dc77c", APIVersion:"v1", ResourceVersion:"407986343", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-99f699f7-83bd-474e-be72-402b2a2dc77c
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-resizer]: I1106 22:47:04.478422       1 controller.go:295] Started PVC processing "etcd/data-etcd-0"
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-resizer]: I1106 22:47:04.478465       1 controller.go:343] No need to resize PVC "etcd/data-etcd-0"
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:13.497894       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.CSINode total 15 items received
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-attacher]: I1106 22:47:14.703161       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.CSINode total 16 items received
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:23.487257       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.Node total 24 items received
kube-system/ebs-csi-node-5km2w[ebs-plugin]: I1106 22:47:31.957518       1 node.go:517] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
kube-system/ebs-csi-node-5km2w[ebs-plugin]: I1106 22:47:31.958606       1 node.go:454] NodeGetVolumeStats: called with args {VolumeId:vol-0736c0f1f93069570 VolumePath:/var/lib/kubelet/pods/d03d9ee9-25bc-4a14-a0cf-51ad85f895e3/volumes/kubernetes.io~csi/pvc-d8f4b13b-31bf-4fa1-9535-8a790d5fd1a9/mount StagingTargetPath: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:40.481667       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.StorageClass total 7 items received
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-attacher]: I1106 22:47:41.709670       1 reflector.go:536] k8s.io/client-go/informers/factory.go:134: Watch close - *v1.PersistentVolume total 12 items received
kube-system/ebs-csi-node-xfswb[ebs-plugin]: I1106 22:47:52.386885       1 node.go:517] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
kube-system/ebs-csi-node-xfswb[ebs-plugin]: I1106 22:47:52.417419       1 node.go:454] NodeGetVolumeStats: called with args {VolumeId:vol-043af2d4d87d50674 VolumePath:/var/lib/kubelet/pods/8b4de6ba-3d9d-4da0-a732-27ed069a1534/volumes/kubernetes.io~csi/pvc-547b32a4-653c-4c6a-875e-60a47a0665c7/mount StagingTargetPath: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
kube-system/ebs-csi-controller-7b59d4c568-bx7rz[csi-provisioner]: I1106 22:47:58.574598       1 reflector.go:536] sigs.k8s.io/sig-storage-lib-external-provisioner/v8/controller/controller.go:845: Watch close - *v1.PersistentVolume total 12 items received
kube-system/ebs-csi-node-rnt7j[ebs-plugin]: I1106 22:48:09.972484       1 node.go:517] NodeGetCapabilities: called with args {XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
kube-system/ebs-csi-node-rnt7j[ebs-plugin]: I1106 22:48:09.973767       1 node.go:454] NodeGetVolumeStats: called with args {VolumeId:vol-0f8c7548f584fa546 VolumePath:/var/lib/kubelet/pods/ae16c978-0206-4a1b-9561-6b016f8391ea/volumes/kubernetes.io~csi/pvc-253a60b3-82fa-452c-8805-f360faa730a4/mount StagingTargetPath: XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}

What you expected to happen?

The io1 volume for the PVC should be created in us-west-2a, as that's where the etcd node is located.

helm chart

controller:
  extraCreateMetadata: "true"
  extraVolumeTags:
    cluster: project-eks
  logLevel: 2
  nodeSelector:
    ops: "true"
  replicaCount: 1

node:
  # tolerateAllTaints: true
  tolerations:
    - effect: NoSchedule
      operator: Exists
  logLevel: 4

storageClasses:
  - allowVolumeExpansion: true
    allowedTopologies:
    - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
        - us-west-2a
        - us-west-2b
        - us-west-2c
    annotations:
      storageclass.kubernetes.io/is-default-class: "false"
    name: gp3
    parameters:
      csi.storage.k8s.io/fstype: ext4
      tagSpecification_1: environment=staging
      type: gp3
    provisioner: ebs.csi.aws.com
    reclaimPolicy: Delete
    volumeBindingMode: WaitForFirstConsumer
  - allowVolumeExpansion: true
    allowedTopologies:
    - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
        - us-west-2a
        - us-west-2b
        - us-west-2c
    annotations:
      storageclass.kubernetes.io/is-default-class: "false"
    name: gp3-retain
    parameters:
      csi.storage.k8s.io/fstype: ext4
      tagSpecification_1: environment=staging
      type: gp3
    provisioner: ebs.csi.aws.com
    reclaimPolicy: Retain
    volumeBindingMode: WaitForFirstConsumer
  - allowVolumeExpansion: true
    allowedTopologies:
    - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
        - us-west-2a
        - us-west-2b
        - us-west-2c
    annotations:
      storageclass.kubernetes.io/is-default-class: "false"
    name: etcd
    parameters:
      csi.storage.k8s.io/fstype: ext4
      iopsPerGB: "50"
      tagSpecification_1: environment=staging
      type: io1
    provisioner: ebs.csi.aws.com
    reclaimPolicy: Retain
    volumeBindingMode: Immediate

sidecars:
  provisioner:
    logLevel: 4
  attacher:
    logLevel: 4
  snapshotter:
    logLevel: 4
  resizer:
    logLevel: 4
  nodeDriverRegistrar:
    logLevel: 4

Environment

- Driver version:

ebs-csi-controller-7b59d4c568-bx7rz  ebs-plugin             IfNotPresent  public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver  v1.12.1
ebs-csi-controller-7b59d4c568-bx7rz  csi-provisioner        IfNotPresent  k8s.gcr.io/sig-storage/csi-provisioner            v3.1.0
ebs-csi-controller-7b59d4c568-bx7rz  csi-attacher           IfNotPresent  k8s.gcr.io/sig-storage/csi-attacher               v3.4.0
ebs-csi-controller-7b59d4c568-bx7rz  csi-resizer            IfNotPresent  k8s.gcr.io/sig-storage/csi-resizer                v1.4.0
ebs-csi-controller-7b59d4c568-bx7rz  liveness-probe         IfNotPresent  k8s.gcr.io/sig-storage/livenessprobe              v2.6.0
ebs-csi-node-567pb                   ebs-plugin             IfNotPresent  public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver  v1.12.1
ebs-csi-node-567pb                   node-driver-registrar  IfNotPresent  k8s.gcr.io/sig-storage/csi-node-driver-registrar  v2.5.1
ebs-csi-node-567pb                   liveness-probe         IfNotPresent  k8s.gcr.io/sig-storage/livenessprobe              v2.6.0

dmitry-mightydevops commented 1 year ago

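OK, fixed by setting volumeBindingMode: WaitForFirstConsumer for the etcd storage class. I hadn't noticed it was set to volumeBindingMode: Immediate, which provisions the volume before the pod is scheduled and so ignores the zone of the node the pod will land on.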
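
For anyone hitting the same thing, here is a minimal sketch of what the corrected etcd entry in the chart values might look like. It assumes everything else stays as posted in the "helm chart" section above; only volumeBindingMode differs from the original values.

```yaml
storageClasses:
  - name: etcd
    provisioner: ebs.csi.aws.com
    allowVolumeExpansion: true
    reclaimPolicy: Retain
    # Changed from Immediate: provisioning now waits until a pod that uses
    # the PVC is scheduled, so the volume is created in that node's zone.
    volumeBindingMode: WaitForFirstConsumer
    allowedTopologies:
    - matchLabelExpressions:
      - key: topology.ebs.csi.aws.com/zone
        values:
        - us-west-2a
        - us-west-2b
        - us-west-2c
    annotations:
      storageclass.kubernetes.io/is-default-class: "false"
    parameters:
      csi.storage.k8s.io/fstype: ext4
      iopsPerGB: "50"
      tagSpecification_1: environment=staging
      type: io1
```

Note that the PVC/PV already bound in us-west-2b is not moved by this change, so it would likely need to be deleted and recreated. As far as I know, volumeBindingMode cannot be edited on an existing StorageClass, so the class itself may also have to be deleted and re-applied. An alternative that should also work with Immediate binding is narrowing allowedTopologies to us-west-2a only, though I have not tested that here.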