kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/
Apache License 2.0

instance volume limits: workloads no longer attach ebs volumes #1163

Closed aydosman closed 1 month ago

aydosman commented 2 years ago

/kind bug

What happened? Workloads stop attaching EBS volumes after a node reaches its instance volume limit; the expected number of replicas for our requirement isn't met and pods remain in a Pending state.

Nodes have the appropriate limit set to 25 but the scheduler sends more than 25 pods with volumes to a node.

kubelet Unable to attach or mount volumes: unmounted volumes=[test-volume], unattached volumes=[kube-api-access-redact test-volume]: timed out waiting for the condition

attachdetach-controller AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

ebs-csi-controller driver.go:119] GRPC error: rpc error: code = Internal desc = Could not attach volume "vol-redact" to node "i-redact": attachment of disk "vol-redact" failed, expected device to be attached but was attaching

How to reproduce it (as minimally and precisely as possible)?

Deploying the test below should be sufficient to simulate the problem.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Update: adding a liveness probe with an initial delay of 60 seconds seems to work around the problem; our nodes scale, and the replica count is correct with volumes attached.

apiVersion: v1
kind: Namespace
metadata:
  name: vols
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: vols
  name: vols-pv-test
spec:
  selector:
    matchLabels:
      app: vols-pv-test 
  serviceName: "vols-pv-tester"
  replicas: 60
  template:
    metadata:
      labels:
        app: vols-pv-test
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        livenessProbe:
          tcpSocket:
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 10          
        volumeMounts:
        - name: test-volume
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: test-volume
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "******" # something with a reclaim policy of delete
      resources:
        requests:
          storage: 1Gi

Environment

ryanpxyz commented 2 years ago

Hello,

I believe we are seeing this too.

Warning  FailedAttachVolume  34s (x11 over 8m47s)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-61b4bf2c-541f-4ef1-9f21-redacted" : rpc error: code = Internal desc = Could not attach volume "vol-redacted" to node "i-redacted": attachment of disk "vol-redacted" failed, expected device to be attached but was attaching 

...with:

# /bin/kubelet --version
Kubernetes v1.20.11-eks-f17b81

and:

k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.4.0

Thanks,

Phil.

stevehipwell commented 2 years ago

I suspect this is a race condition somewhere; my current thinking is that it's in the scheduler, but I haven't had a chance to look at it further.

gnufied commented 2 years ago

Is the CSI driver running with correctly defined limits? What does the CSINode object from the node report?
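For reference, the reported limit can be read straight from the CSINode object; a quick check (node name below is a placeholder):

kubectl describe csinode ip-10-0-0-1.eu-west-1.compute.internal
# or just the allocatable count reported by the EBS driver:
kubectl get csinode ip-10-0-0-1.eu-west-1.compute.internal \
  -o jsonpath='{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}'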

stevehipwell commented 2 years ago

@gnufied the CSI driver looks to be doing everything correctly; AFAIK the only thing it needs to do is report the maximum number of PV attachments it can make. As reported above, if you add latency between pod scheduling events, the pods are sent to nodes with space for the PV mounts, which is why I suspect it's a scheduler issue.

gnufied commented 2 years ago

@stevehipwell No, that shouldn't happen. We start counting volumes against the limit before pods are even started on the node. I am still waiting on the output of the CSINode object from the problematic node.

stevehipwell commented 2 years ago

@gnufied I agree that the CSI driver is reporting correctly, which, combined with the 60s wait fixing the issue, makes me believe that this is actually happening elsewhere as a race condition.

sultanovich commented 2 years ago

I am seeing the same problem in my environment:

Events:
  Type     Reason       Age   From                                   Message
  ----     ------       ----  ----                                   -------
  Normal   Scheduled    2m5s  default-scheduler                      Successfully assigned 2d3b9d81e0b0/master-0 to ip-10-3-109-222.ec2.internal
  Warning  FailedMount  2s    kubelet, ip-10-3-109-222.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[master], unattached volumes=[master backups filebeat-configuration default-token-vjscf]: timed out waiting for the condition

I haven't been able to confirm whether they are related, but I see this happening on the same node that has ENIs stuck in the "attaching" state, and I see the following errors in /var/log/aws-routed-eni/ipamd.log:

{"level":"error","ts":"2022-02-01T19:45:34.410Z","caller":"ipamd/ipamd.go:805","msg":"Failed to increase pool size due to not able to allocate ENI AllocENI: error attaching ENI: attachENI: failed to attach ENI:AttachmentLimitExceeded: Interface count 9 exceeds the limit for c5.4xlarge\n\tstatus code: 400, request id: 836ce9b1-ec63-4935-a007-739e32f506cb"}

For reference, the c5.4xlarge instance type in AWS supports 8 ENIs.

We are testing whether we can limit it using the volume-attach-limit variable, which is not set yet, but I would first like to understand why this happens and whether there is a way to avoid hardcoding that value.

Environment

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.11-eks-f17b81", GitCommit:"f17b810c9e5a82200d28b6210b458497ddfcf31b", GitTreeState:"clean", BuildDate:"2021-10-15T21:46:21Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
[user@admin [~] ~]$ kubectl -n kube-system get deployment ebs-csi-controller -o wide -o yaml | grep "image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver"
        image: k8s.gcr.io/provider-aws/aws-ebs-csi-driver:v1.0.0
[user@admin [~] ~]$
[root@i-6666919f09cc78046 ~]# /usr/bin/kubelet --version
Kubernetes v1.19.15-eks-9c63c4
[root@i-6666919f09cc78046 ~]#
ryanpxyz commented 2 years ago

Hello,

... update from our side:

Our first simple workaround, as we first observed the problem yesterday (it might help others who are stuck and looking for a quick fix):

  1. Cordon the node that the pod is stuck in 'Init' on.
  2. Delete the pod.
  3. Verify that the pod starts successfully on an alternative node; if not, repeat the cordoning until the pod is successfully deployed.
  4. Uncordon all cordoned nodes once the deployment succeeds.
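In command form the workaround looks roughly like this (node and pod names are placeholders):

# Rough command sequence for the workaround above
kubectl cordon ip-10-0-0-1.eu-west-1.compute.internal    # stop new pods landing on the full node
kubectl delete pod vols-pv-test-0 -n vols                # force the stuck pod to reschedule
kubectl get pod vols-pv-test-0 -n vols -o wide           # verify it started on another node
kubectl uncordon ip-10-0-0-1.eu-west-1.compute.internal  # once the deployment succeeds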

Then, following a dive into the EBS CSI driver code, we passed the option '--volume-attach-limit=50' to the node driver. I haven't tested this explicitly yet, however.
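To confirm the flag is in place, something like the following should work (a sketch, assuming the default ebs-csi-node DaemonSet name and that the EBS plugin is the first container in the pod spec):

# Check the args on the node plugin DaemonSet (names/indexes are assumptions)
kubectl -n kube-system get daemonset ebs-csi-node \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
# After the node pods restart, the new limit should show up in the CSINode object
kubectl describe csinode <node-name> | grep -A1 Allocatables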

The problem to me seems to be a missing feedback loop between the 'node driver' and the scheduler.

The scheduler says, "Hey, there's a node that satisfies my scheduling criteria ... I'll schedule the workload to run there ..." and the node driver says, "OK, I have a workload but I've reached this '25 attached volumes' limit so I'm done here ...".

This is just my perhaps primitive view of the situation.

Thanks,

Phil.

PS ... following a re-deployment of the EBS CSI node driver, we are still seeing the attribute 'attachable-volumes-aws-ebs' set to 25 on a 'describe node':

[screenshot: kubectl describe node output showing attachable-volumes-aws-ebs: 25]

... we weren't expecting this.

stevehipwell commented 2 years ago

@ryanpxyz looking at the code I think the CSI just reports how many attachments it can make. Until the PR to make this dynamic is merged and released this is a fixed value by instance type or arg. This means there are two related but distinct issues.

The first is the incorrect max value, which doesn't take into account all Nitro instances and their other attachments. For example, a Nitro instance (5 series only) with no arg will have a limit of 25, which is correct as long as you only have 3 extra attachments. If you're using custom networking and prefixes, this means instances without an additional NVMe drive work, but ones with one get stuck.

The second problem, which is what this issue is tracking, is that even when the reported max is correct, it is still possible for too many pods to be scheduled on a node.

stevehipwell commented 2 years ago

@sultanovich see my reply above about the attachment limits. I think there is a separate issue and PR for resolving your problem.

aydosman commented 2 years ago

@gnufied output from the workers while running the original tests with limits

Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:34:19 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>

Name:               ip-**-**-**-**.eu-west-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/cinder
CreationTimestamp:  Thu, 17 Feb 2022 07:44:10 +0000
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-redacted
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>
sultanovich commented 2 years ago

@stevehipwell I have no doubts about the limits on how many volumes can be attached. My question is about why this happens and how to solve it.

I have opened a new issue (#1174), since the volume-attach-limit argument has not worked for me either.

Perhaps the problem with the ENIs stuck in the attaching state is due to another cause; what I meant is that after that error I begin to see the volume problems.

stevehipwell commented 2 years ago

@sultanovich I think I've explained pretty much everything I know about this issue. Let me reiterate that there are two bugs here: the first, which is related to nodes not being picked up as Nitro or not having 25 free attachment slots, is being addressed by https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1075; the second, currently unexplained, is related to the speed at which requests for pods with PVs are sent to the scheduler. The second scenario is what this issue was opened for; with your new issue there are now a number of other issues relating to the first scenario.

Perhaps the problem with the ENIs stuck in the attaching state is due to another cause; what I meant is that after that error I begin to see the volume problems.

The current driver doesn't take any dynamic attachments into consideration: you get 25 if the node is detected as Nitro, or 39 if not. If you are getting failures on a Nitro instance that isn't a 5 series, has NVMe drives, or is using more than 2 ENIs, you should be able to statically fix the problem by using the --volume-attach-limit argument. If you're using an m5 instance but requesting lots of PVs, it's likely that you're seeing this issue; you should be able to stop it happening by changing your deployment strategy and adding a wait between pods.

gnufied commented 2 years ago

@ryanpxyz you are looking in the wrong place for the attach limits of the CSI driver. The attach limit of the CSI driver is reported via CSINode objects. If we are not rebuilding CSINode objects during a redeploy of the driver, that sounds like a bug. So setting --volume-attach-limit and redeploying the driver should set the correct limits.

As for a bug in the scheduler, here is the code for counting the limits: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/nodevolumelimits/csi.go#L210 . It's been a while since I looked into the scheduler code, but if the scheduler is not respecting the limits reported by CSINode then that would be a k/k bug (and we would need one filed).

gnufied commented 2 years ago

@bertinatto - are you aware of a bug where if many pods are scheduled at once to a node then scheduler may not correctly count the volume limits?

stevehipwell commented 2 years ago

@gnufied it looks like it's the Filter function that is doing the work we're interested in. Unless only a single pod can be scheduled at a time, which is unlikely, this code doesn't appear to check for other in-flight requests and could easily result in over-provisioning volumes on a node.

I would expect to see something to lock a CSINode so only one calculation at a time could run, but I might be missing something here as I'm not really familiar with this part of the codebase.

As an aside would supporting Storage Capacity Tracking help limit the blast radius of this issue?

sultanovich commented 2 years ago

@gnufied I tried setting the --volume-attach-limit argument in a test environment and it worked fine. The only limitation I found is that it applies to the entire cluster; with mixed node types, other AWS instance types would be limited in the number of volumes they can host, increasing infrastructure costs.

Do you have any idea how long it might take to modify this check so it uses the correct limits for all instance types?

stevehipwell commented 2 years ago

@sultanovich this issue isn't the same as #1174; please don't confuse them.

stevehipwell commented 2 years ago

@gnufied @bertinatto do you have any more thoughts on this? I doubt I've read the code correctly, so I would appreciate someone looking at the code I mentioned above to see if they spot the same potential issue.

stevehipwell commented 2 years ago

On further testing, it looks like this has been fixed via an EKS platform version update (I suspect); I'd be interested if anyone knows exactly what was fixed.

jrsdav commented 2 years ago

@stevehipwell The EKS AMI changelog for the most recent v20220406 release had one interesting note that might be relevant:

The bootstrap script will auto-discover maxPods values when instanceType is missing in eni-max-pods.txt

stevehipwell commented 2 years ago

@jrsdav thanks for looking out, but that functionality sets a kubelet arg (incorrectly in most cases) and isn't related to storage attachments. This issue was never about the correct max value being set for attachments (that's a separate issue with a fix coming in the next minor version); it was a scheduling issue that didn't make much sense.

LeeHampton commented 2 years ago

We're experiencing this issue as well, except that from some of the discussion above it sounds like people think it's some kind of scheduling race condition. In our case, it seems like the volume attachments are never being properly counted. We have a node with 25 attachments, but the Allocated Resources section under kubectl describe node shows zero attachments:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests        Limits
  --------                    --------        ------
  cpu                         35170m (73%)    38 (79%)
  memory                      130316Mi (68%)  131564Mi (68%)
  ephemeral-storage           0 (0%)          0 (0%)
  hugepages-1Gi               0 (0%)          0 (0%)
  hugepages-2Mi               0 (0%)          0 (0%)
  attachable-volumes-aws-ebs  0               0

Any leads on what might be causing that to happen?

gnufied commented 2 years ago

Again, it looks like you are looking at the wrong object. CSI volume limits are counted via CSINode objects, so please check what value that is reporting.

LeeHampton commented 2 years ago

@gnufied Ah, okay. Thank you. It looks like the "allocatables" are indeed being properly counted, which I guess puts us in the race condition boat:

k describe csinode  ip-172-20-60-87.ec2.internal

Name:               ip-172-20-60-87.ec2.internal
Labels:             <none>
Annotations:        <none>
CreationTimestamp:  Wed, 27 Apr 2022 05:12:27 -0400
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0f37978c6d1e25a52
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone topology.kubernetes.io/zone]
Events:               <none>
LeeHampton commented 2 years ago

@gnufied, actually, is "Allocatables" just the total limit? How do I see what it thinks is currently allocated?
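For what it's worth, the Allocatables count is only the limit; the volumes that are actually attached are tracked as VolumeAttachment objects, so listing those for a node gives a rough view of live usage (a sketch, using the node name from above):

# Each row is one attached (or attaching) CSI volume on the cluster;
# filter by node name to see what is currently consumed there
kubectl get volumeattachments \
  -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
  | grep ip-172-20-60-87.ec2.internal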

Legion2 commented 2 years ago

We are using CSI volumes and in-tree volumes at the same time and see similar errors. Even if CSI volumes are counted correctly, there are also non-CSI volumes attached to the nodes, which results in the underlying node limit being exceeded. Is this situation addressed by any of the linked issues?

pkit commented 2 years ago

@gnufied csinode reports total bullshit as well

Spec:
  Drivers:
    efs.csi.aws.com:
      Node ID:  i-0cf67b141b4d31d04
    ebs.csi.aws.com:
      Node ID:  i-0cf67b141b4d31d04
      Allocatables:
        Count:        39
      Topology Keys:  [topology.ebs.csi.aws.com/zone]

And it fails exactly after 25 volumes.

pkit commented 2 years ago

Okay, after browsing through the code it looks like it uses a lot of "heuristics" based on some AWS docs and not on the actual truth in the field. I raised a support ticket with AWS to try to find out what's going on and why that information isn't available through instance metadata, for example.

jortkoopmans commented 2 years ago

I'm experiencing similar EBS scheduling issues here, though I think multiple bugs/problems are being mixed together in the discussion on this interesting ticket.

My situation (and 2 cents):

This happens when a new node joins, Nitro type.

CSINode reports OK:

Name:               ip-10-18-56-105.eu-central-1.compute.internal
Labels:             <none>
Annotations:        storage.alpha.kubernetes.io/migrated-plugins: kubernetes.io/aws-ebs,kubernetes.io/azure-disk,kubernetes.io/cinder,kubernetes.io/gce-pd
CreationTimestamp:  Mon, 05 Sep 2022 11:06:32 +0200
Spec:
  Drivers:
    ebs.csi.aws.com:
      Node ID:  i-0680a3a4beff784e8
      Allocatables:
        Count:        25
      Topology Keys:  [topology.ebs.csi.aws.com/zone]
Events:               <none>

Pods are being scheduled, with up to 24 working volumes. Then there are 2 additional volumes stuck in the Attaching state. Technically only 1 of those could be attached given the Nitro limits, so I suspect some sort of race condition. This might be identical to the more narrowly defined issue #1278, but I'm not fully sure.

The manual workaround to get pods unstuck is to cordon the node and kill the stuck pods; the stuck PVCs will then be moved elsewhere. Afterwards you can uncordon the node.

ddmunhoz commented 2 years ago

@jortkoopmans we have the same issue as you. As a temporary workaround, we deployed the CSI driver using the Helm chart and set --volume-attach-limit=XX.

This will change the limit for all nodes/instance types in your cluster.

To test it out first, change it manually on the csi-driver daemon set.
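For example, something along these lines (a sketch, assuming the upstream chart and its node.volumeAttachLimit value; older chart versions may expose the setting under a different key, as mentioned later in this thread):

# Sketch: set a cluster-wide attach limit via the Helm chart
helm repo add aws-ebs-csi-driver https://kubernetes-sigs.github.io/aws-ebs-csi-driver
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.volumeAttachLimit=25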

pkit commented 2 years ago

@jortkoopmans it looks like it's exactly #1278; I've tested it too. It's always the last two pods that fail if they are scheduled dynamically one after another. Essentially it's a deal breaker for any dynamic allocation of pods.

pkit commented 2 years ago

@ddmunhoz as you can see from #1278 it does not help at all. Pods are still stuck.

ddmunhoz commented 2 years ago

@pkit I can't vouch for anyone else; it would be hard to confirm without knowing the state of the cluster, the number of EBS volumes attached that were not migrated to the CSI provisioner, and everything else that affects the calculation in the end.

But here in our cluster, which has a highly dynamic load (we deploy MongoDB as a service from our API), it does work fine.

Granted, that assumes the CSI driver is the only thing provisioning/attaching volumes and that all volumes were created/attached by it.

pkit commented 2 years ago

@ddmunhoz Do you use a StatefulSet for your Mongos? We provision databases with a multi-pod StatefulSet and it gets stuck. No matter what you reduce --volume-attach-limit=N to, it will always happen on PVC N-1.

jmhwang7 commented 2 years ago

We run and manage our own k8s clusters on top of EC2 instances, but we run the aws-ebs-csi-driver to manage ebs volumes.

We discovered a pretty gnarly bug in 1.10: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1361 . After setting the volume-attach-limit as suggested in that issue, we started seeing exactly what @jortkoopmans detailed in this comment: https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1163#issuecomment-1237132913 . Even for older nodes (multiple days/weeks old), when 2 pods get scheduled around the same time, only the N-1 pod, where N is the volume-attach-limit set, has its volume attached successfully, and the Nth pod's volume gets stuck in an "attaching" state.

We initially thought this was due to a race condition in the Kubernetes scheduler, so we lowered the manually set volume-attach-limit. We are still getting paged/running into this issue despite the fact that the node has significantly fewer volumes attached than the EC2 instance can support (24 volumes + 1 ENI = 25, while 28 is the limit for the Nitro instance; the 24th volume gets stuck in attaching).

@pkit did you have any luck with the support ticket you filed with AWS?

pkit commented 2 years ago

@jmhwang7 nothing came out of it; they provided the exact "calculation" for Nitro volumes that is already used in the driver. See their answer below:

Please allow me to inform you that most of the Nitro instances support a maximum of 28 attachments. Attachments include network interfaces, EBS volumes, and NVMe instance store volumes.

However, there are exceptions for few Nitro instances. For these instances, the following limits apply:

  • d3.8xlarge and d3en.12xlarge instances support a maximum of 3 EBS volumes.
  • inf1.xlarge and inf1.2xlarge instances support a maximum of 26 EBS volumes.
  • inf1.6xlarge instances support a maximum of 23 EBS volumes.
  • inf1.24xlarge instances support a maximum of 11 EBS volumes.
  • Most bare metal instances support a maximum of 31 EBS volumes.
  • mac1.metal instances support a maximum of 16 EBS volumes.
  • High memory virtualized instances support a maximum of 27 EBS volumes.
  • High memory bare metal instances support a maximum of 19 EBS volumes. If you launched a u-6tb1.metal, u-9tb1.metal, or u-12tb1.metal high memory bare metal instance before March 12, 2020, it supports a maximum of 14 EBS volumes. To attach up to 19 EBS volumes to these instances, contact your account team to upgrade the instance at no additional cost.

The same is mentioned in the below AWS documentation: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html#instance-type-volume-limits

All other AWS Nitro instances (excluding the above mentioned exceptions) support maximum of 28 attachments.

pkit commented 2 years ago

@jmhwang7

We initially thought this was due to a race condition in the Kubernetes scheduler, and so we lowered the manually set volume-attach-limit. We are still getting paged/running into this issue despite the fact that the node has significantly less volumes attached than the ec2 instance can support (24 volumes + 1 eni = 25, 28 is the limit for the nitro instance, 24th volume gets stuck in attaching).

What I'm seeing is that the calculation is much more complex than "28 for Nitro", so I suggest trying to lower the number of volumes to something like 23 (naturally, all our CSINode objects report a max of 25 right now on Nitro).

jortkoopmans commented 2 years ago

What I'm seeing is that the calculation is much more complex than "28 for Nitro", so I suggest trying to lower the number of volumes to something like 23 (naturally, all our CSINode objects report a max of 25 right now on Nitro).

My understanding is that ENIs and the EC2 boot volume also need to be subtracted from this number. So your numbers will likely vary depending on the number of ENIs attached, which in turn depends on the number of pods on the node.

Example from one of my nodes: 28 (Nitro limit) - 1 (boot volume) - 4 (ENIs) = 23, which matches the number reported in CSINode.

This of course emphasizes how wasteful it is to have to override the limit manually, as you need to set it to the lowest possible number for your node/ENI configuration.

pkit commented 2 years ago

@jortkoopmans yup, but right now my ops want to go even further and limit the total number of pods on a node. That's how much of a clusterfuck this one is...

sotiriougeorge commented 1 year ago

@sultanovich I think I've explained pretty much everything I know about this issue. Let me reiterate that there are two bugs here: the first, which is related to nodes not being picked up as Nitro or not having 25 free attachment slots, is being addressed by #1075; the second, currently unexplained, is related to the speed at which requests for pods with PVs are sent to the scheduler. The second scenario is what this issue was opened for; with your new issue there are now a number of other issues relating to the first scenario.

Perhaps the problem with the ENIs stuck in the attaching state is due to another cause; what I meant is that after that error I begin to see the volume problems.

The current driver doesn't take any dynamic attachments into consideration: you get 25 if the node is detected as Nitro, or 39 if not. If you are getting failures on a Nitro instance that isn't a 5 series, has NVMe drives, or is using more than 2 ENIs, you should be able to statically fix the problem by using the --volume-attach-limit argument. If you're using an m5 instance but requesting lots of PVs, it's likely that you're seeing this issue; you should be able to stop it happening by changing your deployment strategy and adding a wait between pods.

Hi again, we are running a stress test on our cluster using some m6a machine types. Truth be told, we are on a fairly old ebs-csi-driver version (v1.5.1) due to the Terraform module used to deploy our infrastructure. I tried increasing the ebs-csi-driver Helm chart value volumeAttachLimit to 70, and while the change was applied, describing the node still shows the 39 volume attachment attribute:

Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         64
  ephemeral-storage:           157274092Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      258588856Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         63770m
  ephemeral-storage:           143870061124
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      249923768Ki
  pods:                        250

I will also admit I don't have much experience with either EKS or the EBS driver, and I am trying to wrap my head around what the problem is. What could I try in my use case, or what would you advise?

stevehipwell commented 1 year ago

@sotiriougeorge I might be completely missing the point of your question here, and you really need to provide the actual version you're using, but I'll go on.

Firstly, as the m6a instance is a Nitro instance, you're physically limited to 28 attachments (see the docs), some of which might also be used for non-volume attachments. So I don't see any value in raising the limit to 70, as it's not physically possible to attach more than 27 volumes to a Nitro instance (even if you attach nothing else, every node needs at least one ENI).

Secondly, this thread (and others) covers the limitations in earlier versions of this CSI driver, specifically around not detecting anything other than 5th generation instances as Nitro instances and not calculating available attachments from the existing attachments. All of these are fixed in recent versions of the CSI driver. I've seen you've commented on https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1258 so you're aware that you should ignore the attachable-volumes-aws-ebs value and look at the CSINode instead.

My recommendation would be to update to the latest version of the driver and see how that works for you.

sotiriougeorge commented 1 year ago

@stevehipwell thank you for your prompt and detailed reply and explanation. I have a clearer picture now.

Just to provide some clarity, I am trying to stress test my cluster with some workloads deployed via Helm. Sadly, the majority of those workloads require an EBS volume attachment, and this is when I ran into the error message AttachVolume.Attach failed for volume "pvc-redact" : rpc error: code = Internal desc = Could not attach volume "vol-xxxxxxx" to node "i-xxxxxxx": attachment of disk "vol-xxxxxx" failed, expected device to be attached but was attaching, which is what led me here in the first place.

From reading both your replies and the docs you recommended, I realised the following:

So within the context of my use case: when I run a lot of small workloads on an m6a Node which is otherwise capable of supporting hundreds of Pods, I am inevitably going to run into the issue of "running out" of available attachments if all my Pods require their own volume.

To make matters worse, a large number of small Pods all requiring IP addresses increases the number of ENI attachments on my Node, which further lowers my available EBS attachments.

So I could try bumping the driver version per your suggestion, but if I understood everything correctly, that wouldn't help with what I am trying to do. The "sensible" thing is to use either larger workloads to fill up my Node, or workloads that don't require EBS attachments at all.

dyasny commented 1 year ago

So within the context of my use case: when I run a lot of small workloads on an m6a Node which is otherwise capable of supporting hundreds of Pods, I am inevitably going to run into the issue of "running out" of available attachments if all my Pods require their own volume.

To make matters worse, a large number of small Pods all requiring IP addresses increases the number of ENI attachments on my Node, which further lowers my available EBS attachments.

My use case exactly. Essentially, this is AWS forcing you to pay for more instances. I am currently working on two things:

  1. Drop VPC CNI for an overlay-based setup; this should mitigate the ENI attachment limitation (yes, I am aware of the prefix hack, and it still doesn't cut it).
  2. Drop EBS in favour of something self-managed and more suitable for the use case of many small pods with many small volumes attached.
stevehipwell commented 1 year ago

@sotiriougeorge this isn't directly CSI related, but I'd suggest switching over to IP prefix mode, which should mean you only need a single ENI (or 2 for custom networking). Secondly, according to the Kubernetes documentation, 110 pods per node is the upper limit and a good rule of thumb; the original EKS limit is based on the maximum IPs per instance, which I can't see any real justification for once it passes the 110 value. Thirdly, Kubernetes isn't designed primarily for stateful workloads, and where they are used it's usually for a service with a high resource requirement, meaning you don't need to bin-pack lots of pods onto the same node.
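For context, prefix mode is enabled via an environment variable on the VPC CNI (aws-node) DaemonSet; roughly the following, though check the VPC CNI docs for version requirements before changing a live cluster:

# Sketch: enable IP prefix delegation on the VPC CNI
kubectl -n kube-system set env daemonset aws-node ENABLE_PREFIX_DELEGATION=true
kubectl -n kube-system set env daemonset aws-node WARM_PREFIX_TARGET=1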

Out of interest what sort of load testing needs actual volumes per pod rather than just an emptyDir?

sotiriougeorge commented 1 year ago

@stevehipwell the concept is that the platform I am working on (obviously backed by EKS at this point) offers its end users the option to deploy their own workloads, all of which ... or rather most of which ... are backed by EBS volumes.

So the stress-testing of the cluster aims to discover what would happen if the users "went ham" on the platform and what kind of restrictions should be put in place as far as workload deployments are concerned.

Out of interest what sort of load testing needs actual volumes per pod rather than just an emptyDir?

I'd say it's more of a "volumes per most Pods".

Thank you for your suggestions though; IP prefix mode is something I had seen but hadn't found the time to deep dive into and see how it would help me - sometimes you can only absorb so much new info. I'm done hijacking this thread!

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

greenaar commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

greenaar commented 1 year ago

/remove-lifecycle stale