Static EBS Volume Allocations on CSINode CRDs

CoreyCook8 commented 2 months ago

/kind bug

What happened?

Persistent Volumes failed to attach to a host due to EBS Volume Limits not adjusting for dynamically allocated ENIs.

What you expected to happen?

The EBS Volume allocation field on the CSINode should update at some point after the pod starts up.

While it might not be ideal to continuously update this field, there should be some kind of logic to re-calculate these values other than only updating on startup. It could be in the event that the CSI driver receives a timeout waiting for the EBS volumes to attach to a node?

How to reproduce it (as minimally and precisely as possible)?

Startup a host in a k8s cluster. Check the corresponding CSINode allocatable number of EBS volumes. After the aws-ebs-csi-driver has started, attach an ENI to the host. Observe the CSINode allocatable number of EBS volumes does not change. Assign the CSINode allocatable number of EBS volumes to a host. Verify that the last one cannot mount the host and the CSINode allocatable field is not updated.

Anything else we need to know?:

For our use case, cilium is dynamically provisioning ENIs which scale based on the number of pods scheduled to the host. Because of the dynamic nature here, it is not ideal to allocate a number of --reserved-volume-attachments In the event of a container restart, the new allocation calculation will count the attached ENIs and the allocatable EBS volumes will be less than what is actually available.

Example: We start with 28 EBS volume slots, 2 host EBS volumes, 1 ENI. The calculated allocatable field will be 25. If we reserve 7 slots we are down to 20 (Total - reserved - attached ENIs) and we ensure cilium has 4 ENI slots to dynamically allocate. If the aws-ebs-csi-driver restarts after cilium is using 4 ENI slots, we will be down to 16 (Total - reserved - 5 ENIs) allocatable EBS volumes and 4 will be unused.

Environment

Kubernetes version (use kubectl version): 1.28
Driver version: v1.32.0

wmgroot commented 1 month ago

We've also run into this problem and have been considering what could be done to address it.

One alternative to consider to updating the allocation calculation dynamically is to assume that every potential ENI is already in use when computing the allocation amount.

Pros:

Less moving parts, no chance of a scheduling race condition occurring.

Cons:

Allocatable volumes will be slightly lower than what each node could actually accommodate.

For our use cases, maximizing the number of supported volumes isn't critical. Following your example, losing a handful of potential volumes on a node that can hold 25 is acceptable, and k8s should provision additional nodes to satisfy pods that require volume mounts without issue.

torredil commented 1 month ago

A general update on this: we are actively working with the relevant Kubernetes sigs to better account for the dynamic nature of volume limits on EC2. There is a KEP which aims to address this pain point and is available for review here: https://github.com/kubernetes/enhancements/pull/4875. Please feel free to leave any feedback directly on the PR.

In the meantime, we recommend adopting one or more of the below solutions if you are experiencing volume limit issues caused by the limit changing after startup:

Dedicated EBS Instance Types: gen7 and later have dedicated EBS volume limits and are not subject to this dynamic ENI attachment taking up volume slots issue.
Use VPC CNI's prefix delegation feature.
The driver can be started with the CLI option --volume-attach-limit to explicitly specify the limit for volumes to be reported to Kubernetes. This parameter can be used in cases where you have a known safe limit.
The driver can be started with the CLI option --reserved-volume-attachments to reserve a number of slots for non-CSI volumes. These reserved slots will be subtracted from the total slots reported to Kubernetes.
For clusters that need a mix of the above solutions, our Helm chart has the ability to construct multiple DaemonSets via the additionalDaemonSets parameter. For more information about this feature, see the docs.

torredil commented 1 month ago

@wmgroot Thank you for this feedback, it's very helpful! Please take a look at the solutions listed above, one or more of these may be suitable for your specific use case and could help alleviate the pain in the short to medium term while we work on the Kubernetes enhancement. We're aware that this specific issue is incredibly disruptive and are striving to resolve it.

For our use cases, maximizing the number of supported volumes isn't critical.

That absolutely makes sense, especially as the existing behavior leads to stuck volume attachments when the driver attempts to maximize the attachment slots. As you could imagine, changing this default behavior would be incredibly difficult because there are certainly a large number number of users who care deeply about maximizing the attachment slots (even if its done on a best effort basis). That said, PTAL at the --reserved-volume-attachments configuration option, it may be a very helpful workaround for your specific use case while the team works on the enhancement 🙂

wmgroot commented 1 month ago

No worries. I definitely wasn't hoping that the default behavior would change, and definitely understand that there are probably users out there who care a lot about maximizing the number of pods with EBS volumes they can fit onto their nodes. I would expect that my suggestion was simply an option not enabled by default that could be toggled on if it made sense for each user.

Regarding --reserved-volume-attachments, we operate EBS CSI over a slew of different instance types provided by Karpenter. I think we rely on the dynamic nature of --reserved-volume-attachments=-1 because of that. In theory we could try to set a value that covers the biggest instance types, but this could break if AWS provides EC2 instances with larger values in the future, or if smaller instance types don't support enough volume attachments to offset the largest ENI counts on bigger instances.

Running multiple daemonsets to handle different instance type requirements in the strategy above also seems like it could easily spiral out of control.

I'll take a look at the prefix delegation option you've linked, I'm not sure my team is aware of that recommendation or if it would help us in this case.

torredil commented 1 month ago

@wmgroot Makes sense, thank you for the additional context.

With regards to prefix delegation, it's generally considered a best practice so it's definitely worth exploring. There are some good recommendations, more helpful background, and instructions for enabling this mode in the aws-eks-best-practices/networking docs.

kubernetes-sigs / aws-ebs-csi-driver

Static EBS Volume Allocations on CSINode CRDs #2153