Hey @mpatlasov, thank you for raising this issue! We will factor the accelerator count for these instance types into the volume limit reported at node startup by the next release (as well as any other devices that we are missing).
Really appreciate the detailed ramp up and resources on this!
/assign @ElijahQuinones
@AndrewSirenko: GitHub didn't allow me to assign the following users: ElijahQuinones.
Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide
/priority important-soon
Hi @mpatlasov,
The PR for GPUs not being factored in has already been merged, and the PR for accelerators is in review right now.
As for your observation:
> There must be other contributors (other than GPUs) because for vt1* instance types the actual number doesn't decrease monotonically
The VT1 instance family is special in that both the vt1.3xlarge and the vt1.6xlarge have accelerators that take up two attachment slots each, while the vt1.24xlarge's accelerators do not take up any attachment slots at all. This is not well documented, and I have cut an internal documentation ticket to correct this.
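If I have the per-instance accelerator counts right (1 accelerator on vt1.3xlarge, 2 on vt1.6xlarge, and 8 on vt1.24xlarge; worth double-checking against the EC2 docs), that explains the numbers you observed when starting from the reported limit of 26:

- vt1.3xlarge: 26 - 1 × 2 = 24
- vt1.6xlarge: 26 - 2 × 2 = 22
- vt1.24xlarge: 26 - 8 × 0 = 26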
Please let me know if you have any further questions or concerns!
/kind bug
What happened?
kubectl get csinode <node-name> -o json | jq .spec.drivers
says that allocatable.count is 26 for vt1 instance types and 25 for g4 ones, while the actual number of volumes that can be attached to the node is smaller:

| type | reported | actual |
| --- | --- | --- |
| g4dn.xlarge | 25 | 24 |
| g4ad.xlarge | 25 | 24 |
| vt1.3xlarge | 26 | 24 |
| vt1.6xlarge | 26 | 22 |
There are many other g4 instance types mentioned here, but I verified the issue only for g4dn.xlarge and g4ad.xlarge. The reported number for vt1.24xlarge (26) is correct, while the numbers for the other vt1 types are not.
What you expected to happen?
kubectl get csinode
must report the correct maximum number of volumes that can be attached.

How to reproduce it (as minimally and precisely as possible)?
Apply the following StatefulSet with 26 replicas:
After a while, some pods get stuck in "ContainerCreating" status because their volumes are stuck in the attaching state and cannot be attached to the node. The error for a stuck pod looks like this:
Anything else we need to know?:
The official doc "Amazon EBS volume limits for Amazon EC2 instances" states clearly that GPUs (and accelerators) must be counted:
while getVolumesLimit() does not take them into account. It starts from availableAttachments=28 for Nitro instances, then applies the following arithmetic:
e.g. 28 - 1 - 1 - 1 == 25 for g4ad.xlarge.
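For illustration, here is a minimal sketch (in Go, with hypothetical parameter names; not the driver's actual getVolumesLimit() code) of the kind of adjustment that seems to be missing, namely also subtracting the attachment slots consumed by GPUs/accelerators:

```go
package main

import "fmt"

// volumeLimitSketch is a hypothetical, simplified version of the calculation:
// start from the Nitro baseline, subtract the deductions the driver already
// applies, and additionally subtract slots consumed by GPUs/accelerators.
func volumeLimitSketch(existingDeductions, accelerators, slotsPerAccelerator int) int {
	availableAttachments := 28 // Nitro baseline
	availableAttachments -= existingDeductions
	// Missing piece: GPUs/accelerators may each consume one or more attachment slots.
	availableAttachments -= accelerators * slotsPerAccelerator
	return availableAttachments
}

func main() {
	// g4ad.xlarge: 28 - 3 (current deductions) - 1 GPU using 1 slot = 24, the observed value.
	fmt.Println(volumeLimitSketch(3, 1, 1))
}
```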
There must be other contributors (other than GPUs), because for vt1* instance types the actual number doesn't decrease monotonically:
| type | reported | actual |
| --- | --- | --- |
| vt1.3xlarge | 26 | 24 |
| vt1.6xlarge | 26 | 22 |
| vt1.24xlarge | 26 | 26 |
I.e., it's hard to explain <24, 22, 26> solely from number-of-accelerators considerations.
Environment
- Kubernetes version (use kubectl version):
- Driver version: Compiled manually (by docker build -t quay.io/rh_ee_mpatlaso/misc:aws-ebs-csi-drv-upstream -f Dockerfile .) from the head of the master branch of https://github.com/kubernetes-sigs/aws-ebs-csi-driver: