linode / linode-blockstorage-csi-driver

Container Storage Interface (CSI) Driver for Linode Block Storage
Apache License 2.0

Linodes cannot mount more than 7 PVCs on instances with 8GB or less. #182

Closed. codestation closed this issue 2 months ago.

codestation commented 2 months ago

Bug Reporting

Small Linodes (I tested 4GB and 8GB) cannot have more than 7 attached PVCs per node. According to the documentation, local disks and block storage are counted against the maximum number of volumes that can be attached to a node.

Expected Behavior

The pod should be relocated to another node, since the maximum number of PVCs on the current node has been reached.

Actual Behavior

This gets emitted non-stop for the pod trying to attach an 8th volume to the node. The pod remains on the node.

AttachVolume.Attach failed for volume "pvc-xxxxxxxxxxx" : rpc error: code = ResourceExhausted desc = max number of volumes (8) already attached to instance

Steps to Reproduce the Problem

  1. Prepare 3 nodes with no bound PVCs.
  2. Create a StatefulSet, single replica, 7 volumes. This pod should run.
  3. Create another StatefulSet with many replicas, each with a single volume. Try to bind 21 to 24 volumes in total.
  4. Eventually, a pod will be scheduled on a node where it tries to bind an 8th volume, and it will fail to run.

Additional Notes

Related to #154, and probably reintroduced in v0.7.0. I haven't tested it, but the problem could also apply to bigger Linodes that allow more volumes, if attached volumes are counted incorrectly there as well.



nesv commented 2 months ago

Thank you for filing this bug report, @codestation!

As you have pointed out, since you are using Linodes with <= 8GB of RAM, the total number of volumes that can be attached is 8; this includes locally attached "instance" disks (typically only 1, used for boot and root), and 7 additional instance disks and/or block storage volumes.

In a 3-node cluster, with the instance sizes you have specified, I would only expect you to be able to attach a total of 21 volumes across the cluster.

and probably reintroduced in v0.7.0. Haven't tested but it could also apply to bigger Linodes that allows more volumes but counting attached volumes incorrectly.

When these changes were tested, I used an array of instance sizes from the 1GB "Nanode" all the way up to a 96GB Linode, and in all cases, the tests were successfully able to attach the maximum expected number of volumes, minus 1 to account for the local instance disk. I also made sure to set the number of replicas to be 1 more than the expected number of attachments per node (the statefulsets targeted nodes of different instance sizes), and in all cases, that additional replica pod was scheduled to a node, but unable to start, due to the missing PVC, which could not be attached.

Prior to v0.7.0, there was a hard maximum of 8 volumes total (instance disks + block storage volumes) that could be attached to any node. v0.7.0 changed the way block storage volumes were attached, to align with the functionality supported by the Linode API, and allowed >8 volumes to be attached to nodes with >= 16GB of RAM. As part of that change, a pre-flight check was also added that prevents attempting to attach a volume if the maximum number of attachments would be exceeded; previously, there was no check, and an unactionable error from the Linode API was returned directly to the container orchestrator (CO).
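For illustration, the pre-flight check boils down to something like the following. This is a simplified sketch, not the driver's exact code; `attachedCount` and `limit` stand in for values the driver already computes elsewhere.

```go
import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// checkAttachLimit refuses an attach up front instead of letting the
// Linode API reject it with an error the CO cannot act on.
// attachedCount includes local instance disks and block storage volumes.
func checkAttachLimit(attachedCount, limit int) error {
	if attachedCount >= limit {
		return status.Errorf(codes.ResourceExhausted,
			"max number of volumes (%d) already attached to instance", limit)
	}
	return nil
}
```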

The volume attachment limits are currently documented in the release notes for v0.7.0, but they should also be present in the README for this repository. I will add an issue to track this. :slightly_smiling_face:

The pod should be relocated to another node since the max number of PVC has been reached.

I don't think rescheduling pods is in the domain of the CSI driver. In my work on this driver, I have been bringing it into compliance with the latest version of the CSI specification, which indicates that if a volume cannot be attached to a node, the RESOURCE_EXHAUSTED error code should be returned. If I have misinterpreted the specification, that is definitely grounds for a bug fix. :smile:

According to the documentation, local disks and block storage are counted against the maximum number of volumes that can be attached to a node.

Correct, local "instance" disks and block device volumes are counted against the limit of attached volumes. However, that documentation does not indicate that the maximum number of volumes scales with the amount of memory presented to the instance, up to a maximum of 64 total volume attachments; likely because these numbers will change. They are internal to the virtualization platform at Linode, and they are copied/surfaced in this driver's code to preempt any attachments that would fail.
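Very roughly, that scaling can be pictured like this. Treat it as a hedged sketch: the thresholds below are illustrative, not the platform's authoritative values.

```go
// maxVolumeAttachments sketches the scaling described above: an
// 8-attachment baseline for small instances, growing with memory up to a
// hard cap of 64 total attachments.
func maxVolumeAttachments(memoryGB int) int {
	const (
		baseline = 8  // baseline for smaller instances
		hardCap  = 64 // absolute maximum attachments
	)
	if memoryGB <= 16 {
		return baseline
	}
	if memoryGB > hardCap {
		return hardCap
	}
	return memoryGB
}
```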


In your reproduction steps, exactly how many volumes are being created?

codestation commented 2 months ago

In my repro I got to 9 attached volumes before getting stuck. I just tried the following in Linode.

According to the comment on max_volumes_per_node in NodeGetInfoResponse, it says "Maximum number of volumes that controller can publish to the node." So I assume that if maxVolumeAttachments returns 8, then the controller expects to be able to attach 8 volumes in total, but this is false since the boot volume counts as 1, so really only 7 volumes can be attached (and probably fewer in the future, now that swap support is in beta for k8s).

IMO the solution could be either that the NodeGetInfo method returns volumes_per_node - local_volumes, or that the controller is made aware of the local volumes (not sure if that is possible).
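Roughly, the first option would look something like this. The NodeServer type, the countInstanceDisks helper, and fields like ns.nodeID and ns.maxAttachments are made up for illustration; this is not the driver's actual code.

```go
import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// NodeGetInfo advertises one slot fewer per local instance disk, so the
// controller never tries to publish more volumes than can actually attach.
func (ns *NodeServer) NodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	// Hypothetical helper: query the Linode API for this node's local disks.
	localDisks, err := ns.countInstanceDisks(ctx)
	if err != nil {
		return nil, err
	}
	return &csi.NodeGetInfoResponse{
		NodeId:            ns.nodeID,
		MaxVolumesPerNode: int64(ns.maxAttachments - localDisks),
	}, nil
}
```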

I am gonna try to test the first option in the next days to see how it goes (fork the repo, use a naive maxVolumeAttachments - 1, then deploy under a different storage class name).

nesv commented 2 months ago

According to the comment on max_volumes_per_node in NodeGetInfoResponse, it says "Maximum number of volumes that controller can publish to the node." So I assume that if maxVolumeAttachments returns 8, then the controller expects to be able to attach 8 volumes in total, but this is false since the boot volume counts as 1, so really only 7 volumes can be attached (and probably fewer in the future, now that swap support is in beta for k8s).

That sounds right to me.

IMO the solution could be either that the NodeGetInfo method returns volumes_per_node - local_volumes, or that the controller is made aware of the local volumes (not sure if that is possible).

It is possible to get the number of instance disks and volumes currently attached to an instance through the Linode API, so this could be done by both the controller and the node plugin.

Looking through the code, there is the LinodeControllerServer.canAttach method. That method is likely where any changes should go to fix this off-by-one error. I can get a fix for that whipped up pretty quickly.
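For reference, a rough sketch of what that check could look like is below. The signature, helper names, and parameters are illustrative, not the actual canAttach code.

```go
import (
	"context"
	"fmt"

	"github.com/linode/linodego"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// canAttach counts both local instance disks and block storage volumes,
// so the boot disk is no longer missed when deciding whether another
// volume fits under the instance's attachment limit.
func canAttach(ctx context.Context, client *linodego.Client, linodeID, limit int) error {
	disks, err := client.ListInstanceDisks(ctx, linodeID, nil)
	if err != nil {
		return fmt.Errorf("list instance disks: %w", err)
	}
	volumes, err := client.ListInstanceVolumes(ctx, linodeID, nil)
	if err != nil {
		return fmt.Errorf("list instance volumes: %w", err)
	}
	if len(disks)+len(volumes) >= limit {
		return status.Errorf(codes.ResourceExhausted,
			"max number of volumes (%d) already attached to instance", limit)
	}
	return nil
}
```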

nesv commented 2 months ago

@codestation I have just merged in the patch that will hopefully fix this bug. Thank you for being patient while we got this sorted out, and thank you for filing a bug! :smile:

EDIT: The workflow to cut the release just finished. Please give v0.8.3 a whirl!