While testing the patches for #10927 I've determined this isn't really specific to node drain with migrations, but can happen any time controller unpublish takes longer than the time it takes for an allocation to be rescheduled and placed on the new client node.
The CSI specification has this to say about concurrency:
> In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call "in-flight" per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The plugin SHOULD handle this as gracefully as possible. The error code ABORTED MAY be returned by the plugin in this case (see the Error Scheme section for details).
In Nomad's implementation of the unpublish workflow, we push unpublish RPCs from the client and then from the server once the volumewatcher notices a checkpoint. So we're violating the "should" aspects of the above which the plugins need to (and do!) tolerate. But this becomes a problem when the scheduler tries to replace an existing allocation if the controller unpublish RPCs have not yet completed.
I reproduced this today using the gcp-compute-persistent-disk-csi-driver and the following jobspec consuming a volume:
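The jobspec itself wasn't preserved in this capture. As a rough reconstruction of the shape used, assuming the per-alloc `webdata` volume that appears in the error below (names, count, and image are illustrative):

```hcl
job "web" {
  datacenters = ["dc1"]

  group "web" {
    # count > 1 plus per_alloc yields claims on webdata[0], webdata[1], ...
    count = 2

    volume "webdata" {
      type            = "csi"
      source          = "webdata"
      attachment_mode = "file-system"
      access_mode     = "single-node-writer"
      per_alloc       = true
    }

    task "http" {
      driver = "docker"

      volume_mount {
        volume      = "webdata"
        destination = "/srv"
      }

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-f", "-p", "8001", "-h", "/srv"]
      }
    }
  }
}
```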
On a node drain or job update that forces a reschedule, the replacement alloc will have this event from the `csi_hook`:
```
Recent Events:
Time                       Type           Description
2022-01-28T11:49:27-05:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: could not claim volume webdata[1]: rpc error: rpc error: controller publish: attach volume: controller attach volume: CSI.ControllerAttachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-1643388564268-5d6a73799e68b-74228985-80ef1d67 failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/$redacted/zones/us-east1-c/disks/csi-disk-1' is already being used by 'projects/$redacted/zones/us-east1-c/instances/client-2'
2022-01-28T11:49:16-05:00  Received       Task received by client
```
In the GCP CSI controller plugin logs for that disk, we see the following sequence:
1. `ControllerUnpublish` gets called.
2. `ControllerUnpublish` gets called 3 more times, all within the same second (which is weird). But those are rejected with "An operation with the given Volume ID already exists", as we'd expect.
3. `ControllerPublishVolume` gets called for the new client and fails because the disk is still in use.
4. `ControllerUnpublish` completes.
5. `ControllerUnpublishVolume` gets called again, which immediately returns because the disk never got attached to that client.

Seeing this in more detail makes me realize that in a typical case we should be able to simply retry the `csi_hook`'s initial claim RPC for a little while to see if the volume becomes available. We should probably thread this workflow through the volumewatcher as well so that we're only sending one request at a time.
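To make the retry idea concrete, here's a minimal sketch in Go, assuming a hypothetical `claimWithRetry` helper wrapping the `csi_hook`'s claim RPC and matching on the error string; this is illustrative, not Nomad's actual implementation:

```go
package claim

import (
	"context"
	"fmt"
	"strings"
	"time"
)

// claimWithRetry retries a claim RPC until it succeeds, the error is
// not retryable, or the context expires. The claim function stands in
// for the csi_hook's claim RPC; all names here are hypothetical.
func claimWithRetry(ctx context.Context, claim func() error) error {
	backoff := time.Second
	for {
		err := claim()
		if err == nil {
			return nil
		}
		// Only retry the "already in use" error that signals the
		// previous allocation's controller unpublish is still in
		// flight; fail fast on everything else.
		if !strings.Contains(err.Error(), "already in use") {
			return err
		}
		select {
		case <-ctx.Done():
			// The caller bounds total retry time (the "n minutes"
			// mentioned below) so the alloc can eventually be failed.
			return fmt.Errorf("giving up on volume claim: %w", err)
		case <-time.After(backoff):
		}
		if backoff < 16*time.Second {
			backoff *= 2
		}
	}
}
```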
I've been asked some questions out-of-band on this issue and want to clarify some details.
Nomad makes volume claims from the client (in the `csi_hook`), which is different than how k8s does it b/c that's the only way the Nomad architecture allowed us to do so without blocking the whole plan applier for minutes at a time.

My thinking at one point was to add a new claim state for "requesting claim" that would trigger the volumewatcher, and then have the claim RPC poll for the update. That would serialize the operations, but not necessarily in the correct order! Working around that problem was creating a bunch of complexity around maintaining the state across leader elections. K8s controller loops don't need to care about leader elections b/c they use external state (etcd), which simplifies this specific problem (mostly at the cost of performance).
So the proposed fix I'm working on is in two parts:
1. Have the `csi_hook` on the client retry the RPC that handles claims if we get an "already in use" error (probably for n minutes so we can fail an alloc eventually?). This likely fixes the specific bug by itself, but doesn't bring us into spec compliance and doesn't give us any data protection in the face of buggy plugins that return the wrong errors.
2. Have the RPC that handles that claim verify the claim's viability before sending any RPCs to plugins, and return an "already in use" error for the client to retry (see the sketch after this list). This prevents us from sending RPCs out-of-order if we know they're going to fail anyways.
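A rough sketch of what the claim-time check in part 2 could look like, with hypothetical types standing in for Nomad's volume state:

```go
package claim

import "errors"

// Volume is a pared-down, hypothetical stand-in for Nomad's volume
// state; the field names are illustrative only.
type Volume struct {
	AccessMode  string
	WriteClaims map[string]struct{} // alloc IDs holding write claims
}

// checkClaimViability runs in the server's claim RPC handler before
// any ControllerPublishVolume RPC is sent to a plugin. Returning an
// "already in use" error here lets the client retry instead of the
// server sending plugin RPCs out of order.
func checkClaimViability(vol *Volume, allocID string) error {
	switch vol.AccessMode {
	case "single-node-writer", "single-node-reader-only":
		for claimant := range vol.WriteClaims {
			if claimant != allocID {
				return errors.New("volume is already in use and cannot be claimed")
			}
		}
	}
	return nil
}
```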
Ok, looks like we actually do have this already, but it uses the same test as the feasibility checker which has to be somewhat less strict because it needs to account for removing the previous version of the job. So we need to split the behaviors into a schedule-time and a claim-time check.
I've got a very rough first draft of this fix in `csi-enforce-usage-at-claim-time`.
We have some existing issues around CSI and node drain which we should be closing out in the next few weeks, but this issue is for tracking a specific detail of the problem that we're not planning on fixing in the 0.12.x timeframe.
Migrating allocations during a node drain waits for the allocation to be placed on the new node before shutting down the drained allocation. Because the allocation on the old node is still claiming the volume (and in fact has it attached), the new allocation can't start and will always result in a failure for single-attachment volumes. But once the old alloc has stopped and the volume has been detached, the next attempt to place will succeed. Note that migrating an application with a specific single-attachment volume to a new node would always create an outage for the application no matter what we could come up with in Nomad. So avoiding the error is a nice-to-have in that circumstance, but it doesn't fundamentally change anything.
The error will look like this, with some details depending on the specific storage provider:
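The original error sample from this issue body isn't preserved in this capture, but the GCP reproduction above shows the shape of it; everything after the `Internal` error code is provider-specific:

```
failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: could not claim volume webdata[1]:
rpc error: controller publish: attach volume: controller attach volume: CSI.ControllerAttachVolume:
controller plugin returned an internal error, check the plugin allocation logs for more information:
rpc error: code = Internal desc = unknown Attach error: ... (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE):
The disk resource '...' is already being used by '...'
```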
The only way to avoid this problem is to change the migration code so that allocations with CSI volumes get handled specially and get stopped before migrating. This gets complicated for the update stanza and rolling updates. It also conflicts with the `migrate` option for ephemeral disk, so there are some design tradeoffs to make here where an allocation wouldn't be able to have both CSI volumes and ephemeral disk that migrates.