CSI allocations may fail first placement after drain/reschedule

tgross commented 4 years ago

We have some existing issues around CSI and node drain which we should be closing out in the next few weeks, but this issue is for tracking a specific detail of the problem that we're not planning on fixing in the 0.12.x timeframe.

Migrating allocations during a node drain waits for the allocation to be placed on the new node before shutting down the drained allocation. Because the allocation on the old node is still claiming the volume (and in fact has it attached), the new allocation can't start and will always result in a failure for single-attachment volumes. But once the old alloc has stopped and the volume has been detached, the next attempt to place will succeed. Note that migrating an application with a specific single-attachment volume to a new node would always create an outage for the application no matter what we could come up with Nomad. So avoiding the error is a nice-to-have in that circumstance but it doesn't fundamentally change anything.

The error will look like this, with some details depending on the specific storage provider:

2020-08-07T14:55:35-04:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: rpc error: rpc error: controller publish: attach volume: rpc error: controller attach volume: volume "vol-07b2ea432ef9a8dce" is already published at node "i-0685ed47dc21535db" but with capabilities or a read_only setting incompatible with this request: rpc error: code = AlreadyExists desc = Resource already exists

The only way to avoid this problem is to change the migration code so that allocations with CSI volumes get handled specially and get stopped before migrating. This gets complicated for the update stanza and rolling updates. It also conflicts with the migrate option for ephemeral disk, so there's some design tradeoffs to make here where an allocation wouldn't be able to have both CSI volumes and ephemeral disk that migrates.

tgross commented 2 years ago

While testing the patches for #10927 I've determined this isn't really specific to node drain with migrations, but can happen any time controller unpublish takes longer than the time it takes for an allocation to be rescheduled and placed on the new client node.

The CSI specification has this to say about concurrency:

In general the Cluster Orchestrator (CO) is responsible for ensuring that there is no more than one call “in-flight” per volume at a given time. However, in some circumstances, the CO MAY lose state (for example when the CO crashes and restarts), and MAY issue multiple calls simultaneously for the same volume. The plugin SHOULD handle this as gracefully as possible. The error code ABORTED MAY be returned by the plugin in this case (see the Error Scheme section for details).

In Nomad's implementation of the unpublish workflow, we push unpublish RPCs from the client and then from the server once the volumewatcher notices a checkpoint. So we're violating the "should" aspects of the above which the plugins need to (and do!) tolerate. But this becomes a problem when the scheduler tries to replace an existing allocation if the controller unpublish RPCs have not yet completed.

I reproduced this today using the gcp-compute-persistent-disk-csi-driver and the following jobspec consuming a volume:

jobspec

```hcl job "httpd" { datacenters = ["dc1"] group "web" { count = 2 volume "csi_data" { type = "csi" read_only = false source = "webdata" access_mode = "single-node-writer" attachment_mode = "file-system" per_alloc = true } network { mode = "bridge" port "www" { to = 8001 } } task "http" { driver = "docker" config { image = "busybox:1" command = "httpd" args = ["-v", "-f", "-p", "8001", "-h", "/srv"] ports = ["www"] } volume_mount { volume = "csi_data" destination = "/srv" read_only = false } resources { cpu = 128 memory = 128 } } } } ```

On a node drain or job update that forces a reschedule, the replacement alloc will have this event from the csi_hook:

Recent Events:
Time                       Type           Description
2022-01-28T11:49:27-05:00  Setup Failure  failed to setup alloc: pre-run hook "csi_hook" failed: claim volumes: could not claim volume webdata[1]: rpc error: rpc error: controller publish: attach volume: controller attach volume: CSI.ControllerAttachVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-1643388564268-5d6a73799e68b-74228985-80ef1d67 failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/$redacted/zones/us-east1-c/disks/csi-disk-1' is already being used by 'projects/$redacted/zones/us-east1-c/instances/client-2'
2022-01-28T11:49:16-05:00  Received       Task received by client

In the GCP CSI controller plugin logs for that disk, we see the following sequence:

16:52:43.17: ControllerUnpublish gets called
16:52:43.18-19: ControllerUnpublish gets called 3 more times all within the same second (which is weird). But those are rejected with "An operations with the given Volume ID already exists", as we'd expect.
16:52:43.29: the controller plugin logs that it's detaching the disk from the client
16:52:46.46: ControllerPublishVolume gets called for the new client and fails because the disk is still in use.
16:52:46.75: the original ControllerUnpublish completes
16:52:46.76: the new (failed) alloc ends up calling ControllerUnpublishVolume again, which immediately returns because the disk never got attached to that client
and then the next allocation succeeds

controller logs

``` I0128 16:52:43.176189 1 utils.go:55] /csi.v1.Controller/ControllerUnpublishVolume called with request: volume_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1" node_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1" I0128 16:52:43.184631 1 utils.go:55] /csi.v1.Controller/ControllerUnpublishVolume called with request: volume_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1" node_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1" E0128 16:52:43.184711 1 utils.go:58] /csi.v1.Controller/ControllerUnpublishVolume returned with error: rpc error: code = Aborted desc = An operation with the given Volume ID projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1/projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1 already exists I0128 16:52:43.188513 1 utils.go:55] /csi.v1.Controller/ControllerUnpublishVolume called with request: volume_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1" node_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1" E0128 16:52:43.188590 1 utils.go:58] /csi.v1.Controller/ControllerUnpublishVolume returned with error: rpc error: code = Aborted desc = An operation with the given Volume ID projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1/projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1 already exists I0128 16:52:43.191861 1 utils.go:55] /csi.v1.Controller/ControllerUnpublishVolume called with request: volume_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1" node_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1" E0128 16:52:43.191937 1 utils.go:58] /csi.v1.Controller/ControllerUnpublishVolume returned with error: rpc error: code = Aborted desc = An operation with the given Volume ID projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1/projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1 already exists I0128 16:52:43.292029 1 gce-compute.go:469] Detaching disk csi-disk-1 from client-1 E0128 16:52:46.466234 1 utils.go:58] /csi.v1.Controller/ControllerPublishVolume returned with error: rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-1643388762968-5d6a74371d288-d8b28028-8a52c70b failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1' is already being used by 'projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1' I0128 16:52:46.754231 1 controller.go:391] ControllerUnpublishVolume succeeded for disk Key{"csi-disk-1", zone: "us-east1-c"} from node projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1 I0128 16:52:46.762944 1 utils.go:55] /csi.v1.Controller/ControllerUnpublishVolume called with request: volume_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1" node_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1" I0128 16:52:46.831505 1 controller.go:382] ControllerUnpublishVolume succeeded for disk Key{"csi-disk-1", zone: "us-east1-c"} from node projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1. Already not attached. I0128 16:53:17.127661 1 utils.go:55] /csi.v1.Controller/ControllerPublishVolume called with request: volume_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/disks/csi-disk-1" node_id:"projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1" volume_capability: access_mode: > I0128 16:53:17.127733 1 gce-compute.go:99] Repairing potentially underspecified volume key Key{"csi-disk-1", zone: "us-east1-c"} I0128 16:53:17.127755 1 gce-compute.go:181] Getting disk Key{"csi-disk-1", zone: "us-east1-c"} I0128 16:53:17.286926 1 gce-compute.go:442] Attaching disk Key{"csi-disk-1", zone: "us-east1-c"} to client-1 I0128 16:53:20.768843 1 gce-compute.go:568] Waiting for attach of disk csi-disk-1 to instance client-1 to complete... I0128 16:53:25.769038 1 gce-compute.go:571] Polling for attach of disk csi-disk-1 to instance client-1 to complete for 5.000154359s I0128 16:53:25.769075 1 gce-compute.go:181] Getting disk Key{"csi-disk-1", zone: "us-east1-c"} I0128 16:53:25.866429 1 controller.go:331] ControllerPublishVolume succeeded for disk Key{"csi-disk-1", zone: "us-east1-c"} to instance projects/hc-5c8332f181ee487c8483a4d3bb2/zones/us-east1-c/instances/client-1 ```

Seeing this in more detail makes me realize that in a typical case we should be able to simply retry the csi_hook's initial claim RPC for a little while to see if the volume becomes available. We should probably thread this workflow through volumewatcher as well so that we're only sending one request at a time.

tgross commented 2 years ago

I've been asked some questions out-of-band on this issue and want to clarify some details.

We make requests for claims from the client after placement (in the csi_hook), which is different than how k8s does it b/c that's the only way the Nomad architecture allowed us to do so without blocking the whole plan applier for minutes at a time.
This bug is because the client-driven claims don't serialize the operations for a volume the way the spec requires.
For a single-writer volume this means that we try to attach a volume that hasn't been detached and fail.
For a multi-writer volume, everything works just fine, but then you have multiple writers (as the user has requested! :grin: )
(And the same applies to single-reader and multi-reader in the case of read-only volumes.)

My thinking at one point was to add a new claim state for "requesting claim" that would trigger the volumewatcher, and then have the claim RPC poll for the update. That would serialize the operations but not necessarily in the correct order! Working around that problem was creating a bunch of complexity around maintaining the state across leader elections. K8s controller loops don't need to care about leader elections b/c they use external state (etcd), which simplifies this specific problem (mostly at the cost of performance).

So the proposed fix I'm working on is in two parts:

Have the csi_hook on the client retry the RPC that handles claims if we get an "already in use" error (probably for n minutes so we can fail an alloc eventually?). This likely fixes the specific bug by itself but doesn't bring us into spec compliance and doesn't give us any data protection in the face of buggy plugins that return the wrong errors.
Have the RPC that handles that claim verify the claim viability before sending any RPCs to plugins, and return an "already in use" error for the client to retry. This prevents us from sending RPCs out-of-order if we know they're going to fail anyways.

tgross commented 2 years ago

Have the RPC that handles that claim verify the claim viability before sending any RPCs to plugins, and return an "already in use" error for the client to retry. This prevents us from sending RPCs out-of-order if we know they're going to fail anyways.

Ok, looks like we actually do have this already, but it uses the same test as the feasibility checker which has to be somewhat less strict because it needs to account for removing the previous version of the job. So we need to split the behaviors into a schedule-time and a claim-time check.

I've got a very rough first draft of this fix in csi-enforce-usage-at-claim-time.

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

hashicorp / nomad

CSI allocations may fail first placement after drain/reschedule #8609