hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

"error performing RPC to server which is not safe to automatically retry" error when detaching a CSI volume can lead to Nomad job in unschedulable state #18174

Open zackelan opened 1 year ago

zackelan commented 1 year ago

Nomad version

Nomad v1.5.8
Revision b3a653dff306e57edcd63bce6c4a54c0918f2f3d

Operating system and Environment details

Ubuntu 22.04.1 LTS on AWS EC2

Issue

I've been testing out Nomad 1.5.8, and the changes from #17996 seem to be a big improvement in the reliability of our CSI volumes (thank you @tgross!).

However, in this testing I've found there's still a race condition / edge case that can leave a CSI volume in a "wedged" state: Nomad thinks the volume is still attached to a node that has been drained, and it refuses to reschedule the associated job onto a non-drained node.

Specifically, if the CSI controller process is restarted (in our case, to pick up new AWS creds; see below), there's a brief window of downtime during which Nomad's RPCs to it fail. If I understand CSI's idempotency model correctly, those RPCs should be retryable, but Nomad flags the error as not safe to retry and gives up immediately.
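
To illustrate the idempotency point: the CSI spec allows ControllerUnpublishVolume to be re-issued with the same arguments after a failure. Here's a minimal sketch of that call using the CSI spec's Go bindings (this is not Nomad's client code; the socket path and IDs are taken from the logs below purely for illustration):

package main

import (
	"context"
	"log"
	"time"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Plugin socket path as reported in the controller logs below
	// ("Listening for connections on address: /csi/csi.sock").
	conn, err := grpc.Dial("unix:///csi/csi.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// ControllerUnpublishVolume is idempotent: sending the same request
	// again after a transient failure (e.g. the plugin restarting) is
	// safe and succeeds once the volume is actually detached.
	req := &csi.ControllerUnpublishVolumeRequest{
		VolumeId: "vol-043832224eaa28a03",
		NodeId:   "i-0a44102e893045291",
	}
	if _, err := csi.NewControllerClient(conn).ControllerUnpublishVolume(ctx, req); err != nil {
		log.Printf("unpublish failed: %v", err)
	}
}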

Reproduction steps

1. Run a service job that claims a single-node-writer CSI volume (aws-ebs in our case).
2. Drain the node that allocation is running on while the CSI controller plugin is restarting (in our setup the controller restarts periodically to pick up new Vault-issued AWS credentials; see the job file below).
3. The CSIVolume.Unpublish RPC issued during the drain hits the controller's brief downtime and fails with code = Unavailable, and Nomad gives up rather than retrying.

Expected Result

Looking at a comment in csi_endpoint.go, it seems like an Unavailable response is intended to be retried. However, from my logs (below) it appears that the Unavailable response is treated as a fatal error and is not retried. I'd expect Nomad to retry the detach, release the volume claim, and schedule the replacement allocation on a non-drained node.
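
For comparison, here's a minimal sketch of the retry classification I'd expect (this is not Nomad's actual csi_endpoint.go code; detachVolume is a hypothetical stand-in for the controller-detach RPC): an Unavailable status from the plugin is treated as retryable and the call is re-attempted with backoff, instead of being surfaced as a fatal error.

package main

import (
	"context"
	"errors"
	"log"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// detachVolume is a hypothetical stand-in for the controller-detach RPC
// (CSI.ControllerDetachVolume in the client logs below).
func detachVolume(ctx context.Context, volID, nodeID string) error {
	return errors.New("not implemented")
}

// isRetryable reports whether a gRPC error from the plugin is safe to retry.
// Unavailable (e.g. the plugin restarting to pick up new credentials) should
// be retried because the unpublish RPC is idempotent.
func isRetryable(err error) bool {
	if st, ok := status.FromError(err); ok {
		switch st.Code() {
		case codes.Unavailable, codes.ResourceExhausted, codes.DeadlineExceeded:
			return true
		}
	}
	return false
}

// detachWithRetry re-attempts the detach with exponential backoff while the
// error is retryable and the context has not expired.
func detachWithRetry(ctx context.Context, volID, nodeID string) error {
	backoff := time.Second
	for {
		err := detachVolume(ctx, volID, nodeID)
		if err == nil || !isRetryable(err) {
			return err
		}
		select {
		case <-ctx.Done():
			return err
		case <-time.After(backoff):
			backoff *= 2
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	if err := detachWithRetry(ctx, "vol-043832224eaa28a03", "i-0a44102e893045291"); err != nil {
		log.Printf("detach ultimately failed: %v", err)
	}
}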

Actual Result

The output of nomad volume status shows the affected volume is still allocated to the drained node:

ID                   = nb-pg-b22942cc354f11ee9d8d2329f0db0616[0]
Name                 = nb-pg-b22942cc354f11ee9d8d2329f0db0616[0]
External ID          = vol-043832224eaa28a03
Plugin ID            = aws-ebs0
Provider             = ebs.csi.aws.com
Version              = v1.11.4
Schedulable          = true
Controllers Healthy  = 1
Controllers Expected = 1
Nodes Healthy        = 36
Nodes Expected       = 36
Access Mode          = single-node-writer
Attachment Mode      = file-system
Mount Options        = fs_type: xfs
Namespace            = default

Topologies
Topology  Segments
01        topology.ebs.csi.aws.com/zone=us-west-2a

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created   Modified
c8ac2ff1  5ecdf87e  postgres    0        stop     complete  1h5m ago  15m29s ago

The output of nomad job status shows the affected job has a replacement allocation stuck in pending status because of that lingering volume claim:

ID            = nb-pg-b22942cc354f11ee9d8d2329f0db0616
Name          = nb-pg-b22942cc354f11ee9d8d2329f0db0616
Submit Date   = 2023-08-07T18:24:53Z
Type          = service
Priority      = 30
Datacenters   = dev-zackelan-us-west-2
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
postgres    0       1         0        0       1         0     0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
8defbc07  4111be92  postgres    0        run      pending   17m58s ago  12m58s ago
c8ac2ff1  5ecdf87e  postgres    0        stop     complete  1h7m ago    17m47s ago

Job file

We deploy the CSI controller process (as well as the per-node plugin process) as Nomad jobs of their own.

We give the CSI controller the AWS credentials it needs via Vault and a template stanza with change_mode = "restart", which results in periodic restarts of the controller to pick up new credentials.

Relevant snippet of the Nomad job for the CSI controller process:

vault {
  policies    = ["derive-aws-creds"]
  change_mode = "restart"
}

template {
  env         = true
  change_mode = "restart"
  destination = "${NOMAD_SECRETS_DIR}/aws.env"

  data = <<-EOH
  {{ with secret "aws/creds/${var.aws.iam_role}" -}}
    AWS_ACCESS_KEY_ID="{{ .Data.access_key | toJSON }}"
    AWS_SECRET_ACCESS_KEY="{{ .Data.secret_key | toJSON }}"
    AWS_SESSION_TOKEN="{{ .Data.security_token | toJSON }}"
  {{- end -}}
  EOH
}

Nomad Client logs

2023-08-07T19:14:48Z: Ctrl-C to stop monitoring: will not cancel the node drain
2023-08-07T19:14:48Z: Node "5ecdf87e-9a39-0c70-b52d-82f4451cb81a" drain strategy set
2023-08-07T19:14:49Z: Alloc "feae0b42-4c03-dd14-17d8-95c237a643a5" marked for migration
2023-08-07T19:14:49Z: Alloc "c8ac2ff1-adc5-1987-5d4a-a6782d08b369" marked for migration
2023-08-07T19:14:49Z: Alloc "99d0b272-3292-02f4-7093-dc43af4e987d" marked for migration
2023-08-07T19:14:49Z: Alloc "be5919a6-c4e2-75f9-205c-2c6aff7e7f36" marked for migration
2023-08-07T19:14:49Z: Alloc "be5919a6-c4e2-75f9-205c-2c6aff7e7f36" draining
2023-08-07T19:14:49Z: Alloc "feae0b42-4c03-dd14-17d8-95c237a643a5" draining
2023-08-07T19:14:49Z: Alloc "c8ac2ff1-adc5-1987-5d4a-a6782d08b369" draining
2023-08-07T19:14:49Z: Alloc "99d0b272-3292-02f4-7093-dc43af4e987d" draining

<snip a bunch of unrelated logs from the tasks being drained>

2023-08-07T19:14:49Z: Alloc "be5919a6-c4e2-75f9-205c-2c6aff7e7f36" status running -> complete
2023-08-07T19:14:50Z: Drain complete for node 5ecdf87e-9a39-0c70-b52d-82f4451cb81a
2023-08-07T19:14:50Z: Alloc "99d0b272-3292-02f4-7093-dc43af4e987d" status running -> complete
2023-08-07T19:14:50Z: Alloc "feae0b42-4c03-dd14-17d8-95c237a643a5" status running -> complete
2023-08-07T19:14:50Z: Alloc "c8ac2ff1-adc5-1987-5d4a-a6782d08b369" status running -> complete
2023-08-07T19:14:50Z: All allocations on node "5ecdf87e-9a39-0c70-b52d-82f4451cb81a" have stopped
2023-08-07T19:15:00.100Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: could not detach from controller: controller detach volume: CSI.ControllerDetachVolume: rpc error: code = Unavailable desc = error reading from server: EOF" rpc=CSIVolume.Unpublish server=10.29.126.200:4647
2023-08-07T19:15:00.100Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: could not detach from controller: controller detach volume: CSI.ControllerDetachVolume: rpc error: code = Unavailable desc = error reading from server: EOF" rpc=CSIVolume.Unpublish server=10.29.126.200:4647
| hook "csi_hook" failed: 1 error occurred:
error=
2023-08-07T19:15:00.100Z [ERROR] client.alloc_runner: postrun failed: alloc_id=c8ac2ff1-adc5-1987-5d4a-a6782d08b369
| \t* rpc error: rpc error: could not detach from controller: controller detach volume: CSI.ControllerDetachVolume: rpc error: code = Unavailable desc = error reading from server: EOF
|

CSI controller logs

Here are concurrent logs from the CSI controller covering the same window. They show the node that was drained, which had four CSI volumes attached, corresponding to the four allocations drained above.

Three of the volumes were detached successfully; one (vol-043832224eaa28a03) was not. That volume's detachment was interrupted by the controller restart, and its volume ID correlates with the alloc ID (c8ac2ff1) from the "postrun failed" error in the Nomad client logs above.

I0807 19:14:49.803613       1 controller.go:383] ControllerUnpublishVolume: called with args {VolumeId:vol-00a94bcf268b51486 NodeId:i-0a44102e893045291 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 19:14:50.227798       1 cloud.go:671] Waiting for volume "vol-00a94bcf268b51486" state: actual=detaching, desired=detached
I0807 19:14:51.286812       1 cloud.go:671] Waiting for volume "vol-00a94bcf268b51486" state: actual=detaching, desired=detached
I0807 19:14:53.145811       1 cloud.go:671] Waiting for volume "vol-00a94bcf268b51486" state: actual=detaching, desired=detached
I0807 19:14:56.454050       1 controller.go:397] [Debug] ControllerUnpublishVolume: volume vol-00a94bcf268b51486 detached from node i-0a44102e893045291
I0807 19:14:56.456308       1 controller.go:383] ControllerUnpublishVolume: called with args {VolumeId:vol-043832224eaa28a03 NodeId:i-0a44102e893045291 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 19:14:56.847945       1 cloud.go:671] Waiting for volume "vol-043832224eaa28a03" state: actual=detaching, desired=detached
I0807 19:14:57.917403       1 cloud.go:671] Waiting for volume "vol-043832224eaa28a03" state: actual=detaching, desired=detached
I0807 19:14:59.785453       1 cloud.go:671] Waiting for volume "vol-043832224eaa28a03" state: actual=detaching, desired=detached
I0807 19:15:00.212534       1 driver.go:73] Driver: ebs.csi.aws.com Version: v1.11.4
I0807 19:15:00.212639       1 controller.go:81] [Debug] Retrieving region from metadata service
I0807 19:15:00.212744       1 metadata.go:85] retrieving instance data from ec2 metadata
I0807 19:15:00.215213       1 metadata.go:92] ec2 metadata is available
I0807 19:15:00.216316       1 metadata_ec2.go:25] regionFromSession
I0807 19:15:00.219671       1 driver.go:143] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0807 19:15:19.423401       1 controller.go:383] ControllerUnpublishVolume: called with args {VolumeId:vol-08ab897a2b3dd92ef NodeId:i-0a44102e893045291 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 19:15:19.844205       1 cloud.go:671] Waiting for volume "vol-08ab897a2b3dd92ef" state: actual=detaching, desired=detached
I0807 19:15:20.906344       1 cloud.go:671] Waiting for volume "vol-08ab897a2b3dd92ef" state: actual=detaching, desired=detached
I0807 19:15:22.759413       1 cloud.go:671] Waiting for volume "vol-08ab897a2b3dd92ef" state: actual=detaching, desired=detached
I0807 19:15:26.061428       1 controller.go:397] [Debug] ControllerUnpublishVolume: volume vol-08ab897a2b3dd92ef detached from node i-0a44102e893045291
I0807 19:15:26.064167       1 controller.go:383] ControllerUnpublishVolume: called with args {VolumeId:vol-0a692f7a63bd5a09a NodeId:i-0a44102e893045291 Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0807 19:15:26.419407       1 cloud.go:671] Waiting for volume "vol-0a692f7a63bd5a09a" state: actual=detaching, desired=detached
I0807 19:15:27.478103       1 cloud.go:671] Waiting for volume "vol-0a692f7a63bd5a09a" state: actual=detaching, desired=detached
I0807 19:15:29.341109       1 cloud.go:671] Waiting for volume "vol-0a692f7a63bd5a09a" state: actual=detaching, desired=detached
I0807 19:15:32.643958       1 controller.go:397] [Debug] ControllerUnpublishVolume: volume vol-0a692f7a63bd5a09a detached from node i-0a44102e893045291
lgfa29 commented 1 year ago

Thank you for the detailed report @zackelan. At first sight your analysis of it being a retryable error makes sense, but we will need some time to investigate it further.

I have placed this issue on our board so it can be properly roadmapped.

Let us know if you find any extra relevant information 🙂