Thank you for the detailed report @zackelan. At first sight your analysis of it being a retryable error makes sense, but we will need some time to investigate it further.
I have placed this issue in our board so it can be properly roadmapped.
Let us know if you find any extra relevant information 🙂
Nomad version
Nomad v1.5.8
Operating system and Environment details
Ubuntu 22.04.1 LTS on AWS EC2
Issue
I've been testing out Nomad 1.5.8; the changes from #17996 seem to be a big improvement in the reliability of our CSI volumes (thank you @tgross!)
however, in this testing I've found there's still a race condition / edge case that can lead to a CSI volume getting into a "wedged" state where Nomad thinks the volume is still allocated to a node that's been drained, and refuses to reschedule the associated job to a non-drained node.
specifically, if the CSI controller process is restarted (in our case, to pick up new AWS creds, see below) there's a brief window of downtime where Nomad's RPCs to it will fail. those RPCs should be retryable, if I understand CSI's idempotency model correctly; however, Nomad is flagging the error as not safe to retry and giving up immediately.
Reproduction steps
1. have a Nomad client node with one or more CSI volumes mounted (more volumes should increase the likelihood of triggering the bug); a sketch of a minimal volume spec follows this list
2. drain that Nomad node
3. (race condition here) restart the CSI controller process at just the right moment, such that it causes the CSI postrun hook to fail its `ControllerDetachVolume` RPC to the CSI controller
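for concreteness, here's a minimal sketch of the kind of volume spec and mount involved; the plugin id, external id, image, and mount path below are placeholders rather than our real values:

```hcl
# volume.hcl - registered with `nomad volume register volume.hcl`
id          = "database"
name        = "database"
type        = "csi"
plugin_id   = "aws-ebs"                # placeholder plugin id
external_id = "vol-0123456789abcdef0"  # placeholder EBS volume ID

capability {
  access_mode     = "single-node-writer"
  attachment_mode = "file-system"
}

mount_options {
  fs_type = "ext4"
}
```

and the consuming job mounts it in a task group along these lines:

```hcl
group "app" {
  volume "database" {
    type            = "csi"
    source          = "database"
    access_mode     = "single-node-writer"
    attachment_mode = "file-system"
  }

  task "app" {
    driver = "docker"

    config {
      image   = "busybox:1.36"
      command = "sleep"
      args    = ["86400"]
    }

    volume_mount {
      volume      = "database"
      destination = "/srv/data"
    }
  }
}
```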
Expected Result
looking at a comment in `csi_endpoint.go`, it seems like an `Unavailable` response is intended to be retried. however, from my logs (below) it appears that the `Unavailable` response is being treated as a fatal error that is not retried.
Actual Result
- `nomad volume status` shows the affected volume is still allocated to the drained node
- `nomad job status` shows the affected job has an allocation in `pending` status due to that hanging volume allocation
Job file
we deploy the CSI controller process (as well as the per-node process) as Nomad jobs of their own.
we give the CSI controller the AWS credentials it needs using Vault and a template with `change_mode = "restart"`, which results in periodic restarts of the controller to pick up new credentials.
relevant snippet of the Nomad job for the CSI controller process:
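(this is a minimal sketch of that wiring rather than our exact job; the plugin image and version, plugin id, Vault policy, and secrets path are placeholders)

```hcl
job "plugin-aws-ebs-controller" {
  datacenters = ["dc1"]

  group "controller" {
    task "plugin" {
      driver = "docker"

      config {
        # placeholder image/version for the EBS CSI driver
        image = "public.ecr.aws/ebs-csi-driver/aws-ebs-csi-driver:v1.21.0"
        args = [
          "controller",
          "--endpoint=unix:///csi/csi.sock",
        ]
      }

      # registers this task with Nomad as the CSI controller plugin
      csi_plugin {
        id        = "aws-ebs"
        type      = "controller"
        mount_dir = "/csi"
      }

      vault {
        policies = ["csi-controller"] # placeholder policy
      }

      # Vault-templated AWS credentials; change_mode = "restart" is what
      # periodically restarts the controller and opens the window where
      # the ControllerDetachVolume RPC can fail
      template {
        destination = "secrets/aws.env"
        env         = true
        change_mode = "restart"
        data        = <<-EOT
        {{ with secret "aws/creds/csi-controller" -}}
        AWS_ACCESS_KEY_ID={{ .Data.access_key }}
        AWS_SECRET_ACCESS_KEY={{ .Data.secret_key }}
        {{- end }}
        EOT
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```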
Nomad Client logs
CSI controller logs
here are concurrent logs from the CSI controller - these cover the node that was drained, which had four CSI volumes attached, corresponding to the four allocations that were drained above.
three of the volumes were detached successfully; one (`vol-043832224eaa28a03`) was not. that volume's detachment was interrupted by the controller being restarted, and that volume ID correlates with the alloc ID (`c8ac2ff1`) from the "postrun failed" error in the Nomad logs above.