Open ygersie opened 1 month ago
Hi @ygersie! Yeah that's not totally unexpected. The ControllerUnpublishVolume RPC is required to be idempotent by the specification, so this should be safe. I have some thoughts around using "sagas" to tighten up this behavior but so far all the draft designs I've considered end up pushing a lot more Raft logs, so there are tradeoffs there.
But I'll mark this for further examination in the meanwhile.
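For anyone following along, here is a minimal sketch of what "idempotent" means for this RPC. It is not Nomad's or any real plugin's code: `FakeController`, its in-memory `attachments` map, and the `detach_calls` counter are all hypothetical names made up for illustration. The only thing taken from the discussion above is that a repeated ControllerUnpublishVolume for an already-detached volume should succeed without doing any further work.

```python
class FakeController:
    """Hypothetical in-memory stand-in for a CSI controller plugin."""

    def __init__(self):
        # Maps volume_id -> set of node_ids the volume is attached to.
        self.attachments = {}
        # Counts how many times a real detach was actually performed.
        self.detach_calls = 0

    def controller_publish_volume(self, volume_id, node_id):
        # Record the attachment; publishing twice leaves a single entry.
        self.attachments.setdefault(volume_id, set()).add(node_id)
        return "OK"

    def controller_unpublish_volume(self, volume_id, node_id):
        nodes = self.attachments.get(volume_id, set())
        if node_id not in nodes:
            # Already detached (or never attached): report success
            # without touching the backend again.
            return "OK"
        nodes.discard(node_id)
        self.detach_calls += 1
        return "OK"


controller = FakeController()
controller.controller_publish_volume("vol-1", "node-a")
first = controller.controller_unpublish_volume("vol-1", "node-a")
second = controller.controller_unpublish_volume("vol-1", "node-a")  # duplicate call
```

Under these assumptions the duplicate call returns the same result as the first one and the detach only happens once, which is why a second unpublish should be harmless as long as the plugin honors the requirement.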
@tgross thanks! I wasn't sure if this was an issue so wanted to post. If this is an idempotent call feel free to close this issue 👍
Nomad version
1.8.2+ent
Problem description
When looking into an issue that still sometimes leads to stuck CSI volumes, I ran into the following scenario. When I stop an allocation and it is rescheduled onto the same node, I see events on 2 CSI controller plugins instead of just 1. It looks like the ControllerUnpublishVolume RPC is incorrectly called a second time. I'm not sure whether this will ever cause problems, but it's at least somewhat unexpected.
Logs
client plugin
csi-ebs-controller plugin logs 1
csi-ebs-controller plugin logs 2
nomad client logs
If the allocation moves to a different node, I don't see the duplicate ControllerUnpublishVolume. However, I do always see these on the node that is going to run the replacement allocation:
I guess that, although logged as an error, this isn't really a problem: the client keeps retrying to gain a claim while the volume hasn't been completely unpublished yet.
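To illustrate the retry behavior I'm guessing at, here is a rough sketch of a claim loop that treats "volume not released yet" as a transient error. This is not how the Nomad client is implemented: `VolumeNotReleased`, `claim_with_retry`, the attempt count, and the backoff delays are all hypothetical, chosen only to show why an error logged on an early attempt does not have to mean the claim ultimately fails.

```python
import time


class VolumeNotReleased(Exception):
    """Hypothetical error: the previous claim has not been unpublished yet."""


def claim_with_retry(claim, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call claim() until it succeeds or the attempts run out.

    The delay doubles after each failed attempt (assumed policy).
    """
    for attempt in range(attempts):
        try:
            return claim()
        except VolumeNotReleased:
            if attempt == attempts - 1:
                # Out of attempts: surface the error to the caller.
                raise
            sleep(base_delay * 2 ** attempt)


def make_flaky_claim(failures):
    """Build a claim function that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def claim():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise VolumeNotReleased("volume still attached elsewhere")
        return "claimed"

    return claim


# The first two attempts fail as if the old claim were still being released.
result = claim_with_retry(make_flaky_claim(2), base_delay=0.01)
```

In this sketch the first two failures would each show up as an error if logged, yet the third attempt succeeds once the old claim is gone.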