Hmm, arcus LIFO queuing is the ultimate problem here. We'd fixed a bunch of these races with the error backoff (I think it was), but it seems there are still a few out there.
I'm not sure what the right fix is TBH. Since the volume is never marked on the old node, the attacher won't know that it needs to be detached.
A fix in arcus (the GCE/PD control plane) is actually in the process of rolling out. This fix will enforce FIFO ordering of operations and will merge things like op 1 and op 2 in your example. The rollout should be complete in about a month.
The workarounds we've discussed for this at the CSI layer all have various levels of hackery and danger around them, so I think it's best to just wait for the arcus fix.
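To make the "merge op 1 and op 2" idea concrete, here is a rough Go sketch; the op type and the enqueueWithMerge helper are invented for illustration and are not arcus or gcp-csi-driver code. The point is only that a detach arriving while the matching attach is still queued cancels that attach, so it can never land later:

```go
// Toy illustration only: the op type and enqueueWithMerge are invented
// for this sketch and are not arcus or gcp-csi-driver code.
package main

import "fmt"

type op struct {
	id   int
	kind string // "attach" or "detach"
	node string
}

// enqueueWithMerge appends the new operation, unless an opposite operation
// for the same node is still pending, in which case the two cancel out
// (both are dropped) instead of being queued.
func enqueueWithMerge(queue []op, n op) []op {
	for i, pending := range queue {
		opposite := (pending.kind == "attach" && n.kind == "detach") ||
			(pending.kind == "detach" && n.kind == "attach")
		if pending.node == n.node && opposite {
			return append(queue[:i], queue[i+1:]...) // merge: drop both ops
		}
	}
	return append(queue, n)
}

func main() {
	var queue []op
	queue = enqueueWithMerge(queue, op{1, "attach", "X"}) // op 1 stays pending
	queue = enqueueWithMerge(queue, op{2, "detach", "X"}) // cancels op 1
	queue = enqueueWithMerge(queue, op{3, "attach", "Y"}) // only remaining op
	fmt.Println(queue)                                    // [{3 attach Y}]
}
```

Under a scheme like this the stale attach to node X is gone from the queue before it ever runs, so only the attach to node Y remains.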
Thank you for this follow up! Glad to hear about arcus enforcing fifo soon 👍
qq: has the above fix been rolled out in the meantime?
@mattcary @msau42 any news about the above question?
ping
The fix is currently rolling out. Should be complete within the next few weeks.
Thanks @msau42 ! Please let us know in this issue when the fix rollout is complete. We continue to see the above-described issue in our GCP clusters.
Is there a way for us to track this fix? Any issue or commit that we can follow?
Thank you.
There's no public tracker for the arcus rollout, unfortunately. There were some problems detected late last year that had to be fixed, and the final rollout is in progress now.
Any update on the rollout of the fix?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/remove-lifecycle rotten
@ialidzhikov: You can't reopen an issue/PR unless you authored it or you are a collaborator.
@msau42 @mattcary can you confirm that the fix rollout in GCP is complete?
Sorry for dropping this. The fix was rolled out by the end of January. Any race conditions seen recently are due to something else and may be worth looking into fixing in this driver.
/close
Thank you for the update!
We frequently run into situations where a pod's volume cannot be attached to some node Y, because on GCP it is still attached to a node X where the pod was previously located. In K8s there are, however, no traces of the volume being attached to node X: there is no volumeattachment resource mapping the volume to node X, and node X's .status.volumesAttached/.status.volumesInUse shows no sign of that volume, which indicates that it was (at some point in time) successfully detached from X.

After a lot of digging (in the gcp-csi-driver and the GCP audit logs) I found the following race condition, which presumably happens because there is no ordering of sequential operations and no locking of ongoing operations. This is the ordered sequence of events (a small simulation is sketched after the list):
1. Attach of the volume to node X is triggered (gcp-operation-ID: 1); the operation does not complete right away.
2. Detach of the volume from node X is triggered (gcp-operation-ID: 2) and succeeds.
3. Attach of the volume to node Y is triggered (gcp-operation-ID: 3) and succeeds.
4. The old attach operation (gcp-operation-ID: 1), resurrected from the dead, finally succeeds; the disk is attached to node X again.
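To illustrate the ordering problem in the sequence above, here is a small self-contained Go sketch; it is my own toy model with an invented op type and an in-memory attachedTo state, not the actual arcus or CSI driver behavior. It applies the three issued operations newest-first, as a LIFO queue would, and ends with the disk attached to node X even though the last intent was node Y:

```go
// Toy model of the race above: operations complete newest-first (LIFO),
// so the stuck attach to node X is applied last.
package main

import "fmt"

type op struct {
	id   int
	kind string // "attach" or "detach"
	node string
}

func main() {
	// Operations in the order they were issued (oldest first):
	issued := []op{
		{1, "attach", "X"}, // stuck in the control-plane queue
		{2, "detach", "X"}, // completes quickly
		{3, "attach", "Y"}, // completes quickly
	}

	// LIFO completion: walk the slice newest-first, so the stuck
	// attach (op 1) is applied last.
	attachedTo := "" // node the disk is currently attached to; "" = detached
	for i := len(issued) - 1; i >= 0; i-- {
		o := issued[i]
		switch o.kind {
		case "attach":
			attachedTo = o.node
		case "detach":
			if attachedTo == o.node {
				attachedTo = ""
			}
		}
		fmt.Printf("op %d (%s node %s) completed; disk attached to %q\n",
			o.id, o.kind, o.node, attachedTo)
	}

	// Final state: attached to node X again, while the orchestrator
	// believes the disk lives on node Y (or is detached).
	fmt.Println("final attachment:", attachedTo)
}
```

Iterating oldest-first (FIFO) over the same three operations instead ends with the disk attached to node Y, which is why enforcing FIFO (plus merging) in arcus removes this class of race.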