Hmm, arcus LIFO queuing is the ultimate problem here. We'd fixed a bunch of these races with the error backoff (I think it was), but it seems there are still a few out there.
I'm not sure what the right fix is TBH. Since the volume is never marked on the old node, the attacher won't know that it needs to be detached.
A fix in arcus (the GCE/PD control plane) is actually in the process of rolling out. This fix will enforce FIFO ordering of operations and will merge things like op 1 and op 2 in your example. The rollout should be complete in about a month.
The workarounds we've discussed for this at the CSI layer all have various levels of hackery and danger around them, so I think it's best to just wait for the arcus fix.
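To make the "merge op 1 and op 2" idea concrete, here is a rough Go sketch; the op type and the enqueueWithMerge helper are invented for illustration and are not arcus or gcp-csi-driver code. The point is only that a detach arriving while the matching attach is still queued cancels that attach, so it can never land later:

```go
// Toy illustration only: the op type and enqueueWithMerge are invented
// for this sketch and are not arcus or gcp-csi-driver code.
package main

import "fmt"

type op struct {
	id   int
	kind string // "attach" or "detach"
	node string
}

// enqueueWithMerge appends the new operation, unless an opposite operation
// for the same node is still pending, in which case the two cancel out
// (both are dropped) instead of being queued.
func enqueueWithMerge(queue []op, n op) []op {
	for i, pending := range queue {
		opposite := (pending.kind == "attach" && n.kind == "detach") ||
			(pending.kind == "detach" && n.kind == "attach")
		if pending.node == n.node && opposite {
			return append(queue[:i], queue[i+1:]...) // merge: drop both ops
		}
	}
	return append(queue, n)
}

func main() {
	var queue []op
	queue = enqueueWithMerge(queue, op{1, "attach", "X"}) // op 1 stays pending
	queue = enqueueWithMerge(queue, op{2, "detach", "X"}) // cancels op 1
	queue = enqueueWithMerge(queue, op{3, "attach", "Y"}) // only remaining op
	fmt.Println(queue)                                    // [{3 attach Y}]
}
```

Under a scheme like this the stale attach to node X is gone from the queue before it ever runs, so only the attach to node Y remains.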
Thank you for this follow up! Glad to hear about arcus enforcing fifo soon 👍
qq: has the above fix been rolled out in the meantime?
@mattcary @msau42 any news about the above question?
ping
The fix is currently rolling out. Should be complete within the next few weeks.
Thanks @msau42 ! Please let us know in this issue when the fix rollout is complete. We continue to see the above-described issue in our GCP clusters.
Is there a way for us to track this fix? Any issue or commit that we can follow?
Thank you.
There's no public tracker for the arcus rollout, unfortunately. There were some problems detected late last year that had to be fixed, and the final rollout is in progress now.
Any update on the rollout of the fix?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/reopen
/remove-lifecycle rotten
@ialidzhikov: You can't reopen an issue/PR unless you authored it or you are a collaborator.
@msau42 @mattcary can you confirm that the fix rollout in GCP is complete?
Sorry for dropping this. The fix was rolled out by the end of January. Any race conditions seen recently are due to something else and may be worth looking into fixing in this driver.
/close
Thank you for the update!
We frequently run into situations where a pod's volume cannot be attached to some node Y, because on GCP it is still attached to a node X where the pod was previously located. In K8s there are, however, no traces of the volume being attached to node X: there is no volumeattachment resource mapping the volume to node X, and node X's .status.volumesAttached/.status.volumesInUse shows no sign of that volume, which indicates that it was (at some point in time) successfully detached from X.

After a lot of digging (in the gcp-csi-driver and the GCP audit logs) I found the following race condition, which presumably happens because there is no ordering of sequential operations and no locking of ongoing operations. This is the ordered sequence of events (a small simulation is sketched after the list):
1. Attach of the volume to node X is triggered (gcp-operation-ID: 1); the operation does not complete right away.
2. Detach of the volume from node X is triggered (gcp-operation-ID: 2) and succeeds.
3. Attach of the volume to node Y is triggered (gcp-operation-ID: 3) and succeeds.
4. The old attach operation (gcp-operation-ID: 1), resurrected from the dead, finally succeeds; the disk is attached to node X again.
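To illustrate the ordering problem in the sequence above, here is a small self-contained Go sketch; it is my own toy model with an invented op type and an in-memory attachedTo state, not the actual arcus or CSI driver behavior. It applies the three issued operations newest-first, as a LIFO queue would, and ends with the disk attached to node X even though the last intent was node Y:

```go
// Toy model of the race above: operations complete newest-first (LIFO),
// so the stuck attach to node X is applied last.
package main

import "fmt"

type op struct {
	id   int
	kind string // "attach" or "detach"
	node string
}

func main() {
	// Operations in the order they were issued (oldest first):
	issued := []op{
		{1, "attach", "X"}, // stuck in the control-plane queue
		{2, "detach", "X"}, // completes quickly
		{3, "attach", "Y"}, // completes quickly
	}

	// LIFO completion: walk the slice newest-first, so the stuck
	// attach (op 1) is applied last.
	attachedTo := "" // node the disk is currently attached to; "" = detached
	for i := len(issued) - 1; i >= 0; i-- {
		o := issued[i]
		switch o.kind {
		case "attach":
			attachedTo = o.node
		case "detach":
			if attachedTo == o.node {
				attachedTo = ""
			}
		}
		fmt.Printf("op %d (%s node %s) completed; disk attached to %q\n",
			o.id, o.kind, o.node, attachedTo)
	}

	// Final state: attached to node X again, while the orchestrator
	// believes the disk lives on node Y (or is detached).
	fmt.Println("final attachment:", attachedTo)
}
```

Iterating oldest-first (FIFO) over the same three operations instead ends with the disk attached to node Y, which is why enforcing FIFO (plus merging) in arcus removes this class of race.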