kubernetes-sigs / secrets-store-csi-driver

Secrets Store CSI driver for Kubernetes secrets - Integrates secrets stores with Kubernetes via a CSI volume.
https://secrets-store-csi-driver.sigs.k8s.io/
Apache License 2.0

Apparent race condition causes pods to mount secret volumes with no object content #1051

Open · dindurthy opened this issue 1 year ago

What steps did you take and what happened: An apparent race condition causes pods to mount secret volumes with no object content.

What did you expect to happen: We expect a pod to start only if its secret volume mounts successfully with the object content.

Anything else you would like to add: Here's an attempt at picking out the relevant logs:

secrets-store: "I0914 17:56:45.942699       1 nodeserver.go:353] \"Using gRPC client\" provider=\"aws\" pod=\"archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\"",
csi-aws      : "I0914 17:57:58.141005       1 server.go:107] Servicing mount request for pod archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq in namespace services using service account archive-importer-default with region us-east-1",
secrets-store: "E0914 17:58:45.941230       1 nodeserver.go:235] \"failed to mount secrets store object content\" err=\"rpc error: code = DeadlineExceeded desc = context deadline exceeded\" pod=\"services/archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\"",
secrets-store: "I0914 17:58:45.941275       1 nodeserver.go:81] \"unmounting target path as node publish volume failed\" targetPath=\"/var/lib/kubelet/pods/5890fd63-ad08-4b0f-9c44-64de925d9119/volumes/kubernetes.io~csi/archive-importer-shared-secrets/mount\" pod=\"services/archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\""
secrets-store: "I0914 17:58:46.634611       1 utils.go:47] \"already mounted to target\" targetPath=\"/var/lib/kubelet/pods/5890fd63-ad08-4b0f-9c44-64de925d9119/volumes/kubernetes.io~csi/archive-importer-gcp-usa-central-1/mount\""
secrets-store: "I0914 17:58:46.634645       1 nodeserver.go:133] \"target path is already mounted\" targetPath=\"/var/lib/kubelet/pods/5890fd63-ad08-4b0f-9c44-64de925d9119/volumes/kubernetes.io~csi/archive-importer-gcp-usa-central-1/mount\" pod=\"services/archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\""

The kubelet logs suggest that the secrets mount attempt for this pod starts, errors out, retries, and then succeeds. Here's an example:

"I0914 17:56:45.831962    2630 reconciler.go:225] \"operationExecutor.VerifyControllerAttachedVolume started for volume \\\"archive-importer-shared-secrets\\\" (UniqueName: \\\"kubernetes.io/csi/5890fd63-ad08-4b0f-9c44-64de925d9119-archive-importer-shared-secrets\\\") pod \\\"archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\\\" (UID: \\\"5890fd63-ad08-4b0f-9c44-64de925d9119\\\") \""
"I0914 17:56:45.937244    2630 reconciler.go:266] \"operationExecutor.MountVolume started for volume \\\"archive-importer-shared-secrets\\\" (UniqueName: \\\"kubernetes.io/csi/5890fd63-ad08-4b0f-9c44-64de925d9119-archive-importer-shared-secrets\\\") pod \\\"archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\\\" (UID: \\\"5890fd63-ad08-4b0f-9c44-64de925d9119\\\") \""
"E0914 17:58:45.940859    2630 nestedpendingoperations.go:335] Operation for \"{volumeName:kubernetes.io/csi/5890fd63-ad08-4b0f-9c44-64de925d9119-archive-importer-shared-secrets podName:5890fd63-ad08-4b0f-9c44-64de925d9119 nodeName:}\" failed. No retries permitted until 2022-09-14 17:58:46.440837359 +0000 UTC m=+1221.338406280 (durationBeforeRetry 500ms). Error: MountVolume.SetUp failed for volume \"archive-importer-shared-secrets\" (UniqueName: \"kubernetes.io/csi/5890fd63-ad08-4b0f-9c44-64de925d9119-archive-importer-shared-secrets\") pod \"archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\" (UID: \"5890fd63-ad08-4b0f-9c44-64de925d9119\") : rpc error: code = DeadlineExceeded desc = context deadline exceeded"
"I0914 17:58:46.493883    2630 reconciler.go:266] \"operationExecutor.MountVolume started for volume \\\"archive-importer-shared-secrets\\\" (UniqueName: \\\"kubernetes.io/csi/5890fd63-ad08-4b0f-9c44-64de925d9119-archive-importer-shared-secrets\\\") pod \\\"archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\\\" (UID: \\\"5890fd63-ad08-4b0f-9c44-64de925d9119\\\") \""
"I0914 17:58:46.642319    2630 operation_generator.go:714] MountVolume.SetUp succeeded for volume \"archive-importer-shared-secrets\" (UniqueName: \"kubernetes.io/csi/5890fd63-ad08-4b0f-9c44-64de925d9119-archive-importer-shared-secrets\") pod \"archive-importer-gcp-usa-central-1-transmission-7d4665cf4-gf7hq\" (UID: \"5890fd63-ad08-4b0f-9c44-64de925d9119\")"

This happens very sporadically when we are scheduling many pods at once on a node, so I am sure resource contention on the underlying node is at play.
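
To confirm whether an affected pod really has an empty mount, a quick in-pod check of the mounted directory is enough. A minimal Go sketch; the /mnt/secrets-store path is only an example and depends on the pod's volumeMount:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Example path where the CSI secrets volume is mounted in the pod;
	// adjust to match the container's volumeMount.
	const mountPath = "/mnt/secrets-store"

	entries, err := os.ReadDir(mountPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read secrets mount:", err)
		os.Exit(1)
	}
	if len(entries) == 0 {
		// The volume mounted, but no object content was ever written.
		fmt.Fprintln(os.Stderr, "secrets mount is empty; refusing to start")
		os.Exit(1)
	}
	fmt.Printf("secrets mount has %d entries, proceeding\n", len(entries))
}
```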

Which provider are you using: We are using the csi-secrets-store-provider-aws.

Environment:

dindurthy commented 1 year ago

Happy to try upgrading if that might help, but I didn't see any issues suggesting it would.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FirelightFlagboy commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FirelightFlagboy commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FirelightFlagboy commented 1 year ago

/remove-lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FirelightFlagboy commented 6 months ago

/remove-lifecycle stale

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FirelightFlagboy commented 3 months ago

/remove-lifecycle stale

k8s-triage-robot commented 6 days ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

FirelightFlagboy commented 6 days ago

/remove-lifecycle stale