kubernetes-csi / external-attacher

Sidecar container that watches Kubernetes VolumeAttachment objects and triggers ControllerPublish/Unpublish against a CSI endpoint
Apache License 2.0
167 stars 179 forks source link

Fix timeout on short CSI calls during attacher initialization #561

Closed Fricounet closed 4 months ago

Fricounet commented 4 months ago

What type of PR is this?

/kind bug

What this PR does / why we need it:

The same context.WithTimeout was used in different places when calling the CSI drivers in the main function. With recent changes, the code execution started taking longer and the context would hit its timeout on the GetPluginCapabilities driver call. For instance with when running with the azuredisk-csi-driver (I've seen this on the aws-ebs and the gcp-pd drivers too):

❯ k logs -c csi-attacher csi-azuredisk-controller-69fdb787c6-k59fw
I0529 12:26:30.367859       1 main.go:109] "Version" version="v4.6.0-0-gc58f4277"
I0529 12:26:30.369489       1 connection.go:234] "Connecting" address="unix:///csi/csi.sock"
I0529 12:26:31.370120       1 common.go:145] "Probing CSI driver for readiness"
I0529 12:26:31.370148       1 connection.go:264] "GRPC call" method="/csi.v1.Identity/Probe" request="{}"
I0529 12:26:31.371908       1 connection.go:270] "GRPC response" response="{\"ready\":{\"value\":true}}" err=null
I0529 12:26:31.371934       1 connection.go:264] "GRPC call" method="/csi.v1.Identity/GetPluginInfo" request="{}"
I0529 12:26:31.372446       1 connection.go:270] "GRPC response" response="{\"name\":\"disk.csi.azure.com\",\"vendor_version\":\"v1.30.1\"}" err=null
I0529 12:26:31.372466       1 main.go:169] "CSI driver name" driver="disk.csi.azure.com"
I0529 12:26:31.373062       1 connection.go:234] "Connecting" address="unix:///csi/csi.sock"
I0529 12:26:32.374364       1 common.go:145] "Probing CSI driver for readiness"
I0529 12:26:32.374396       1 connection.go:264] "GRPC call" method="/csi.v1.Identity/Probe" request="{}"
I0529 12:26:32.374869       1 connection.go:270] "GRPC response" response="{\"ready\":{\"value\":true}}" err=null
I0529 12:26:32.374921       1 connection.go:264] "GRPC call" driver="disk.csi.azure.com" method="/csi.v1.Identity/GetPluginCapabilities" request="{}"
I0529 12:26:32.375015       1 main.go:196] "ServeMux listening" driver="disk.csi.azure.com" address=":22012" metricsPath="/metrics"
I0529 12:26:32.375048       1 connection.go:270] "GRPC response" driver="disk.csi.azure.com" response="{}" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
E0529 12:26:32.375069       1 main.go:207] "Failed to check if the CSI Driver supports the CONTROLLER_SERVICE" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" driver="disk.csi.azure.com"

This change uses a separate context.WithTimeout for each short CSI call using the default csiTimeout defined in main.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Fixed an issue where the attacher would see its context timeout on startup when calling the CSI driver.
k8s-ci-robot commented 4 months ago

Welcome @Fricounet!

It looks like this is your first PR to kubernetes-csi/external-attacher 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-csi/external-attacher has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot commented 4 months ago

Hi @Fricounet. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
huww98 commented 4 months ago

connection.go:234] "Connecting" address="unix:///csi/csi.sock" this seems taking exactly 1 second from your logs. Do you know why? It looks like the CSI response is very fast, and 1 second timeout should be enough for both RPC calls.

xing-yang commented 4 months ago

/ok-to-test

Fricounet commented 4 months ago

@huww98 good point. I had not paid attention that both CSI connection calls were taking 1s each. But I think this might just be the backoff for a failed call to csi-lib-utils Connect kicking in. So its possibly just the csi-driver that failed to respond on the first call and the next call succeeded after the backoff. I will investigate a bit more but I still feel that each csi call that are supposed to be 'short' should use their own timeouts instead of sharing the same one. Especially when there are calls in the middle that don't have those timeouts (like the one to connect to the driver).

Fricounet commented 4 months ago

I've figured where this delay comes from and It is not an issue with the connection to the driver but with a change to the Probe function that is run just after. In the latest version of csi-lib-utils, it was changed to use a Ticker instead of the sleep it was using before between each probe. As a result, the first probe is only launched after 1s instead of immediately resulting in the delay we see

Fricounet commented 4 months ago

I've submitted a PR in https://github.com/kubernetes-csi/csi-lib-utils/pull/175 to restore the previous behavior of the Probe. But this PR stays relevant IMHO because the timeout should be used for short csi calls and having it shared for 2 calls separated by potentially long running calls (Connection and Probe for instance) feels wrong.

jsafrane commented 4 months ago

/lgtm /approve

You're right both about the ProbeForever being slower and that the timeout context should start just before corresponding call.

k8s-ci-robot commented 4 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Fricounet, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubernetes-csi/external-attacher/blob/master/OWNERS)~~ [jsafrane] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
jsafrane commented 4 months ago

BTW, thanks a ton for testing the latest sidecars and reporting and even fixing the issue!

Fricounet commented 4 months ago

And thanks for all our work on CSI and storage!