kubernetes-sigs / aws-ebs-csi-driver

CSI driver for Amazon EBS https://aws.amazon.com/ebs/
Apache License 2.0

insufficient capacity API possibly errors not propagating into logs and events #2154

Closed: joemiller closed this issue 1 month ago

joemiller commented 2 months ago

On 2024-08-30, starting at roughly 19:00 UTC and lasting many hours, the use1-az2 AZ was unable to provision new io2 volumes. The CSI driver logs and the events posted back to pending pods were generic, e.g.:

failed to provision volume with StorageClass "aws-ebs-io2-020000i": rpc error: code = Internal desc = Could not create volume "pvc-307a77e9-0446-4bd0-8a49-8cf87d31d18f": could not create volume in EC2: operation error EC2: CreateVolume, failed to get rate limit token, retry quota exceeded, 0 available, 5 requested

It was not clear from these logs what the underlying error was; they looked fairly generic to our responding engineers. It was not until someone tried creating a volume manually via the web console that we saw the error message "There is currently insufficient capacity to serve the request. Try again at a later time or in a different availability zone".

[Screenshot: EC2 console error showing the insufficient capacity message]

I don't know whether this error was being returned from the AWS API to the CSI driver or not. If so, it would be ideal to have it propagated back to the pod/PVC events.

AndrewSirenko commented 2 months ago

Hey @joemiller thank you for this feedback.

Would you mind sharing your version of EBS CSI Driver? And any full controller logs that you still have?

The error you show is a retry rate-limit error, which might have obscured the underlying EBS InsufficientVolumeCapacity error, and might therefore also obscure the fix in #2155. We will look into this on our side, but ruling out an old driver version would be helpful.

In either case, yes, you should have seen a clear error message pointing to insufficient volume capacity in the controller logs, and we will work to fix this.

Thanks!

torredil commented 2 months ago

@AndrewSirenko Which classes of errors are considered retryable by our createVolumeRetryer?

The official docs describe InsufficientVolumeCapacity as a terminal error by explicitly recommending that you "try again later". Given that, it seems undesirable to go through the adaptive retrier when the API returns this error code.

I'm thinking it would make more sense to error out without retrying at the SDK level and let the exponential backoff mechanism implemented in the provisioner handle it, but I would love to hear your thoughts on this.
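
For illustration, here is a rough sketch of what that could look like with the SDK v2 retry package (this is not the driver's actual code, and treating that specific error code as terminal is exactly the assumption under discussion):

```go
// Hypothetical sketch, not the driver's implementation: mark
// InsufficientVolumeCapacity as non-retryable at the SDK level and let the
// external provisioner's exponential backoff drive the retries instead.
package csiretry

import (
	"errors"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/smithy-go"
)

func newCreateVolumeRetryer() aws.Retryer {
	return retry.NewStandard(func(o *retry.StandardOptions) {
		// Prepend a check that runs before the SDK's default retryable checks.
		o.Retryables = append([]retry.IsErrorRetryable{
			retry.IsErrorRetryableFunc(func(err error) aws.Ternary {
				var apiErr smithy.APIError
				if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InsufficientVolumeCapacity" {
					return aws.FalseTernary // terminal: return the error to the provisioner immediately
				}
				return aws.UnknownTernary // defer to the SDK defaults for everything else
			}),
		}, o.Retryables...)
	})
}
```

Such a retryer could then be attached per client, e.g. ec2.NewFromConfig(cfg, func(o *ec2.Options) { o.Retryer = newCreateVolumeRetryer() }).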

AndrewSirenko commented 2 months ago

We use the defaults set by the EC2 SDK team for what errors are considered terminal. As of v1.35.0 we rely on those defaults, and they can be found in the retryer package's documentation and code.

However, let's confirm the version of the driver, because our retry mechanisms have gone through three changes (SDK v1, SDK v2, adaptive retry).

Regardless of retries or terminal errors, the real EBS error messages should be propagated to the controller logs, which, according to this issue, might not have happened.


The AWS SDK for Go v2 retry module docs: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/aws/retry

(The retryable errors should also be clearly listed in the AWS Go SDK v2 repository if you grep for the error code. I can do this once I'm back at my workstation.)
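
For reference, a minimal sketch of how the adaptive retryer is enabled through that package (illustrative only; the max-attempts value here is an assumed example, not the driver's default):

```go
// Illustrative only: enabling the SDK v2 adaptive retry mode. Adaptive mode
// wraps the standard retryer (whose retry-token bucket produces the
// "failed to get rate limit token, retry quota exceeded" error quoted in the
// issue) and adds client-side throttling when throttle errors are detected.
package csiretry

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func newEC2Client(ctx context.Context) *ec2.Client {
	cfg, err := config.LoadDefaultConfig(ctx,
		config.WithRetryer(func() aws.Retryer {
			return retry.NewAdaptiveMode(func(o *retry.AdaptiveModeOptions) {
				// Tune the underlying standard retryer; 5 attempts is an
				// assumed example value.
				o.StandardOptions = append(o.StandardOptions,
					func(so *retry.StandardOptions) { so.MaxAttempts = 5 })
			})
		}),
	)
	if err != nil {
		log.Fatalf("loading AWS config: %v", err)
	}
	return ec2.NewFromConfig(cfg)
}
```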

ConnorJC3 commented 2 months ago

I'm not sure I follow the logic for why retrying InsufficientVolumeCapacity is bad; isn't this an error that we explicitly expect to eventually go away (and thus should be retried)? I don't think we should deviate from the AWS SDK behavior unless we have a clear justification for why our use case is different from the average SDK consumer's.

Regardless, I don't think this is super relevant to the issue reported (just pretend the error that occurred is one we definitely want to retry). I think one of two things (or maybe both) likely happened:

  1. The log message containing the true error was never emitted or was swallowed somehow - that's obviously bad and we should fix it if true (see the sketch after this list for one way to surface the underlying API error)
  2. The log message containing the true error does exist somewhere, but there's a flood of retry errors after it obscuring the true error - maybe we should up the default verbosity of retry-related errors?
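
For the first point, here is a rough sketch of one way the real API error could be surfaced (the helper name is hypothetical and this is not a patch against the current code): unwrap the smithy API error when wrapping the CreateVolume failure, so the EC2 error code and message land in the controller log and in the error string the provisioner turns into events.

```go
// Hypothetical helper, not the driver's current code: surface the underlying
// EC2 error code/message (e.g. InsufficientVolumeCapacity) instead of only
// the generic retry-wrapper text.
package csiretry

import (
	"errors"
	"fmt"

	"github.com/aws/smithy-go"
	"k8s.io/klog/v2"
)

func wrapEC2Error(op string, err error) error {
	var apiErr smithy.APIError
	if errors.As(err, &apiErr) {
		// Log the real API error code and message explicitly so they are not
		// buried under retry noise, then include them in the returned error.
		klog.ErrorS(err, "EC2 call failed", "operation", op,
			"errorCode", apiErr.ErrorCode(), "errorMessage", apiErr.ErrorMessage())
		return fmt.Errorf("%s: %s: %s: %w", op, apiErr.ErrorCode(), apiErr.ErrorMessage(), err)
	}
	klog.ErrorS(err, "EC2 call failed", "operation", op)
	return fmt.Errorf("%s: %w", op, err)
}
```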

@joemiller if you could confirm which version of the EBS CSI driver you're using, as @AndrewSirenko asked above, that would be great. We revamped the retry logic a few months ago, and that should hopefully reduce the spam of rate-limit-exceeded errors by doing client-side throttling when we detect them.

torredil commented 2 months ago

We use the defaults set by the EC2 SDK team for what errors are considered terminal.

Perfect, this answers my question @AndrewSirenko.

--

Thanks for your input as well, @ConnorJC3:

I'm not sure I follow the logic for why retrying InsufficientVolumeCapacity is bad; isn't this an error that we explicitly expect to eventually go away (and thus should be retried)?

You're right that retrying is appropriate. My concern isn't about whether to retry, but rather at which level the retry should occur.

I don't think we should deviate from the AWS SDK behavior unless we have a clear justification for why our use case is different from the average SDK consumer's case.

I generally agree that we shouldn't deviate from default behavior without good reason, but our use case is arguably different from the average SDK consumer's: the average SDK consumer isn't working with multiple layers of retry mechanisms. The external provisioner already retries indefinitely with a backoff strategy, so the API would not be hammered if we returned the error without retrying at the SDK level.

That said, I'm not advocating for deviating from the default behavior or implementing a custom retrier to make an exception for this class of error. If we think or know that it's fruitful to retry at the SDK level because an ICE event can be resolved before the RPC times out, then I'm fine with the existing logic.

The log message containing the true error was never emitted or was swallowed somehow - that's obviously bad and we should fix it if true

+1, it's plausible that the adaptive retrier swallows the initial error.