Hey @joemiller thank you for this feedback.
Would you mind sharing your version of EBS CSI Driver? And any full controller logs that you still have?
The error you show is a retry rate-limit error, which might have obscured the underlying EBS `InsufficientVolumeCapacity` error, and might therefore also obscure the fix in #2155. We will look into this on our side, but ruling out an old driver version would be helpful.
In either case, yes, you should have seen a clear error message pointing to insufficient volume capacity in the controller logs, and we will work to fix this.
Thanks!
@AndrewSirenko Which classes of errors are considered retryable by our `createVolumeRetryer`?
The official docs describe `InsufficientVolumeCapacity` as a terminal error by explicitly recommending that you "try again later". Given that, it seems undesirable to go through the adaptive retryer when the API returns this error code.

I'm thinking it would make more sense to error out without retrying at the SDK level and let the exponential backoff mechanism implemented in the provisioner do the needful, but I'd love to hear your thoughts on this.
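To make that concrete, here is a rough sketch, assuming the aws-sdk-go-v2 retry package and not the driver's actual client wiring, of what vetoing SDK-level retries for just this one error code could look like:

```go
// Rough sketch only (not driver code): veto SDK-level retries for a single
// error code so that the external provisioner's backoff is the only retry
// layer for it. Everything else keeps the SDK's default classification.
package main

import (
	"errors"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/smithy-go"
)

func newNoRetryOnICERetryer() aws.Retryer {
	return retry.NewStandard(func(o *retry.StandardOptions) {
		// Retryables are consulted in order and the first non-Unknown answer
		// wins, so prepending lets us mark this one code as terminal.
		o.Retryables = append([]retry.IsErrorRetryable{
			retry.IsErrorRetryableFunc(func(err error) aws.Ternary {
				var apiErr smithy.APIError
				if errors.As(err, &apiErr) && apiErr.ErrorCode() == "InsufficientVolumeCapacity" {
					return aws.FalseTernary // do not retry at the SDK level
				}
				return aws.UnknownTernary // fall through to the defaults
			}),
		}, o.Retryables...)
	})
}

func main() {
	r := newNoRetryOnICERetryer()
	ice := &smithy.GenericAPIError{Code: "InsufficientVolumeCapacity", Message: "insufficient capacity"}
	fmt.Println(r.IsErrorRetryable(ice)) // false
}
```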
We use the defaults set by the EC2 SDK team for what errors are considered terminal. This info can be found in the retryer package's documentation and code; as of v1.35.0 we trust the defaults set by the EC2 team.

However, let's confirm the version of the driver, because our retry mechanisms have gone through three changes (SDK v1, SDK v2, adaptive retry).

Regardless of retries or terminal errors, real EBS error messages should be propagated to the controller logs, which, according to the issue, might not have happened.

The Go AWS SDK v2 retry module docs: https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/aws/retry

(The retryable errors should also be clearly listed in the AWS Go SDK v2 repository if you grep for the error. I can do this as well once I'm back at my workstation.)
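In the meantime, here is a quick way to probe locally how the SDK v2 standard retryer classifies a given error code; the errors below are hand-constructed for illustration, not captured from the EC2 API:

```go
// Quick local probe of how the SDK v2 standard retryer classifies a couple of
// hand-constructed EC2 error codes. Not driver code; real classification can
// also depend on HTTP status codes and modeled error traits.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/smithy-go"
)

func main() {
	r := retry.NewStandard()

	// RequestLimitExceeded is one of the SDK's default throttle error codes,
	// so it should be reported as retryable.
	throttled := &smithy.GenericAPIError{Code: "RequestLimitExceeded", Message: "Request limit exceeded."}
	fmt.Println("RequestLimitExceeded retryable:", r.IsErrorRetryable(throttled))

	// InsufficientVolumeCapacity is not in the SDK's default retryable or
	// throttle code lists, so this bare error falls through the default checks.
	ice := &smithy.GenericAPIError{Code: "InsufficientVolumeCapacity", Message: "insufficient capacity"}
	fmt.Println("InsufficientVolumeCapacity retryable:", r.IsErrorRetryable(ice))
}
```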
I'm not sure I follow the logic for why retrying `InsufficientVolumeCapacity` is bad; isn't this an error that we explicitly expect to eventually go away (and thus should be retried)? I don't think we should deviate from the AWS SDK behavior unless we have a clear justification for why our use case is different from the average SDK consumer's case.

Regardless, I don't think this is super relevant to the issue reported (just pretend the error that occurred is one we definitely want to retry). I think one of two things (or maybe both) likely happened:
@joemiller if you could confirm which version of the EBS CSI driver you're using, as @AndrewSirenko asked above, that would be great. We revamped the retrying a few months ago, which should hopefully reduce the spam of rate-limit-exceeded errors by doing client-side throttling when we detect them.
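For reference, opting an SDK v2 EC2 client into adaptive retries (the client-side throttling mentioned above) looks roughly like the sketch below; the wiring and values are illustrative, not the driver's exact configuration:

```go
// Illustrative wiring only: build an EC2 client with the SDK v2 adaptive
// retryer, which layers a client-side rate limiter on top of standard retries
// when throttling is detected.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRetryer(func() aws.Retryer {
			return retry.NewAdaptiveMode(func(o *retry.AdaptiveModeOptions) {
				o.StandardOptions = append(o.StandardOptions, func(so *retry.StandardOptions) {
					so.MaxAttempts = 5 // example value
				})
			})
		}),
	)
	if err != nil {
		log.Fatal(err)
	}

	client := ec2.NewFromConfig(cfg)
	_ = client // client.CreateVolume(...) calls now go through the adaptive retryer
}
```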
> We use the defaults set by the EC2 SDK team for what errors are considered terminal.
Perfect, this answers my question @AndrewSirenko.
--
Thanks for your input as well, @ConnorJC3:
> I'm not sure I follow the logic for why retrying `InsufficientVolumeCapacity` is bad; isn't this an error that we explicitly expect to eventually go away (and thus should be retried)?
You're right that retrying is appropriate. My concern isn't about whether to retry, but rather at which level the retry should occur.
> I don't think we should deviate from the AWS SDK behavior unless we have a clear justification for why our use case is different from the average SDK consumer's case.

I generally agree that we shouldn't deviate from default behavior without good reason, but our use case is arguably different from the average SDK consumer's: the average SDK consumer isn't working with multiple layers of retry mechanisms. The external provisioner already retries indefinitely with a backoff strategy, so the API wouldn't be hammered if we returned the error without retrying at the SDK level.

That said, I'm not advocating for deviating from the default behavior or implementing a custom retryer just to make an exception for this class of error. If we think or know that it's fruitful to retry at the SDK level because an ICE event can be resolved before the RPC times out, then I'm cool with the existing logic.
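To illustrate the layering I mean: whatever the SDK does within a single API call, the external provisioner keeps re-issuing CreateVolume with its own capped exponential backoff (tunable via its retry-interval flags, if I recall correctly). A toy sketch of that outer loop only, not the sidecar's actual code:

```go
// Toy sketch of the outer retry layer only: the provisioner re-issues a failed
// CreateVolume with capped exponential backoff, independent of any retries the
// AWS SDK performs inside each attempt. All names and values are illustrative.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// createVolumeOnce stands in for a single CreateVolume RPC to the driver,
// which may itself do SDK-level retries internally.
func createVolumeOnce(ctx context.Context) error {
	return errors.New("InsufficientVolumeCapacity: insufficient capacity in this AZ")
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	delay := 1 * time.Second         // analogous to the provisioner's starting interval
	const maxDelay = 5 * time.Minute // analogous to its maximum interval

	for {
		err := createVolumeOnce(ctx)
		if err == nil {
			fmt.Println("volume provisioned")
			return
		}
		fmt.Printf("provisioning failed (%v), retrying in %s\n", err, delay)

		select {
		case <-time.After(delay):
		case <-ctx.Done():
			fmt.Println("giving up:", ctx.Err())
			return
		}

		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}
```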
> The logging containing the true error never happened or was swallowed somehow - that's obviously bad and we should fix if true
+1, it's plausible that the adaptive retryer swallows the initial error.
On 2024-08-30, starting at roughly 1900 UTC and ending many hours later, the `use1-az2` AZ was unable to provision new `io2` volumes. The CSI driver logs and the events posted back to pending pods were generic; it was not clear to our responding engineers what the error actually was. It was not until someone thought to try creating a volume manually via the web console that they saw the error message:
"There is currently insufficient capacity to serve the request. Try again at a later time or in a different availability zone"
I don't know if this error was being returned from the AWS API to the CSI driver or not. If so, it would be ideal to have this error propagated back to the pod/pvc events.
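For what it's worth, I'd expect that message to reach the PVC events via the gRPC status returned by the failed CreateVolume call, which the external-provisioner records as an event on the PVC. A simplified, hypothetical sketch of that driver-side pattern (not the actual plugin code):

```go
// Simplified, hypothetical sketch: return the underlying EC2 error inside the
// gRPC status of CreateVolume so the external-provisioner can record it as an
// event on the PVC. Function and type names are illustrative.
package main

import (
	"context"
	"fmt"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// createEBSVolume stands in for the EC2 CreateVolume call, which may fail
// with an InsufficientVolumeCapacity API error.
func createEBSVolume(ctx context.Context, name string) (string, error) {
	return "", fmt.Errorf("InsufficientVolumeCapacity: insufficient capacity in use1-az2")
}

type controllerServer struct{}

func (s *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	volID, err := createEBSVolume(ctx, req.GetName())
	if err != nil {
		// Putting err into the status message is what lets `kubectl describe pvc`
		// show the real EBS error instead of a generic provisioning failure.
		return nil, status.Errorf(codes.Internal, "could not create volume %q: %v", req.GetName(), err)
	}
	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{
			VolumeId:      volID,
			CapacityBytes: req.GetCapacityRange().GetRequiredBytes(),
		},
	}, nil
}

func main() {
	resp, err := (&controllerServer{}).CreateVolume(context.Background(), &csi.CreateVolumeRequest{Name: "pvc-1234"})
	fmt.Println(resp, err) // err carries the EBS error text for the PVC event
}
```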