kyma-project / infrastructure-manager

Apache License 2.0

Deal with recoverable errors in Gardener Shoot spec #369

Open tobiscr opened 2 months ago

tobiscr commented 2 months ago

Description

A Gardener Shoot can end up with a recoverable error (e.g. quota exceeded). We have to define on the Provisioner side how we treat these errors. The potential error codes (see Gardener docs) are listed in the Appendix section.

We have to verify which error codes KIM will treat as recoverable, leaving the RuntimeCR in a Pending state. All other error codes will be treated as non-recoverable, and the RuntimeCR will be set to status Error.

AC:

Appendix

| Error code | User error | Description |
| --- | --- | --- |
| ERR_INFRA_UNAUTHENTICATED | true | Indicates that the last error occurred due to the client request not being completed because it lacks valid authentication credentials for the requested resource. It is classified as a non-retryable error code. |
| ERR_INFRA_UNAUTHORIZED | true | Indicates that the last error occurred due to the server understanding the request but refusing to authorize it. It is classified as a non-retryable error code. |
| ERR_INFRA_QUOTA_EXCEEDED | true | Indicates that the last error occurred due to infrastructure quota limits. It is classified as a non-retryable error code. |
| ERR_INFRA_RATE_LIMITS_EXCEEDED | false | Indicates that the last error occurred due to exceeded infrastructure request rate limits. |
| ERR_INFRA_DEPENDENCIES | true | Indicates that the last error occurred due to dependent objects on the infrastructure level. It is classified as a non-retryable error code. |
| ERR_RETRYABLE_INFRA_DEPENDENCIES | false | Indicates that the last error occurred due to dependent objects on the infrastructure level, but the operation should be retried. |
| ERR_INFRA_RESOURCES_DEPLETED | true | Indicates that the last error occurred due to depleted resources in the infrastructure. |
| ERR_CLEANUP_CLUSTER_RESOURCES | true | Indicates that the last error occurred due to resources in the cluster that are stuck in deletion. |
| ERR_CONFIGURATION_PROBLEM | true | Indicates that the last error occurred due to a configuration problem. It is classified as a non-retryable error code. |
| ERR_RETRYABLE_CONFIGURATION_PROBLEM | true | Indicates that the last error occurred due to a retryable configuration problem. "Retryable" means that the occurred error is likely to be resolved in an ungraceful manner after a given period of time. |
| ERR_PROBLEMATIC_WEBHOOK | true | Indicates that the last error occurred due to a webhook not following the Kubernetes best practices. |

Reasons

React to error cases reported by Gardener and distinguish recoverable from non-recoverable cases.

Attachments

Disper commented 5 days ago

Gardener exposes a helper method for detecting non-retryable errors. https://github.com/gardener/gardener/blob/539913d05582f88b80ea99cc53f2487aebaeeeab/pkg/apis/core/v1beta1/helper/errors.go#L181-L196

#480 proposes the following split between retryable and non-retryable errors. I would especially like to point out ERR_INFRA_RESOURCES_DEPLETED and ERR_CLEANUP_CLUSTER_RESOURCES: the documentation does not say whether they are retryable, but I decided to treat them as retryable because Gardener's HasNonRetryableErrorCode does not consider those two error codes.

[Image: Pasted_Image_08_11_2024__07_32]
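
For illustration, the split discussed above could be sketched roughly as follows. This is not the code from the linked proposal and it does not call Gardener's helper. It is a self-contained approximation in which `LastError`, `RuntimeState`, `hasNonRetryableCode` and `stateFor` are hypothetical names, and the non-retryable set contains only the codes that the Appendix table explicitly labels as non-retryable. The actual split may classify some codes differently.

```go
package main

import "fmt"

// ErrorCode holds the string codes listed in the Appendix table.
type ErrorCode string

// RuntimeState is a hypothetical stand-in for the RuntimeCR status values.
type RuntimeState string

const (
	StatePending RuntimeState = "Pending"
	StateError   RuntimeState = "Error"
)

// LastError is a simplified, hypothetical stand-in for a Gardener
// last-error entry, which can carry several error codes at once.
type LastError struct {
	Description string
	Codes       []ErrorCode
}

// nonRetryable is an assumption: it contains only the codes that the
// Appendix table explicitly describes as non-retryable.
var nonRetryable = map[ErrorCode]bool{
	"ERR_INFRA_UNAUTHENTICATED": true,
	"ERR_INFRA_UNAUTHORIZED":    true,
	"ERR_INFRA_QUOTA_EXCEEDED":  true,
	"ERR_INFRA_DEPENDENCIES":    true,
	"ERR_CONFIGURATION_PROBLEM": true,
}

// hasNonRetryableCode reports whether any of the given errors carries
// at least one code from the nonRetryable set.
func hasNonRetryableCode(lastErrors ...LastError) bool {
	for _, e := range lastErrors {
		for _, c := range e.Codes {
			if nonRetryable[c] {
				return true
			}
		}
	}
	return false
}

// stateFor maps the errors of a failed operation to a RuntimeCR state.
// Codes outside the nonRetryable set (including unknown ones) are
// treated as recoverable, so the state stays Pending.
func stateFor(lastErrors ...LastError) RuntimeState {
	if hasNonRetryableCode(lastErrors...) {
		return StateError
	}
	return StatePending
}

func main() {
	depleted := LastError{Codes: []ErrorCode{"ERR_INFRA_RESOURCES_DEPLETED"}}
	unauthorized := LastError{Codes: []ErrorCode{"ERR_INFRA_UNAUTHORIZED"}}

	fmt.Println(stateFor(depleted))               // retryable only
	fmt.Println(stateFor(depleted, unauthorized)) // one non-retryable code is enough
}
```

With this shape, a single non-retryable code among several reported errors is enough to move the RuntimeCR to Error, while codes that the documentation leaves unclassified fall through to Pending.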