Instances: 4xx, 5xx and Connection timeout should be retriable (not terminal errors)

kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API

https://svc-cat.io

Apache License 2.0

1.05k stars 385 forks source link

Instances: 4xx, 5xx and Connection timeout should be retriable (not terminal errors) #1715

Closed nilebox closed 6 years ago

nilebox commented 6 years ago

If the OSB request to the broker times out, Service Catalog executes the orphan mitigation and leaves the instance in the TerminalError status:

      message: 'readiness check failed: ErrorCallingProvision: Communication with
        the ClusterServiceBroker timed out; operation will not be retried: Put https://example.com/v2/service_instances/e00dfeb3-a3ce-4ec2
-b2bb-8f1232cc48cc?accepts_incomplete=true:
        net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
      reason: TerminalError
      status: "True"
      type: Error

It means that if OSB broker was temporarily unavailable (or had some other temporary issue leading to slow request processing), Service Catalog won't retry provisioning. So to retry, the user needs to whether delete the instance and create it again, or mutate the spec (updateRequest++).

It's probably not the best UX. Shall we retry a certain number of times before giving up?

P.S. The scope of this issue is bigger, i.e. it also applies to 4xx and 5xx errors (with orphan mitigation required or without). The Kubernetes way of handling errors is to retry after failures (with exponential backoff).

pmorie commented 6 years ago

Agree this should not be a terminal error

On Sun, Feb 4, 2018 at 7:44 PM Nail Islamov notifications@github.com wrote:

If the OSB request to the broker times out, Service Catalog executes the orphan mitigation and leaves the instance in the TerminalError status:
  message: 'readiness check failed: ErrorCallingProvision: Communication with
    the ClusterServiceBroker timed out; operation will not be retried: Put https://micros--platform.ap-southeast-2.dev.atl-paas.net/osb/v2/service_instances/e00dfeb3-a3ce-4ec2
-b2bb-8f1232cc48cc?accepts_incomplete=true https://micros--platform.ap-southeast-2.dev.atl-paas.net/osb/v2/service_instances/e00dfeb3-a3ce-4ec2-b2bb-8f1232cc48cc?accepts_incomplete=true: net/http: request canceled (Client.Timeout exceeded while awaiting headers)' reason: TerminalError status: "True" type: Error

It means that if OSB broker was temporarily unavailable (or had some other temporary issue leading to slow request processing), Service Catalog won't retry provisioning. So to retry, the user needs to whether delete the instance and create it again, or mutate the spec (updateRequest++).

It's probably not the best UX. Shall we retry a certain number of times before giving up?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kubernetes-incubator/service-catalog/issues/1715, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWXmFy6TlKdOGrMfRK6N7_mxuJo4IGxks5tRk78gaJpZM4R4zGI .

nilebox commented 6 years ago

By looking at the code the current behavior was actually is by-design added by @kibbles-n-bytes:

// A timeout error is considered a terminal failure and we
// should initiate orphan mitigation.
if urlErr, ok := err.(*url.Error); ok && urlErr.Timeout() {
    msg := fmt.Sprintf("Communication with the ClusterServiceBroker timed out; operation will not be retried: %v", urlErr)
    readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, msg)
    failedCond := newServiceInstanceFailedCondition(v1beta1.ConditionTrue, reason, msg)
    return c.processProvisionFailure(instance, readyCond, failedCond, true)
}

// All other errors should be retried, unless the
// reconciliation retry time limit has passed.

I see 2 options for resolving this issue: 1) just remove this block and treat the connection timeout the same way as the other errors, i.e. as retriable straight-away and no orphan mitigation 2) leave the orphan mitigation for connection timeout and implement retry loop at the end of orphan mitigation.

@pmorie @kibbles-n-bytes what do you think is a better solution?

nilebox commented 6 years ago

After scanning through the OSB spec again, it seems that Option 2 is the only one that is considered valid by the spec.

The problem seems to be more serious than just connection timeout though. 400 Bad Request is also considered a terminal error, for example. And it certainly doesn't make sense to retry after receiving this error (unlike request timeout). But the current implementation of Service Catalog won't retry after receiving 400 Bad Request even if the ServiceInstance spec gets updated (which might include correct parameters).

So it looks to me that we should start a whole new process of categorizing errors that are currently considered "terminal". Do we even need "forever terminal" errors at all, i.e. not retrying even after the spec has changed?

/cc @ash2k @duglin @vaikas-google I would like to hear your thoughts as well.

nilebox commented 6 years ago

I took a look at the OSB provisioning errors, and what IMO should be changed, marked those in bold with asterisk (*):

Error	Orphan mitigation required?	Retriable automatically (after orphan mitigation if required)?	Retriable after `ServiceInstance` spec updated?
Timeout	Yes	Yes*	Yes*
4xx	No	No	Yes*
5xx	Yes	Yes*	Yes*

*The current behavior is "No", and should be changed to "Yes"

nilebox commented 6 years ago

Proposed changes in Service Catalog code to resolve this issue:

1. Move all non-terminal errors from Failed:True to Ready:False with Reason set (the only error that should be kept in Failed:True then is for 400 Bad Request which will require retry only after the spec has changed)

this allows to retry immediately without requiring to update the spec
this also means that we will have cases where we need to perform orphan mitigation before retrying

2. Always retry if the spec has changed (even if the Failed:True condition is set), i.e. no more "terminal" errors, just temporary errors that require fixing the spec, see #1751

addresses the issue for retrying after receiving 400 Bad Request and fixing the spec

3. Don’t set the ReconciledGeneration after orphan mitigation (i.e. if DeletionTimestamp == nil), or switch to ObservedGeneration as described in #1747

this allows to retry after orphan mitigation has succeeded

@pmorie @kibbles-n-bytes @jboyd01 @MHBauer @arschles please review the checklist above. Does it look reasonable to you?