Closed nilebox closed 6 years ago
Agree this should not be a terminal error
On Sun, Feb 4, 2018 at 7:44 PM Nail Islamov notifications@github.com wrote:
If the OSB request to the broker times out, Service Catalog executes the orphan mitigation and leaves the instance in the TerminalError status:
message: 'readiness check failed: ErrorCallingProvision: Communication with the ClusterServiceBroker timed out; operation will not be retried: Put https://micros--platform.ap-southeast-2.dev.atl-paas.net/osb/v2/service_instances/e00dfeb3-a3ce-4ec2
-b2bb-8f1232cc48cc?accepts_incomplete=true https://micros--platform.ap-southeast-2.dev.atl-paas.net/osb/v2/service_instances/e00dfeb3-a3ce-4ec2-b2bb-8f1232cc48cc?accepts_incomplete=true: net/http: request canceled (Client.Timeout exceeded while awaiting headers)' reason: TerminalError status: "True" type: Error
It means that if OSB broker was temporarily unavailable (or had some other temporary issue leading to slow request processing), Service Catalog won't retry provisioning. So to retry, the user needs to whether delete the instance and create it again, or mutate the spec (updateRequest++).
It's probably not the best UX. Shall we retry a certain number of times before giving up?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/kubernetes-incubator/service-catalog/issues/1715, or mute the thread https://github.com/notifications/unsubscribe-auth/AAWXmFy6TlKdOGrMfRK6N7_mxuJo4IGxks5tRk78gaJpZM4R4zGI .
By looking at the code the current behavior was actually is by-design added by @kibbles-n-bytes:
// A timeout error is considered a terminal failure and we
// should initiate orphan mitigation.
if urlErr, ok := err.(*url.Error); ok && urlErr.Timeout() {
msg := fmt.Sprintf("Communication with the ClusterServiceBroker timed out; operation will not be retried: %v", urlErr)
readyCond := newServiceInstanceReadyCondition(v1beta1.ConditionFalse, reason, msg)
failedCond := newServiceInstanceFailedCondition(v1beta1.ConditionTrue, reason, msg)
return c.processProvisionFailure(instance, readyCond, failedCond, true)
}
// All other errors should be retried, unless the
// reconciliation retry time limit has passed.
I see 2 options for resolving this issue: 1) just remove this block and treat the connection timeout the same way as the other errors, i.e. as retriable straight-away and no orphan mitigation 2) leave the orphan mitigation for connection timeout and implement retry loop at the end of orphan mitigation.
@pmorie @kibbles-n-bytes what do you think is a better solution?
After scanning through the OSB spec again, it seems that Option 2 is the only one that is considered valid by the spec.
The problem seems to be more serious than just connection timeout though.
400 Bad Request
is also considered a terminal error, for example. And it certainly doesn't make sense to retry after receiving this error (unlike request timeout). But the current implementation of Service Catalog won't retry after receiving 400 Bad Request
even if the ServiceInstance
spec gets updated (which might include correct parameters).
So it looks to me that we should start a whole new process of categorizing errors that are currently considered "terminal". Do we even need "forever terminal" errors at all, i.e. not retrying even after the spec has changed?
/cc @ash2k @duglin @vaikas-google I would like to hear your thoughts as well.
I took a look at the OSB provisioning errors, and what IMO should be changed, marked those in bold with asterisk (*):
Error | Orphan mitigation required? | Retriable automatically (after orphan mitigation if required)? | Retriable after ServiceInstance spec updated? |
---|---|---|---|
Timeout | Yes | **Yes*** | **Yes*** |
4xx | No | No | **Yes*** |
5xx | Yes | **Yes*** | **Yes*** |
*The current behavior is "No", and should be changed to "Yes"
Proposed changes in Service Catalog code to resolve this issue:
1. Move all non-terminal errors from Failed:True
to Ready:False
with Reason
set (the only error that should be kept in Failed:True
then is for 400 Bad Request
which will require retry only after the spec has changed)
2. Always retry if the spec has changed (even if the Failed:True
condition is set), i.e. no more "terminal" errors, just temporary errors that require fixing the spec, see #1751
400 Bad Request
and fixing the spec3. Don’t set the ReconciledGeneration
after orphan mitigation (i.e. if DeletionTimestamp == nil
), or switch to ObservedGeneration
as described in #1747
@pmorie @kibbles-n-bytes @jboyd01 @MHBauer @arschles please review the checklist above. Does it look reasonable to you?
If the OSB request to the broker times out, Service Catalog executes the orphan mitigation and leaves the instance in the TerminalError status:
It means that if OSB broker was temporarily unavailable (or had some other temporary issue leading to slow request processing), Service Catalog won't retry provisioning. So to retry, the user needs to whether delete the instance and create it again, or mutate the spec (
updateRequest++
).It's probably not the best UX. Shall we retry a certain number of times before giving up?
P.S. The scope of this issue is bigger, i.e. it also applies to 4xx and 5xx errors (with orphan mitigation required or without). The Kubernetes way of handling errors is to retry after failures (with exponential backoff).