Closed: jaymccon closed this issue 5 years ago
@carolynvs @duglin @jboyd01 Would you mind providing a review on this issue?
I'm ok with an invalid/won't fix response; then at least I know that I need to build in different handling for k8s/cf.
If it's something that you agree should be addressed, then I'm happy to work on a PR.
I actually think the current version of OSBAPI is broken. The 200 you're seeing on the last_op response just means that the polling worked ok; it has no bearing on whether the provision worked or not. Therefore whether we do orphan mitigation should be based on the status within the response body, similar to the HTTP response code in the sync provision case. In fact, see: https://github.com/openservicebrokerapi/servicebroker/pull/570/files#diff-958e7270f96f5407d7d980f500805b1bR1632 which tries to align the two.
You mentioned that CF acts correctly, meaning it doesn't do orphan mitigation. I actually think this might be a mistake from a consistency perspective because there are lots of other cases where it does try to clean up automatically. I believe the reason they don't is so that the user can try to debug the broker to see what's going on. A nice thought, but then they should do that in the sync cases too - and not just for provisioning but all orphan mitigation mandated cases. I would think the user might want to debug those cases too. IMO either the broker is a black box or not - the idea of twiddling that premise based on the OSBAPI op feels very odd to me. I think https://github.com/openservicebrokerapi/servicebroker/pull/598 tries to address some of that.
Net: let's clean up the OSBAPI spec first and then PR svc-cat to make it align. It sounds like you would like to see more cases where the platform doesn't automatically (immediately) do orphan mitigation - which is what I think PR 598 in OSB is trying to allow for, so you may want to join in that conversation to ensure it's headed the direction you'd like to see. Does this make any sense?
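To make that distinction concrete, here is a minimal sketch (not service-catalog code; the broker URL and error handling are simplified assumptions) of a platform-side poll where the HTTP status only says whether the poll itself worked, while the `state` field in the body carries the provision outcome:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// lastOperation mirrors the OSB API last_operation response body: a "state"
// of "in progress", "succeeded", or "failed", plus a human-readable description.
type lastOperation struct {
	State       string `json:"state"`
	Description string `json:"description"`
}

// pollLastOperation illustrates the point above: a 200 from the poll only means
// the poll worked; the verdict on the provision lives in the decoded body.
func pollLastOperation(url string) (lastOperation, error) {
	var op lastOperation
	resp, err := http.Get(url)
	if err != nil {
		return op, err // the poll itself failed; retry the poll, not the provision
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return op, fmt.Errorf("last_operation poll returned %d", resp.StatusCode)
	}
	err = json.NewDecoder(resp.Body).Decode(&op)
	return op, err
}

func main() {
	// Hypothetical broker URL; the path shape follows the OSB last_operation endpoint.
	op, err := pollLastOperation("http://broker.example/v2/service_instances/abc/last_operation")
	if err != nil {
		fmt.Println("poll error:", err)
		return
	}
	// Orphan mitigation / failure handling should key off op.State, not the HTTP 200.
	fmt.Println(op.State, op.Description)
}
```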
> With the useful description I passed back in my response swallowed.
I didn't attempt to validate this, but I believe if you describe the failed instance you will see events with your broker provisioning error message. We may need to work on trickling this provision failure cause back to the user.
@duglin thanks for clarifying, I'll get involved in the discussion on openservicebrokerapi/servicebroker#598
Looking at this a bit further, I don't mind the orphan mitigation, assuming that errors are trickled up better. What I do need is a way to opt out of the retries that occur after OrphanMitigation. Is there any approach you can think of that will allow me to signal to service catalog that the service instance is in a permanent failure state, so that retries don't happen?
Initially I thought the retries were only happening under OpenShift, but I'm seeing the same retry behavior in 0.1.38 under Kubernetes, and from what I can see the only way for me to stop service catalog from retrying is to return a 400.
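For illustration, a minimal sketch of that workaround as a broker-side handler, assuming the failure can already be detected when the provision request arrives (the route, port, and message are made up):

```go
package main

import (
	"encoding/json"
	"net/http"
)

// provision illustrates the workaround described above: failing the provision
// request itself with a 400 is what (empirically) stops the retry loop, at the
// cost of only working when the failure is known synchronously.
func provision(w http.ResponseWriter, r *http.Request) {
	// ... suppose validation shows the request can never succeed ...
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusBadRequest) // 400 rather than 202 plus a later failed last_operation
	json.NewEncoder(w).Encode(map[string]string{
		"description": "error message", // surfaced to the user instead of being swallowed
	})
}

func main() {
	http.HandleFunc("/v2/service_instances/", provision)
	http.ListenAndServe(":8080", nil)
}
```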
> I believe if you describe the failed instance you will see events with your broker provisioning error message. We may need to work on trickling this provision failure cause back to the user.
Thanks @jboyd01, just checked this: the error message is present in events with `kubectl describe ServiceInstances`, but `svcat describe instance blah` only contains the most recent message, which changes depending on what part of the provision/orphan mitigation retry loop ran last.
@jaymccon D'oh! That's not helpful. Would you be up for opening an issue for svcat so that it does a better job of surfacing up all the error messages?
> Based on my interpretation of the OSB spec this is the appropriate response for a permanent failure.
@jaymccon the problem with the last operation is that there is no way to distinguish permanent failures from temporary ones. As a result, Service Catalog treats any failure of an async operation as temporary. Having an additional field alongside `status: failed` would be helpful. /cc @duglin
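Purely as a hypothetical illustration of the kind of signal being asked for (no such field exists in the OSB API at this point in the thread), the last_operation response could carry an extra flag marking the failure as permanent:

```go
package osb

// Hypothetical sketch only: the OSB API last_operation response has no such
// field at the time of this thread. It shows the kind of signal requested
// above, letting a broker mark a failed operation as permanent.
type LastOperationResponse struct {
	State       string `json:"state"`                 // "in progress", "succeeded", "failed"
	Description string `json:"description,omitempty"` // human-readable detail for the user
	Retriable   *bool  `json:"retriable,omitempty"`   // hypothetical: false = permanent failure, do not retry
}
```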
Yep - I really think https://github.com/openservicebrokerapi/servicebroker/pull/570 will help here.
@duglin, this is likely due to my lack of familiarity with the OSBAPI, but I fail to see anything in that PR (or anywhere in the spec) that deals with retrying provision requests. Could you clarify how openservicebrokerapi/servicebroker#570 will help?
See https://github.com/kubernetes-incubator/service-catalog/issues/2505#issuecomment-442658960 to see why I'm linking things - but perhaps I'm mixing stuff up
Maybe I should clarify from my end, I may have been a bit hasty in creating this issue without thorough testing and muddied the water. Let me know if you think I should close this and open a new issue so that it's less confusing.
What happened:
When an async provision fails with a permanent/unretryable error my broker returns a `200` response with a `{ "status": "failed", "description": "error message" }` payload.
Service catalog then keeps retrying for an undetermined amount of time (I tested for around an hour and it hadn't given up on retrying). The retry loop consists of: a `200` response with a `{ "status": "failed", "description": "error message" }` payload.
Service Instance events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning RetryBackoff 48m service-catalog-controller-manager Delaying provision retry, next attempt will be after 2018-11-29 18:29:47.307570229 +0000 UTC m=+235153.232052284
Warning RetryBackoff 48m service-catalog-controller-manager Delaying provision retry, next attempt will be after 2018-11-29 18:30:05.70774795 +0000 UTC m=+235171.632229977
Warning RetryBackoff 47m service-catalog-controller-manager Delaying provision retry, next attempt will be after 2018-11-29 18:30:26.307627592 +0000 UTC m=+235192.232109645
Warning StartingInstanceOrphanMitigation 47m (x4 over 48m) service-catalog-controller-manager The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
Normal Deprovisioning 47m (x4 over 48m) service-catalog-controller-manager The instance is being deprovisioned asynchronously
Warning ProvisionCallFailed 47m (x4 over 48m) service-catalog-controller-manager Provision call failed: The following resource(s) failed to create: [AWSSBInjectedIAMUser, AWSSBInjectedIAMUserRole, LoggingBucket]. . Rollback requested by user.
Warning RetryBackoff 47m service-catalog-controller-manager Delaying provision retry, next attempt will be after 2018-11-29 18:30:50.707616705 +0000 UTC m=+235216.632098768
Normal OrphanMitigationSuccessful 21m (x10 over 48m) service-catalog-controller-manager Orphan mitigation was completed successfully
Warning RetryBackoff 16m (x3 over 21m) service-catalog-controller-manager Delaying provision retry, next attempt will be after 2018-11-29 19:05:34.502052127 +0000 UTC m=+237300.426534182
Normal Provisioning 12m (x11 over 48m) service-catalog-controller-manager The instance is being provisioned asynchronously
Warning RetryBackoff 1m (x4 over 12m) service-catalog-controller-manager Delaying provision retry, next attempt will be after 2018-11-29 19:22:54.907944257 +0000 UTC m=+238340.832426270
What you expected to happen:
Service catalog performs orphan mitigation and then (without any retries) puts the ServiceInstance into a Failed state.
or
Service catalog provides brokers with a mechanism to indicate that a failure is permanent, thus avoiding the retry loop.
How to reproduce it (as minimally and precisely as possible):
Send a `200` response with a `{ "status": "failed", "description": "error message" }` payload to a LastOperationRequest.
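A minimal broker-side sketch that reproduces this (the route and port are arbitrary; note the OSB spec names the field `state`, so the sketch uses that rather than `status`):

```go
package main

import (
	"encoding/json"
	"net/http"
)

// lastOperation always reports a failed provision: HTTP 200 (the poll worked)
// with a body whose state is "failed". Against the affected service-catalog
// versions this produces the orphan-mitigation/retry loop shown in the events above.
func lastOperation(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	json.NewEncoder(w).Encode(map[string]string{
		"state":       "failed",
		"description": "error message",
	})
}

func main() {
	// Arbitrary local port; wire this path up as the broker's last_operation endpoint.
	http.HandleFunc("/v2/service_instances/", lastOperation)
	http.ListenAndServe(":8080", nil)
}
```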
@duglin in this case the problem is not whether to do orphan mitigation or not. As @jaymccon pointed out above, the problem is what to do once orphan mitigation is finished: give up or retry.
@duglin @jboyd01 what do you think about treating `state: failed` as a permanent failure? i.e. we don't retry, and if the user wants to force a retry they can do `svcat touch instance` or make any other change to the spec. Would it break any of your use cases?
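Sketched in code, the proposed handling (not actual service-catalog logic, just the decision it implies) would look roughly like:

```go
package controller

// Sketch only of the proposal above, not actual service-catalog code: a "failed"
// state from last_operation ends the provision attempt instead of requeueing it.
func handleProvisionPollState(state string) {
	switch state {
	case "succeeded":
		// mark the ServiceInstance Ready
	case "failed":
		// finish orphan mitigation, then mark the instance permanently Failed;
		// the user can force a new attempt with `svcat touch instance` or any
		// other change to the spec
	default: // "in progress"
		// keep polling last_operation with backoff
	}
}
```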
@jaymccon See https://github.com/kubernetes-incubator/service-catalog/pull/2520 that stops the retry loop. The orphan mitigation should still be left in place IMO, otherwise we won't be able to safely retry provisioning after a user updates the parameters.
Awesome stuff @nilebox, this PR will help us immensely.
Bug Report
What happened:
When an async provision fails with a permanent/unretryable error I return a `200` response with a `{ "status": "failed", "description": "error message" }` payload. Based on my interpretation of the OSB spec this is the appropriate response for a permanent failure. service-catalog logs the following on this response:
My broker then receives a deprovision request, and the resource ends up looking like:
With the useful description I passed back in my response swallowed.
What you expected to happen:
No orphan mitigation, service instance goes immediately into a failed state, surfacing my error message, something like:
How to reproduce it (as minimally and precisely as possible):
Send a `200` response with a `{ "status": "failed", "description": "error message" }` payload to a LastOperationRequest.
Anything else we need to know?:
I wouldn't mind this so much if:
ambiguous [1]

Environment:
Kubernetes version (use `kubectl version`):