kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API
https://svc-cat.io
Apache License 2.0

LastOperationRequest 200 response incorrectly triggers orphan mitigation #2505

Closed jaymccon closed 5 years ago

jaymccon commented 5 years ago

Bug Report

What happened:

When an async provision fails with a permanent/unretryable error, I return a 200 response with a { "status": "failed", "description": "error message" } payload. Based on my interpretation of the OSB spec, this is the appropriate response for a permanent failure.
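As a concrete sketch, a broker's polling handler behaving this way might look like the following (hypothetical function and names, not the actual broker code in this thread; note that the OSB spec names the body field `state`, where the quotes in this thread write `status`):

```python
import json

# Hypothetical sketch of a broker's last_operation handler for an async
# provision that failed permanently. Names are illustrative only.
def last_operation_response(operation_failed, message):
    body = {
        # The OSB spec's polling response carries the outcome in "state".
        "state": "failed" if operation_failed else "succeeded",
        "description": message,
    }
    # The poll itself succeeded, so the HTTP status is 200 even though the
    # provision operation it reports on has failed.
    return 200, json.dumps(body)
```

The key point of the bug report is that the 200 here describes the poll, not the provision; the failure is expressed entirely in the body.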

service-catalog logs the following on this response:

event.go:221] Event(v1.ObjectReference{...): type: 'Warning' reason: 'StartingInstanceOrphanMitigation' The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource

My broker then receives a deprovision request, and the resource ends up looking like:

$ svcat get instances
     NAME      NAMESPACE   CLASS    PLAN             STATUS
+-----------+-----------+-------+--------+----------------------------+
  test-fail   default     s3      custom   OrphanMitigationSuccessful

$ svcat get instance test-fail -o yaml
...
status:
  asyncOpInProgress: false
  conditions:
  - lastTransitionTime: 2018-11-28T21:44:25Z
    message: Orphan mitigation was completed successfully
    reason: OrphanMitigationSuccessful
    status: "False"
    type: Ready
  deprovisionStatus: Succeeded
  observedGeneration: 1
  orphanMitigationInProgress: false
  provisionStatus: NotProvisioned
  reconciledGeneration: 0

The useful description I passed back in my response was swallowed.

What you expected to happen:

No orphan mitigation; the service instance goes immediately into a failed state, surfacing my error message. Something like:

$ svcat get instance test-fail -o yaml
...
status:
  asyncOpInProgress: false
  conditions:
  - lastTransitionTime: 2018-11-28T21:44:25Z
    message: My error message I sent in the "description" of the LastOperationRequest response
    reason: Failed
    status: "False"
    type: Ready

How to reproduce it (as minimally and precisely as possible):

Send a 200 response with a { "status": "failed", "description": "error message" } payload to a LastOperationRequest.

Anything else we need to know?:

I wouldn't mind this so much if:

Environment:

jaymccon commented 5 years ago

@carolynvs @duglin @jboyd01 Would you mind providing a review of this issue?

I'm OK with an invalid/won't-fix response; then at least I know that I need to build in different handling for k8s/CF.

If it's something that you agree should be addressed, then I'm happy to work on a PR.

duglin commented 5 years ago

I actually think the current version of OSBAPI is broken. The 200 you're seeing on the last_op response just means that the polling worked OK; it has no bearing on whether the provision worked or not. Therefore, whether we do orphan mitigation should be based on the status within the response body, similar to the HTTP response code in the sync provision case. In fact, see: https://github.com/openservicebrokerapi/servicebroker/pull/570/files#diff-958e7270f96f5407d7d980f500805b1bR1632 which tries to align the two.
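The distinction being drawn here can be sketched as follows (an assumption about the proposed behavior, not actual service-catalog code; the function name and logic are illustrative):

```python
# Sketch of the proposed decision: trigger orphan mitigation only when the
# provision outcome is genuinely ambiguous, not merely because the poll
# itself returned a non-200. Illustrative assumption, not real controller code.
def needs_orphan_mitigation(poll_http_code, poll_state):
    if poll_http_code != 200:
        # The poll itself failed, so we don't know whether the provision
        # left a resource behind: the truly ambiguous case.
        return True
    # A 200 poll only means polling worked; the operation's outcome is in
    # the body. A definitive state from the broker is not ambiguous.
    return poll_state not in ("in progress", "succeeded", "failed")
```

Under this sketch, a 200 poll reporting "failed" is a definitive answer from the broker, not the "ambiguous error" that the event log above describes.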

You mentioned that CF acts correctly, meaning it doesn't do orphan mitigation. I actually think this might be a mistake from a consistency perspective, because there are lots of other cases where it does try to clean up automatically. I believe the reason they don't is so that the user can try to debug the broker to see what's going on. A nice thought, but then they should do that in the sync cases too - and not just for provisioning but for all orphan-mitigation-mandated cases. I would think the user might want to debug those cases too. IMO either the broker is a black box or it isn't - the idea of twiddling that premise based on the OSBAPI op feels very odd to me. I think https://github.com/openservicebrokerapi/servicebroker/pull/598 tries to address some of that.

Net: let's clean up the OSBAPI spec first and then PR svc-cat to make it align. It sounds like you would like to see more cases where the platform doesn't automatically (immediately) do orphan mitigation - which is what I think PR 598 in OSB is trying to allow for - so you may want to join that conversation to ensure it's headed in the direction you'd like to see. Does this make any sense?

jboyd01 commented 5 years ago

With the useful description I passed back in my response swallowed.

I didn't attempt to validate this, but I believe if you describe the failed instance you will see events with your broker's provisioning error message. We may need to work on trickling this provision failure cause back to the user.

jaymccon commented 5 years ago

@duglin thanks for clarifying, I'll get involved in the discussion on openservicebrokerapi/servicebroker#598

Looking at this a bit further, I don't mind the orphan mitigation, assuming that errors are trickled up better. What I do need is a way to opt out of the retries that occur after orphan mitigation. Is there any approach you can think of that would allow me to signal to service catalog that the service instance is in a permanent failure state, so that retries don't happen?

Initially I thought the retries were only happening under OpenShift, but I am seeing the same behavior in 0.1.38 under Kubernetes, and from what I can see the only way for me to stop service catalog from retrying is to return a 400.

jaymccon commented 5 years ago

I believe if you describe the failed instance you will see events with your broker provisioning error message. We may need to work on trickling this provision failure cause back to the user.

Thanks @jboyd01, just checked this: the error message is present in events with kubectl describe ServiceInstances, but svcat describe instance blah only contains the most recent message, which changes depending on which part of the provision/orphan-mitigation retry loop ran last.

carolynvs commented 5 years ago

@jaymccon D'oh! That's not helpful. Would you be up for opening an issue for svcat so that it does a better job of surfacing all the error messages?

nilebox commented 5 years ago

Based on my interpretation of the osb spec this is the appropriate response for a permanent failure.

@jaymccon the problem with the last operation is that there is no way to distinguish permanent failures from temporary ones. As a result, Service Catalog treats any failure of an async operation as temporary. Having an additional field alongside status: failed would be helpful. /cc @duglin
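For illustration, such an extended payload and the corresponding platform-side check might look like the sketch below. The `retriable` field is invented here to make the idea concrete; it is not part of the OSB spec at the time of this thread:

```python
import json

# Hypothetical payload: an explicit flag so the platform can distinguish
# permanent failures from temporary ones. "retriable" is an invented name.
poll_body = json.dumps({
    "state": "failed",
    "description": "error message",
    "retriable": False,
})

def should_retry(raw_body):
    body = json.loads(raw_body)
    if body.get("state") != "failed":
        return False  # nothing failed, so nothing to retry
    # Default to retrying, matching Service Catalog's current behavior of
    # treating every async failure as temporary when no flag is present.
    return body.get("retriable", True)
```

A broker that never wants retries would always send the flag as false; brokers unaware of the field would keep today's retry behavior.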

duglin commented 5 years ago

yep- I really think https://github.com/openservicebrokerapi/servicebroker/pull/570 will help here.

jaymccon commented 5 years ago

@duglin, this is likely due to my lack of familiarity with the OSBAPI, but I fail to see anything in that PR (or anywhere in the spec) that deals with retrying provision requests. Could you clarify how openservicebrokerapi/servicebroker#570 will help?

duglin commented 5 years ago

See https://github.com/kubernetes-incubator/service-catalog/issues/2505#issuecomment-442658960 for why I'm linking things - but perhaps I'm mixing stuff up.

jaymccon commented 5 years ago

Maybe I should clarify from my end: I may have been a bit hasty in creating this issue without thorough testing, and muddied the water. Let me know if you think I should close this and open a new issue so that it's less confusing.

What happened:

When an async provision fails with a permanent/unretryable error my broker returns a 200 response with a { "status": "failed", "description": "error message" } payload.

Service catalog then keeps retrying for an undetermined amount of time (I tested for around an hour and it hadn't given up on retrying). The retry loop consists of:

Service Instance events:

Events:
  Type     Reason                            Age                 From                                Message
  ----     ------                            ----                ----                                -------
  Warning  RetryBackoff                      48m                 service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-11-29 18:29:47.307570229 +0000 UTC m=+235153.232052284
  Warning  RetryBackoff                      48m                 service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-11-29 18:30:05.70774795 +0000 UTC m=+235171.632229977
  Warning  RetryBackoff                      47m                 service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-11-29 18:30:26.307627592 +0000 UTC m=+235192.232109645
  Warning  StartingInstanceOrphanMitigation  47m (x4 over 48m)   service-catalog-controller-manager  The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
  Normal   Deprovisioning                    47m (x4 over 48m)   service-catalog-controller-manager  The instance is being deprovisioned asynchronously
  Warning  ProvisionCallFailed               47m (x4 over 48m)   service-catalog-controller-manager  Provision call failed: The following resource(s) failed to create: [AWSSBInjectedIAMUser, AWSSBInjectedIAMUserRole, LoggingBucket]. . Rollback requested by user.
  Warning  RetryBackoff                      47m                 service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-11-29 18:30:50.707616705 +0000 UTC m=+235216.632098768
  Normal   OrphanMitigationSuccessful        21m (x10 over 48m)  service-catalog-controller-manager  Orphan mitigation was completed successfully
  Warning  RetryBackoff                      16m (x3 over 21m)   service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-11-29 19:05:34.502052127 +0000 UTC m=+237300.426534182
  Normal   Provisioning                      12m (x11 over 48m)  service-catalog-controller-manager  The instance is being provisioned asynchronously
  Warning  RetryBackoff                      1m (x4 over 12m)    service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-11-29 19:22:54.907944257 +0000 UTC m=+238340.832426270

What you expected to happen:

Service catalog performs orphan mitigation and then (without any retries) puts the ServiceInstance into a Failed state.

or

Service catalog provides brokers with a mechanism to indicate that a failure is permanent, thus avoiding the retry loop.

How to reproduce it (as minimally and precisely as possible):

Send a 200 response with a { "status": "failed", "description": "error message" } payload to a LastOperationRequest.

nilebox commented 5 years ago

@duglin in this case the problem is not whether to do orphan mitigation or not. As @jaymccon pointed out above, the problem is what to do once orphan mitigation is finished: give up or retry.

@duglin @jboyd01 what do you think about treating state: failed as a permanent failure? i.e. we don't retry, and if the user wants to force a retry they can do svcat touch instance or make any other change in the spec. Would it break any of your use cases?
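This proposal can be sketched as the following controller-side check (assumed logic, not actual service-catalog code; it assumes a spec change such as svcat touch instance bumps the spec generation past the last reconciled one):

```python
# Sketch of the proposal above (illustrative, not real controller code):
# treat state: failed as terminal, and only retry provisioning when the
# user has changed the spec since the failed attempt.
def should_reprovision(spec_generation, reconciled_generation, last_state):
    if last_state != "failed":
        return False  # nothing failed, so nothing to retry here
    # A permanent failure: retry only if the spec changed since the failed
    # attempt (e.g. via `svcat touch instance` or an edited parameter).
    return spec_generation != reconciled_generation
```

This keeps orphan mitigation intact while breaking the automatic retry loop: the failed instance simply stays failed until the user acts.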

nilebox commented 5 years ago

@jaymccon See https://github.com/kubernetes-incubator/service-catalog/pull/2520 which stops the retry loop. The orphan mitigation should still be left in place IMO, otherwise we won't be able to safely retry provisioning after a user updates the parameters.

jaymccon commented 5 years ago

Awesome stuff @nilebox, this PR will help us immensely.