openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0
8.5k stars 4.7k forks source link

Can't delete service instance from template broker if provision fails. #16515

Closed ncameronbritt closed 7 years ago

ncameronbritt commented 7 years ago

Error deprovisioning ServiceInstance "crash-2/mysql-ephemeral-df1hg" of ServiceClass "mysql-ephemeral" at ServiceBroker "template-service-broker": ""

Service instance yamls: while provisioning: https://gist.github.com/ncameronbritt/64ea769cc377fb83a76bc14c06e97b9f

after failure: https://gist.github.com/ncameronbritt/3d94410f8b546b0f3e8d9edf817ab4a3

after deleting: https://gist.github.com/ncameronbritt/2a125939ab8da92e0d4bad6a5b3cfd79

Controller manager logs: https://gist.github.com/ncameronbritt/64e230dddbbc01a93d7b647d5e1ab6c2

@bparees @pmorie @spadgett

Version

oc v3.7.0-alpha.1+3fa1968-489 kubernetes v1.7.0+80709908fd features: Basic-Auth

Server https://127.0.0.1:8443 openshift v3.7.0-alpha.1+50bb01b-594 kubernetes v1.7.0+80709908fd

Steps To Reproduce
  1. Provision template that will fail from Service catalog
  2. Wait for provision to fail
  3. Attempt to delete service instance
Current Result

Service instance is marked for deletion, but does not actually get deleted/removed from the ui.

Expected Result

Failed Service instance is actually deleted.

Additional Information

[try to run $ oc adm diagnostics (or oadm diagnostics) command if possible] [if you are reporting issue related to builds, provide build logs with BUILD_LOGLEVEL=5] [consider attaching output of the $ oc get all -o json -n <namespace> command to the issue] [visit https://docs.openshift.org/latest/welcome/index.html]

jim-minter commented 7 years ago

@pmorie I think this is a service catalog problem, please can you assign as required.

Issue 1: the error message "Error deprovisioning ServiceInstance"... is wrong. Reading the code, I think the SC is meaning to report that the provision has failed (which is indeed the case).

I see no evidence that the SC immediately attempts to deprovision as a result of the failed provision (nor am I saying that it should).

Issue 2: I am able to remove the failed provision via the web console, but this does not result in the SC calling the TSB to deprovision. As a result the related objects are not deleted by the TSB.

I don't know what the OSBAPI spec says in this case, but I think that fundamentally the SC should notify the broker in question when it wants to deprovision a failed provision.

pmorie commented 7 years ago

The OSB spec does 'orphan mitigation' if there is a communication error or timeout that means that the platform can't be sure if the provision succeeded or not. The broker is responsible for cleaning up after failed provisions that it knows about.

bparees commented 7 years ago

The OSB spec does 'orphan mitigation' if there is a communication error or timeout that means that the platform can't be sure if the provision succeeded or not.

pretty sure that is not the case here. we told the SC the provision failed.

Not sure how the SC would ever not know, since it's an async provision flow, so the SC can just keep calling lastoperation until you get a result.

bparees commented 7 years ago

(also how is the TSB supposed to know if the SC got a communication error?)

spadgett commented 7 years ago

@bparees @jim-minter My understanding from @pmorie is that the service catalog does not call deprovision by design if the provision fails. It's assumed that the broker will clean things up automatically on a failed provision and there's nothing for the service catalog to do.

jim-minter commented 7 years ago

I think that the spec is deficient in specifying what should happen here (I don't think the orphan mitigation piece covers it). What implemented behaviour do other brokers have?

jim-minter commented 7 years ago

So far I haven't found any indication that the Ansible broker self-deprovisions in the case of a provision failure. @jmrodri ? I've had a cursory look and haven't found any code implementing this in the sample brokers listed at https://github.com/openservicebrokerapi/servicebroker/blob/master/gettingStarted.md either. I don't think the spec is clear. I don't think it makes great sense from an API extensibility perspective for all brokers to unilaterally to deprovision upon a provision failure, and I don't think it's very pragmatic at this point to start to make them do it.

jmrodri commented 7 years ago

@jim-minter to the best of my knowledge we don't do any active orphan mitigation. If a provision request fails we return an appropriate error response (or a 500 if it's catastrophic) back to the service-catalog.

It was my understanding that the service-catalog would call deprovision on a failed attempt just to ensure that there were no lingering items left in the broker.

https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md#orphans

To mitigate orphan service instances and bindings, the marketplace SHOULD attempt to delete resources it cannot be sure were successfully created, and SHOULD keep trying to delete them until the service broker responds with a success.

And in terms of a failed provision, I'd assume this would be the case:

If the platform marketplace encounters an internal error provisioning a service instance or binding (for example, saving to the database fails), then it MUST at least send a single delete or unbind request to the service broker to prevent creation of an orphan.

pmorie commented 7 years ago

So, there is a gap in the spec when it comes to what the platform should do and what the broker should expect in this situation. We're working on resolving it - see https://github.com/openservicebrokerapi/servicebroker/issues/348.

pmorie commented 7 years ago

This should be fixed now that #17075 has merged. @ncameronbritt can you reopen this if you still see this behavior?

spadgett commented 7 years ago

I've also verified the fix. @ncameronbritt You'll need to pull the latest openshift/origin-service-catalog image when you cluster up to test.