openservicebrokerapi / servicebroker

Open Service Broker API Specification
https://openservicebrokerapi.org/
Apache License 2.0
1.19k stars 435 forks source link

Should the platform clean up resources when an async provision/bind returns state failed? #573

Closed Samze closed 5 years ago

Samze commented 6 years ago

This PR clarified clean up scenarios when the broker return unexpected error codes. However in addition it added a new MUST:

If the successful response includes a `state` of `failed` then the Platform
MUST send a deprovision request to the Service Broker to prevent an orphan
being created on the Service Broker.

I see this type of cleanup as different to orphan mitigation. Orphan mitigation ensures the platform and broker have a common understanding of what resources exist (to avoid the broker having a resource that the platform is unaware of that a user is being charged for). In this case both the platform and broker knows about the failure. However the spec is dictating that the platform cleanup the resource rather it being triggered by a user - this is beyond orphan mitigation.

Since this MUST was added in 2.14, in CF we simply report the failure back to the user and then allow them to call deprovision in their own time. This for example opens up the ability to investigate the failure before the resources are cleaned up.

I guess this comes down to how much the platform "manages" the lifecycle of instances and what it does on behalf of the user. I wonder if this should be loosened to a MAY to allow platforms to have an opinion on this.

Additionally, technically this was breaking change in the last release as a new MUST was introduced and fixing it would be another breaking change.

duglin commented 6 years ago

In the CF case, if the user never issues the deprovision would CF eventually do it? The spec doesn't mandate when it happens, just that it does happen. I wonder if CF should just take it upon itself to do it after some grace period.

I'm concerned with a MAY because that just means no one can count on it - so people need to be prepared for it not to - which is about the same as the requirement not being there at all. So, we probably need to figure out what the preferred solution should be and then work back from that.

mattmcneeney commented 6 years ago

No, CF won't send the deprovision until/unless the user explicitly asks for it to happen with a cf delete-service.

I'm keen to ensure that operations teams have control of when resources are cleaned up after a failed provision in case they want to debug what actually failed under the hood.

Samze commented 6 years ago

The text originally didn't exist in the spec and so brokers will not have been counting on this behaviour from platforms - at least until the 2.14 release.

I'd be interested to hear more about the advantages to the broker of the guaranteed cleanup in this failure case.

duglin commented 5 years ago

Does CF has the notion of a dead service instance? Meaning "cf services" shows a list of services that the user can not do anything with except delete them?

duglin commented 5 years ago

More broadly, if we don't want to force orphan mitigation in this case, why do we want to force it in others? Couldn't the same argument be made that the Platform (or some admin) wants to debug stuff before the broker is forced to clean things up?

Samze commented 5 years ago

CF exposes the last operation information of a service instance, so the user can see that the instance “creation failed”. Being in this state does not actually limit what a user can do to the instance, so they could try and update it to a different plan for example (not sure this is actually by design or by accident).

More broadly, if we don't want to force orphan mitigation in this case, why do we want to force it in others?

I’ve always thought about orphan mitigation as a way to maintain that the “platform is the source of truth” rather than cleaning up resources as a convenience for the user. It should be used when things are in an “unknown state”.

In the async case the broker explicitly reports that the instance creation has failed via last operation. In the sync case then the failures are implied, a 5xx error or a timeout - the platform doesn’t really know what happened, either could have actually successfully created the instance behind the scenes.

I think the broker only needs to rely on knowing that in failure scenarios the platform is going to assume the worst and the mechanism for clean up is DELETE. Who triggers those deletes and when, seems to me to be a platform detail.

Generally the whole orphan mitigation section has constantly been a source of confusion. https://github.com/openservicebrokerapi/servicebroker/issues/617 https://github.com/openservicebrokerapi/servicebroker/issues/369 Where is that reset button?

Also regardless of what we choose here we should also address the current inconsistency between MUST and SHOULD https://github.com/openservicebrokerapi/servicebroker/issues/370