jboyd01 commented 6 years ago

Purpose

Provide a mechanism for brokers to register callbacks (webhooks) that can be used for validation prior to the platform attempting to Create, Update or Delete (CUD) a Service Instance or Binding.

There will likely be a lot of discussion on the actual implementation details, initially this proposal will just focus on surfacing the issue and proposing the use of pre-action validation so Brokers have an opportunity to indicate to the platform that an action will or will not be accepted for processing. Once the SIG has discussed and given general agreement, we'll drill into a detailed design.

This feature allows a broker to register webhooks for precheck validation for Instances and Bindings. That is, if indicated by the broker, the Platform will invoke a validating webhook just prior to invoking the actual call to create, update or delete an Instance or Binding. The webhook will be invoked with the same parameters and payload as the actual create/update/delete operation, but this operation is a dry-run for Broker validation only. The broker may accept the request and respond with a status 200 OK or may return an error matching the same set of possible returns codes as is documented for the actual operation. At this time, the validation must be synchronous.

Platforms are not required to execute precheck validations - if the platform does not, it is expected the CUD operation will fail in the same manner as the validation would have failed.

Rationale

Some platforms (Kubernetes being one) create platform metadata that represents the Service Instance and Service Binding. In many cases these objects are created or updated prior to invoking Broker endpoints to create, update or delete the Broker resource. If the Broker encounters an error or rejects the request the platform usually leaves the platform object in an error or modified state that no longer reflects the Broker's resource and requires the user or administrator to clean it up or roll it back. Sometimes the platform and user can not recover and are unable to roll back the object to its prior state. The addition of the precheck allows the platform to fail the user's request early before any platform metadata objects are modified that would have have otherwise required cleanup.

Scenario 1 (Instance Update)

User A's update operation will fail because of some ACL or other business policy check. Prior to validation webhooks, the platform's metadata object is updated with the end user's changes but the update operation is marked as failed. If user B then attempts a different update on the object by editing the current object, he'll be working with user A's change as well. If user B doesn't realize this and commits the change, the broker will see the full set of updates and will assume these updates are being requested by user B and if he has privileges, an unintended update may be executed. The precheck adds the ability to verify A's access and other business checks prior to making any changes to the platform metadata object. With prechecks in place, the platform metadata object would never be updated if user A's precheck update operation failed.

Scenario 2 (Instance Deletion)

Today without validating webhooks, when a user deletes an instance the platform metadata object is irreversibly marked as deleted - even if the Broker rejects the delete, the platform object was marked for deletion and this can not be undone. Eventually the Platform resource must be deleted and re-created if necessary - this entails someone with the required privileges actually deleting the resource with the Platform command, waiting for the Broker to also successfully delete its resources, and then recreating the Instance and any associated Bindings. With a validating precheck, the platform checks early to see if the broker is going to reject the operation and if so, the platform can abort the operation and present an appropriate error message to the end user. In the latter case with the validation giving the error, nothing has been changed on the platform side; no cleanup is necessary.

High Level Design

1) In the Broker Catalog response, the Broker indicates if it wants precheck validations invoked for Instances and Bindings. New optional Response Fields for /v2/catalog:

Response Field	Type	Description
supports_validating_instance_operations	boolean	Specifies whether the Broker supports dry-run validation of Create/Update/Delete operations on Service Instances.
supports_validating_binding_operations	boolean	Specifies whether the Broker supports dry-run validation of Create/Update/Delete operations on Service Bindings.

2) If the Broker wants precheck validation, it must accept validation invocations at /v2/service_instances/validate:instance_id and /v2/service_instances/:instance_id/service_bindings/validate:binding_id. The validation requests are formed exactly as original target operation except the platform must not specify accepts_incomplete=true. Otherwise the request has identical request headers, parameters, body, and invocation method. I.E., to validate deleting a Binding, the platform will execute a HTTP DELETE against /v2/service_instances/:instance_id/service_bindings/validate:binding_id

TBD/Needs investigation:

Red Hat's Automation Service Broker would ideally do all its validation in a new pod that would have to be spun up requiring async support on the validation, there may be others in this category wanting async support, need to investigate options here.
Is a Precheck rejection considered temporary or permanent? See issues around namespace/mass deletion raised by edbartsch

edbartsch commented 6 years ago

What would happen in case of mass deletions like deletion of CF space: Would it mean that a service broker will be able to prevent deletion of a space? Would CF first ask all service brokers whether deletion is possible and only then start sending first deletion requests? What if one service broker says "no" because there are dependent service instances that have to be deleted first? But if those service instances are deleted first during CF space deletion the service broker would say "yes"? May be, the deletion check should include a mass deletion flag that indicates which hierarchy level is actually being deleted (e.g. CF Org, CF Space, single instance)

mattmcneeney commented 6 years ago

Hey @jboyd01. You mentioned yesterday that if a broker says "yes you can delete this instance", but then the instance still fails, then service-catalog are planning on implementing something to provide an adequate UX in this scenario. This seems fairly likely since the validate request is really just a "best guess" by the broker. What happens today in this case and how are you planning to handle this scenario?

If a deprovision request fails in CF, then the user has a few options: 1) Then can try again, in case it was a flake, and they could try to debug what happened if they wanted to 2) If they got some meaningful error message back (such as the instance needs to be updated to a smaller plan before being deleted), then they can go and take some action 3) If they never want to use the instance again (and don't care about leaving resources hanging around behind the broker), they can purge the service instance In all of these cases, I'm not sure what we gain from finding out whether the deprovision request is likely to fail in advance (may not be truthful).

jboyd01 commented 6 years ago

@edbartsch wrote:

What would happen in case of mass deletions like deletion of CF space: Would it mean that a service broker will be able to prevent deletion of a space?

Yes, but that's already the case today without the precheck but on the Delete operation itself. If the delete is rejected, the namespace sits in a "terminating" state until the instances and bindings are removed. I don't believe we'd see any different behavior on this one: the delete instance request comes in, the platform executes the precheck validation and it told to fail by the Broker, the platform then tells CF or Kubernetes that the delete was rejected. With Kubernetes, when you delete a namespace, the namespace is set to "terminating" and will stay that way until the instances and bindings are successfully deleted.

What if one service broker says "no" because there are dependent service instances that have to be deleted first? But if those service instances are deleted first during CF space deletion the service broker would say "yes"?

I think you are suggesting that if you have two services instances such as A and B and A can't be deleted until B is, the mass deletion should result in retrying the deletion on A once B is removed, do I have this correct? I'll have to experiment with this in Kuberenetes. You raised a good angle to consider here - thanks! I'll add this scenario as we move forward.

jboyd01 commented 6 years ago

@mattmcneeney: This is a recognized pain point - we have a github issue that reports the problem and the maintainers agree we need to address it, there has been discussion along with some potential ideas thrown around, but we do not have an agreed upon solution yet. Whatever actions we take around this will likely be multipronged, each addressing a specific scenario.

Your three options look to be similar to our choices in Kuberenetes. I'd say # 2 is likely the first action users will take (review the error returned by the broker). # 3 is presently a manual process for our users involving editing the resource and removing the Service Catalog "finalizer" which allows deletion of the Service Catalog object without doing the deletion at the Broker. We've got a PR to add this "nuclear" or "abandon" command to make this easier for the user while stressing to the user this may result in Broker resources they will continue to be charged for.

The prechecks alleviate the pain for the class of issues that result in denial from ACL or Business Logic validation. I agree this isn't a magic bullet, but it addresses a sizable segment of the issues.

duglin commented 6 years ago

... but we do not have an agreed upon solution yet

I thought svc-cat's vote ended the discussion of alternative solutions - am I mistaken?

I'd say # 2 is likely the first action users will take (review the error returned by the broker)

But in Kube we don't allow them to do anything about it (like downgrade the plan, per matt's suggestion) and that's a big difference.

I agree this isn't a magic bullet, but it addresses a sizable segment of the issues

I'd like some data behind this assertion because I would claim the opposite. From an OSBAPI perspective, a new operation like this can cause more harm than good. Since there is no clear definition of what the broker is supposed to do or what kind of guarantee the response will have associated with it (each broker/instance will have its own list of checks it can do w/o actually performing the ultimate operation), so its ambiguous w.r.t what the Platform can assert from a "yes" (or "no") response. Any response from a check type of operation might change right after the operation is completed, which means the Platform can't actually make any definitive decision based on it - the Platform must always be prepared to deal with the real operation saying "no". A "yes" from the check is virtually meaningless in the face of this reality.

The only way for the Platform to know for sure about the results of the real operation is to actually attempt it - anything else is just a guess and therefore the platform taking any actions based on it is questionable at best, and perhaps dangerous at worst if $ or serious business logic is at stake (e.g. being given the wrong answer and being forced to prematurely delete a resource due to it). From an OSBAPI spec perspective, this new operation would provide no additional value over just calling the real operation but would open the door for a "guess" type of operation which is not appropriate for a formal API spec.

Again, this is just from a straight OSBAPI spec perspective.

tinygrasshopper commented 6 years ago

@jboyd01 A thought about rationale for the change. I think the underlying driver for this proposal boils down to the platform not having the ability to change the state of an object back to its original version(pre update/pre delete), if the operation fails on the broker. To solve this issue, the proposal adds a pre-check before the platform does the operations. Rather than adding an additional pre-check hook, would it possible for the platform to call the actual operation(broker update/delete) in the precheck validating webhook?

duglin commented 6 years ago

As a side note... if we did add an op like this we'd need to support an async version of it since the processing behind the check could be complicated and time consuming - this is pretty much why we added async versions to most of the OSBAPI ops. This means that K8s would need to support an async call to the op, which I'm not sure is possible given the synchronous nature of how K8s is expected to deal with webhooks and the timeout constraints on the original request coming into the K8s apiserver.

MHBauer commented 6 years ago

I think this adds complexity without providing a 100% solution. I know that seems like a high bar, but if we already have to have a recovery case, and after still have to have a recovery case, then I would choose the solution with less complications and the recovery case. We cannot make this required to implement, so we must hope future brokers see the value in adding it.

I think we need to ask ourselves "Does this result in enough fewer problems that it is worth the extra complication of implementing and running and maintaining?"

I've tried to write up the cases so we can see everything at once: Any time you see "Do delete" that is the point of no return for kube.

existing cases

Delete success
- Do delete
- delete succeeds
Delete Failure
- Do delete
- delete fails, enter recovery

New Cases are 1-3, 4&5 are the original cases

successful precheck
- Do precheck
- precheck succeeds
- Do delete
- delete succeeds
failed precheck
- Do precheck
- precheck fails
successful precheck, delete fails anyway
- Do precheck
- precheck succeeds
- Do delete
- delete fails, enter recovery
precheck not supported, delete succeeds
- Do precheck
- precheck not supported
- Do delete
- delete succeeds
precheck not supported, delete fails
- Do precheck
- precheck not supported
- Do delete
- delete fails, enter recovery

MHBauer commented 6 years ago

@tinygrasshopper https://github.com/openservicebrokerapi/servicebroker/issues/572#issuecomment-412871687 It is possible to do that, but it carries the potential of losing the requested update, while still carrying it out. The built in pre-check happens before persistence, so there is no guarantee the pre-checker doesn't crash after receiving a client request and before making the call, or after making the call but before persisting the deletion being attempted.

MHBauer commented 6 years ago

https://github.com/openservicebrokerapi/servicebroker/issues/572#issuecomment-411320869 I think the concerns from @edbartsch sound similar to problems we've had deleting namespaces in kubernetes, where the namespace deletion is hung-up on deleting our service-catalog resources.

jboyd01 commented 6 years ago

@duglin, in response to https://github.com/openservicebrokerapi/servicebroker/issues/572#issuecomment-412704585

... but we do not have an agreed upon solution yet I thought svc-cat's vote ended the discussion of alternative solutions - am I mistaken?

This was in specific reference to the scenario where despite the validating webhook passing, if the broker encounters some error, we could end up with a failed delete - how do we deal with these? We've briefly discussed (and put on hold) elevating the issue to an admin "queue" , effectively moving the issue out of the user's domain and making it an issue the admin has to deal with. There may be other options we could pursue here.

I agree, the svc-cat vote ended discussions about changing the way Kubernetes delete works.

duglin commented 6 years ago

+1 to @MHBauer's comments about how no matter what Kube will need to deal with what he calls "enter recovery". Platforms using a pre-check op will not avoid this need, so adding something to our spec which doesn't actually solve the problem (but adds complexity) doesn't make a lot of sense to me.

edbartsch commented 6 years ago

If we view this proposal as precheck done by the platform before doing the actual operation, it is not making much sense, I fully agree.

But we may adjust the proposal in a following way: Instead of going for platform (CF or Kube) doing following

Do precheck of some operation
precheck succeeds
Do some operation
Operation succeeds

we may go for

Do precheck of some operation with some parameters
precheck fails
Let human being or script analyse the sutuation (i.e. the error message returned)
Do precheck of some operation with some other parameters
precheck succeeds
Do operation
Operation succeeds

This would mean for example for CF that new commands / cloud controller APIs to be introduced like:

cf check-create-instance
cf check-delete-instance
cf check-update-instance

duglin commented 6 years ago

@edbartsch given that we have the ability for a broker to return the schema of the parameters, I would think that would cover most of the trivial checks that might be needed, which means it can be done by the platform itself w/o talking to the broker. And for cases where a simple schema check isn't sufficient, it seems like the number of times people will want to ask "what if?" on an operation and not actually attempt it is pretty small. Plus, add into that the fact that its (at best) a guess, I just don't see the value in adding the complexity when I doubt anyone will use it for anything expect as a toy operation.

But, ignoring all of that, since its pretty subjective, regardless of whether or not we add this operation, the Platform still needs to deal with the situation of the real operation failing, and allowing the end user to continue to use the resource in question - at least for 4xx errors. Given that the main driving force behind this proposal is to try to specifically not deal with this situation in the proper way, it shifts from something I think isn't very useful to something that I consider harmful to the spec's adoption and the Platform's users.

mattmcneeney commented 6 years ago

@edbartsch From a CF perspective, I think it is very unlikely that we'd add commands like cf check-create-instance. This workflow doesn't exist for any other commands (i.e. cf check-push) and would make the self-service model we have built for services less intuitive. Also, since users are very used to simply running cf create-service and then dealing with errors as and when they occur (i.e. missing configuration parameters), even if we added these commands, it is doubtful they would ever be used.

We could add logic to the Cloud Controller to automatically get an indication of whether or not the real request would succeed or fail, but I don't believe we would get any benefit from writing that additional logic.

edbartsch commented 6 years ago

@mattmcneeney The checking REST calls are mainly interesting for humans and user interfaces as they introduce a non-destructive way to check validity of operations to be executed. JSON schema definition for parameters covers only a small subset so that there is still a room for delegation of complex checks to service brokers (e.g. "Are certain combinations of parameters applicable to the current service instance?"). I agree that "cf check-create-instance" is not really necessary as the fresh but undesired instance can always be deleted right after creation. "cf update-instance" is a different beast. If you are an operator of a database instance that manages company data that is worth of billions of dollars, that last thing you want to do during instance update is to play russian rulette with the instance parameters. Any additional check is welcome.

But for every problem there are usually multiple options how to solve it. https://github.com/openservicebrokerapi/servicebroker/pull/570 solves the requirements of this proposal is a bit different way (by allowing brokers to say NO during operation execution).

Therefore, I agree with the others that this proposal (572) is going into a wrong direction and should be rejected because the pull reqeust 570 solves the requirements but in a different way .

pmorie commented 6 years ago

@tinygrasshopper

Rather than adding an additional pre-check hook, would it possible for the platform to call the actual operation(broker update/delete) in the precheck validating webhook?

This isn't the right course of action for Kube; admission validating webhooks are meant to give users of Kubernetes (and people that build software using Kubernetes extension mechanisms) a chance to approve or reject an operation on a Kubernetes resource before that change is accepted. Calling the actual operation in a webhook isn't a fit for the state reconcilation concept, in that Kubernetes controllers reconcile the accepted declarative state from the user. The problem this is intended to solve is allowing brokers to reject a request that should not be accepted into the system. Today, there is no mechanism we can use in Kubernetes to validate that a particular change to a ServiceInstance resource, for example, is actually allowed. This proposal is meant to provide such a mechanism that will allow the Kubernetes catalog to reject changes to service-catalog resources that would place them in a state where the broker would be known to fail. For example, if an update to a ServiceInstance is made that would result in an invalid combination of plan and parameters, the broker would have a chance to reject that update before it is accepted into a Kubernetes API. Kubernetes APIs have the characteristic that they represent the user's desired state for the system to be in. By accepting a particular state of a resource into Kubernetes, the system has told the users that their desired state is acceptable. Therefore, you can think of the proposed change as allowing brokers to participate in API validation and provide feedback to the user before the system has accepted an invalid state.

nilebox commented 6 years ago

IMO the URL pattern /v2/service_instances/validate:instance_id is not RESTy, I would rather propose either /v2/service_instances/instance_id/validate or /v2/service_instances/validate/instance_id

Another alternative option is to add a "dry run" mode to normal create/update/delete OSB requests instead. This is aligned with a Kubernetes feature https://github.com/kubernetes/kubernetes/issues/11488 (that is also supported on the backend, but hasn't been fully implemented yet). For example, DeleteOptions now has a flag for dry run: https://github.com/kubernetes/apimachinery/blob/667bdebd3f319f83b02daaa5aef5d9b9a2518c2c/pkg/apis/meta/v1/types.go#L463-L469 same applies to CreateOptions and DeleteOptions.

This could be dangerous in case OSB broker ignores the dryRun flag and applies changes instead though...

duglin commented 6 years ago

Per today's call, will put this one on-hold until @jboyd01 indicates its ready for us to resume our discussions.

koznov commented 2 years ago

@duglin @jboyd01 Is there any news regarding adding that functionality of validating WebHooks for CRUD?

openservicebrokerapi / servicebroker

Proposal: Validating WebHooks for Create/Update/Delete of Instances & Bindings #572

Purpose

Rationale

Scenario 1 (Instance Update)

Scenario 2 (Instance Deletion)

High Level Design

TBD/Needs investigation: