kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API
https://svc-cat.io
Apache License 2.0

Add imperative deprovision/unbind flag to ServiceInstance/ServiceBinding #1942

Closed: kibbles-n-bytes closed this issue 5 years ago

kibbles-n-bytes commented 6 years ago

As deletion of a k8s resource is permanent once it has been accepted by the API server, we currently get into a strange state: if a deprovision/unbind request fails at the broker, the k8s resource is stuck permanently in a failed deletion state.

At the April F2F we decided to add deprovision/unbind flags to allow the user to attempt a deprovision/unbind without having to delete the Service Catalog k8s resource.

n3wscott commented 6 years ago

What about the state where I just want that object out of my k8s resource space and I don't care that the broker says I can't delete it... maybe it is a borrowed resource or something?

kibbles-n-bytes commented 6 years ago

@n3wscott In that case, you can remove the resource today by issuing a delete request and then removing the Service Catalog finalizer via kubectl edit.
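For completeness, the same workaround can be scripted. Here is a minimal sketch using a recent client-go dynamic client; the instance name, namespace, and kubeconfig path are illustrative, not part of any documented flow:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig, as kubectl would.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{
		Group:    "servicecatalog.k8s.io",
		Version:  "v1beta1",
		Resource: "serviceinstances",
	}
	// Clearing the finalizer list lets the already-accepted delete complete
	// without Service Catalog ever talking to the broker again.
	patch := []byte(`{"metadata":{"finalizers":null}}`)
	if _, err := client.Resource(gvr).Namespace("default").Patch(
		context.TODO(), "my-instance", types.MergePatchType, patch,
		metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```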

pmorie commented 6 years ago

I'm kind of hesitant about doing this for a few reasons:

  1. It is fundamentally different than what users will expect from other parts of kube
  2. What happens after you set this field? Can you update the plan? Can you change the parameters?
  3. What happens if you set this field and then delete the resource while async deprovision is happening?
  4. How will users tell that an instance has been deprovisioned but not deleted?
  5. Will users be able to create new bindings to deprovisioned instances?
  6. Will you be able to create a new ServiceInstance with this field set?
  7. We will still have to support the exact same behavior as we do currently, so you will still be able to trigger a deprovision by deleting the instance. How will we tell users which option to choose?

I don't think any of the use cases or motivation we had for this made it into this issue. Before we proceed, I think we need to understand exactly why we feel we need to make a fairly significant change to the API, and be very crisp about when to use this feature. As an example: if you were to look at this issue right now, it's not really clear what specific situations it is intended to resolve. Is it when parameters are in a bad state? When an invalid plan is selected?

On a longer term basis, I'm very concerned about adding things to our API that don't behave as you would expect them to from using other parts of kube. IMO we should be moving OSB to be more compatible with kube, not the other way around.

kibbles-n-bytes commented 6 years ago

It is fundamentally different than what users will expect from other parts of kube

It's possible to model this declaratively. An existsExternally bool that defaults to true would allow the user to declare their intent as to whether the resource should be persisted on the broker, and the controller would reconcile this intent with the state of the world, provisioning/deprovisioning as necessary.
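Concretely, a sketch of what such a field could look like on the spec type; the name, default, and semantics come from this comment, not from merged code:

```go
package v1beta1

// Hypothetical sketch of the proposed flag on the existing spec type.
type ServiceInstanceSpec struct {
	// ... existing fields (class/plan references, Parameters, ...) elided ...

	// ExistsExternally declares whether the instance should exist at the
	// broker. Defaulted to true; setting it to false asks the controller
	// to deprovision at the broker while keeping the k8s resource.
	ExistsExternally bool `json:"existsExternally"`
}
```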

What happens after you set this field? Can you update the plan? Can you change the parameters?

You can, but presumably the deprovision operation would take precedence. And while the resource is marked as !existsExternally, the spec updates wouldn't try anything on the broker. Reminds me a bit of paused on Deployments?

What happens if you set this field and then delete the resource while async deprovision is happening?

The API server would set the deletion timestamp, which the controller would notice after the async deprovision finishes; it would then go through the normal delete flow, which would just remove the finalizer, just like the case where we delete an unprovisioned instance.

How will users tell that an instance has been deprovisioned but not deleted?

There would be no ExternalProperties, and our Provisioned value in our status would be NotProvisioned.

Will users be able to create new bindings to deprovisioned instances?

They could create the k8s resource, but our controller wouldn't create them on the broker because the instance wouldn't be in the Ready state.

Will you be able to create a new ServiceInstance with this field set?

Sure; the controller just wouldn't provision it and it'd act like a paused Deployment.

We will still have to support the exact same behavior as we do currently, so you will still be able to trigger a deprovision by deleting the instance. How will we tell users which option to choose?

If the k8s user wants the option to still use the instance after having requested a deprovision, on the off chance they either lacked permission or want to revive the resource later on the broker side, they should set !existsExternally. If they are sure they no longer want the k8s resource, they should request a delete, which they understand to be 100% fatal to the resource.

kibbles-n-bytes commented 6 years ago

As for motivations, @duglin could you describe your situation here to give some context?

duglin commented 6 years ago

I think @kibbles-n-bytes said it well in his opening comment:


As deletion of a k8s resource is permanent once it has been accepted by the API server, we currently get into a strange state: if a deprovision/unbind request fails at the broker, the k8s resource is stuck permanently in a failed deletion state.


This is to deal with cases where the Broker rejects the delete. At some point we need to allow the system to get back into a pre-delete-request state but Kube won't let us do that if we start down the true delete() path.

nilebox commented 6 years ago

I agree with @pmorie that this is a strange feature. If the broker can't delete the resource, or gets stuck, that's the broker's problem. There is no way back, and there is no guarantee that you can just pretend you never tried to delete the resource - the broker might have already deleted some data.

Also, what about cascading deletion? We use ownerReference to build a dependency graph. When a user sends a delete request with foreground deletion (which will be the default for kubectl in the next release, see https://github.com/kubernetes/kubernetes/pull/59851), GC will start with the leaves and work toward the parent. How will that work with a delete field? I'm concerned that we are going against the Kubernetes nature here. Kubernetes doesn't support "undeleting" for a reason, and a generic service such as Service Catalog cannot guarantee that "undeleting" will always be possible and safe.
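For readers following along, the edges in that dependency graph are plain ownerReferences. A minimal sketch of the reference a ServiceBinding might carry (all values illustrative):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	block := true
	// With foreground deletion, the GC deletes dependents carrying this
	// reference before it lets the owning ServiceInstance disappear.
	ref := metav1.OwnerReference{
		APIVersion:         "servicecatalog.k8s.io/v1beta1",
		Kind:               "ServiceInstance",
		Name:               "my-db-instance",
		UID:                "11111111-2222-3333-4444-555555555555",
		BlockOwnerDeletion: &block,
	}
	fmt.Printf("%+v\n", ref)
}
```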

nilebox commented 6 years ago

I can think of a safe "undelete" under certain conditions:

If we limit the scope of the delete flag like this, it could be tolerable. But I'm still concerned that we are heading in a non-kube direction, as @pmorie already said.

ash2k commented 6 years ago

This is to deal with cases where the Broker rejects the delete. At some point we need to allow the system to get back into a pre-delete-request state but Kube won't let us do that if we start down the true delete() path.

@duglin Could you elaborate on what "rejects" means? Is it because of some sort of authorization? If so, the problem is that that authorization happens after the state desired by the user has been persisted in the API server. It needs to happen BEFORE the state is persisted. Kubernetes provides a mechanism for that: admission hooks. What is the reason it cannot be used?

The "Authn -> Authz -> State is persisted -> Authz" flow just does not work. I expect more issues with this approach. Spec locking, "imperative deprovision" - this is not how other resources work in Kubernetes. Please don't do this to us, Service Catalog users. We want consistent behaviour across all Kuberentes resources.

Another issue with this "get back to a good state after a delete" approach is that AFAIK the OSB spec does not guarantee that an instance is in a valid state after a failed delete. Basically this "imperative deprovision" cannot be used with an arbitrary broker without modifying the OSB spec. Am I mistaken here?

Have you discussed this issue with someone from SIG-Architecture? I strongly recommend that.

MHBauer commented 6 years ago

https://github.com/kubernetes-incubator/service-catalog/issues/1942#issuecomment-391504929

If the broker can't delete the resource, or gets stuck, that's the broker's problem. There is no way back, and there is no guarantee that you can just pretend you never tried to delete the resource - the broker might have already deleted some data.

In this case the broker won't have done anything. Kube is stuck deleting even though the resource should not be deleted.

GC

Should not be in effect here until the controller issues the actual DELETE, at which point everything should occur as normal.

https://github.com/kubernetes-incubator/service-catalog/issues/1942#issuecomment-391510865

Conditions

I will have to think about these. Potentially.

https://github.com/kubernetes-incubator/service-catalog/issues/1942#issuecomment-391538651

what "rejects" means? Is it because of some sort of authorization?

Yes. Authorization to access the broker.

admission hooks. What is the reason it cannot be used?

I proposed this at the F2F and was shot down in favor of this direction. The main problem here is that there is no preflight check in the OSB API. The admission controller would have to actually perform all of the deletion actions, and then do persistence.

"Authn -> Authz -> State is persisted -> Authz" flow just does not work.

I think this hits at the very core of the issue. Unfortunately, once the state of DELETE is persisted, it cannot be un-persisted. Thus, we must find a workaround.

AFAIK the OSB spec does not guarantee that an instance is in a valid state after a failed delete. Basically this "imperative deprovision" cannot be used with an arbitrary broker without modifying the OSB spec. Am I mistaken here?

Reading https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#originating-identity

I am unsure if there is a specific HTTP code to use as a result of this failure.

Reading https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#authentication

Maybe 401

Reading https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#unbinding

Maybe 400

Reading https://github.com/openservicebrokerapi/servicebroker/blob/v2.13/spec.md#deprovisioning

Maybe 400

Have you discussed this issue with someone from SIG-Architecture? I strongly recommend that.

No.

nilebox commented 6 years ago

Maybe 400 (deprovisioning)

Can you think of any real situation where this might happen, assuming that Service Catalog has no bugs and the OSB broker is implemented correctly according to the spec? Service Catalog just sends instance_id (along with service_id and plan_id); what could possibly be incorrect in this request? If a deprovision request fails with 400, to me that's a sign of either a bug in the OSB broker or some unknown state of the underlying resource. And in that case, getting stuck in a terminal deletion state is the best you can do - the user can't fix this.
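For reference, the entire deprovision call defined by OSB v2.13 is a parameterized DELETE; a sketch of constructing it (broker URL and IDs are placeholders) shows how little a correct client can get wrong:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// deprovisionRequest builds the OSB v2.13 deprovision call:
// DELETE /v2/service_instances/:instance_id?service_id=..&plan_id=..&accepts_incomplete=true
func deprovisionRequest(broker, instanceID, serviceID, planID string) (*http.Request, error) {
	q := url.Values{}
	q.Set("service_id", serviceID)
	q.Set("plan_id", planID)
	q.Set("accepts_incomplete", "true")
	u := fmt.Sprintf("%s/v2/service_instances/%s?%s", broker, instanceID, q.Encode())
	req, err := http.NewRequest(http.MethodDelete, u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Broker-API-Version", "2.13")
	return req, nil
}

func main() {
	// All values below are placeholders.
	req, err := deprovisionRequest("https://broker.example.com", "inst-id", "svc-id", "plan-id")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
}
```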

403 Forbidden is a more appropriate rejection result for a deprovision request (and the real reason for this change, based on our previous discussions). Interestingly enough, 403 is not among the explicitly allowed response codes for an OSB deprovision request. The spec just vaguely says

Responses with any other status code will be interpreted as a failure and the service instance will remain in the marketplace database.

OK, let's assume that 403 can be returned by any broker. What do we do in that case? Does the ServiceInstance get rolled back to the latest healthy state? Will the user be able to update the spec and continue using it, as if deletion had never been attempted?

And what about requests for which you get 202 Accepted and then get status: failed in the last_operation polling result? Will you just retry deleting as currently, with deletionTimestamp set?

MHBauer commented 6 years ago

https://github.com/kubernetes-incubator/service-catalog/issues/1942#issuecomment-391717768

What do we do in that case? Does the ServiceInstance get rolled back to the latest healthy state? Will the user be able to update the spec and continue using it, as if deletion had never been attempted?

Yes, I think this is the concept. The appropriate user should barely know that someone unauthorized was poking at their stuff.

And what about requests for which you get 202 Accepted and then get status: failed in the last_operation polling result? Will you just retry deleting as currently, with deletionTimestamp set?

I do not understand where this case is coming from.

nilebox commented 6 years ago

The appropriate user should barely know that someone unauthorized was poking at their stuff.

This is not how Kubernetes works. Once the spec is updated (the "desired state"), it stays the same and Kubernetes never rolls it back, even if it fails to make it happen (make the "actual state" match the "desired state") after many retries. It just gets stuck with the failed status.

Kubernetes can only reject the delete request on the fly, before it persists the changes - during authz and other validation, e.g. via admission controllers.

nilebox commented 6 years ago

I do not understand where this case is coming from.

I am just trying to understand the scope of this change. From the issue description:

At the April F2F we decided to add deprovision/unbind flags to allow the user to attempt a deprovision/unbind without having to delete the Service Catalog k8s resource.

I wasn't at the F2F, and there are no implementation/behavior details here. It looks like there is actually no agreement on that yet? The design spec in #2037 is very different from what this issue says.

nilebox commented 6 years ago

@kibbles-n-bytes after reading your comments about the declarative existsExternally flag, I agree that it's more tolerable than what #2037 proposes, with its deletionRequested flag and the controller automatically deleting the instance after successful deprovisioning. I would prefer an opposite flag though - one that is false by default. Something like the paused deployment flag you mentioned. This would make it clear that this is an opt-in flag that you don't have to use or pay attention to.

And while the resource is marked as !existsExternally, the spec updates wouldn't try anything on the broker.

@kibbles-n-bytes and what happens if the user sets existsExternally back to true? Will the instance be provisioned again after deprovisioning has successfully finished? I presume the answer is "yes". I suspect our code will get a lot more complicated than it already is.

nilebox commented 6 years ago

I would prefer an opposite flag though - one that is false by default.

Another argument for the flag being false by default is backward compatibility. You cannot introduce a new flag that is true by default - all the existing instances lack this field.

nilebox commented 6 years ago

Another thing to think about: even after adding a new flag, a user can still issue kubectl delete serviceinstance ... and you'll end up in the same situation as now.

To mitigate this you would have to implement an admission controller that rejects a DELETE request if the ServiceInstance hasn't reconciled yet, or ProvisionStatus != NotProvisioned, or DeprovisionStatus != NotRequired - i.e. only allow deletion after successful deprovisioning. Otherwise there is still a chance for an instance to get stuck forever because of a 403.
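A rough sketch of what such a validating webhook could look like. This uses the current admission/v1 types (newer than this thread) and hand-rolled status structs mirroring the v1beta1 field names, so treat every name as illustrative rather than as a proposed implementation:

```go
package main

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Just the status fields the check needs; JSON names follow the v1beta1 types.
type instance struct {
	Status struct {
		ProvisionStatus   string `json:"provisionStatus"`
		DeprovisionStatus string `json:"deprovisionStatus"`
	} `json:"status"`
}

// admit allows a DELETE only after the instance has been deprovisioned.
func admit(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}
	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	if review.Request.Operation == admissionv1.Delete {
		var inst instance
		// On DELETE the object being removed arrives in OldObject.
		if err := json.Unmarshal(review.Request.OldObject.Raw, &inst); err == nil &&
			(inst.Status.ProvisionStatus != "NotProvisioned" ||
				inst.Status.DeprovisionStatus != "NotRequired") {
			resp.Allowed = false
			resp.Result = &metav1.Status{Message: "deprovision the instance before deleting it"}
		}
	}
	review.Response = resp
	json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", admit)
	// Real webhooks must serve TLS; omitted here for brevity.
	http.ListenAndServe(":8443", nil)
}
```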

P.S. If we agree to make such a change, this whole hackery must of course be opt-in under a feature flag. I don't think the majority of Service Catalog users actually need any of this.

MHBauer commented 6 years ago

Kubernetes can only reject the delete request on the fly, before it persists the changes - during authz and other validation, e.g. via admission controllers.

It is impossible for kube to know that it shouldn't allow the delete to be persisted before contacting the broker and attempting a delete.

Another thing to think about: even after adding a new flag, a user can still issue kubectl delete serviceinstance ... and you'll end up in the same situation as now.

We can prevent this with RBAC rules for the actual user. This is as yet unimplemented and should be implemented by https://github.com/kubernetes-incubator/service-catalog/issues/2061
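Since RBAC is purely additive, "preventing" delete just means binding users to a role that omits the delete verb. A sketch with the rbac/v1 types; the role name and scope are illustrative:

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// RBAC is additive: omitting the "delete" verb is what denies deletion.
	role := rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: "svcat-no-delete", Namespace: "default"},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{"servicecatalog.k8s.io"},
			Resources: []string{"serviceinstances", "servicebindings"},
			Verbs:     []string{"get", "list", "watch", "create", "update", "patch"},
		}},
	}
	fmt.Printf("%+v\n", role)
}
```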

To mitigate this you would have to implement an admission controller that rejects a DELETE request if the ServiceInstance hasn't reconciled yet, or ProvisionStatus != NotProvisioned, or DeprovisionStatus != NotRequired - i.e. only allow deletion after successful deprovisioning. Otherwise there is still a chance for an instance to get stuck forever because of a 403.

Yes. I made this suggestion during the F2F, among others. The delete flag was proposed by someone else. I had thought of it, but did not think of bringing it up, as it's so outside the bounds of kube behavior. This extra flag, set on update, was decided during the meeting to be the most appropriate response.

P.S. If we agree to make such a change, this whole hackery must of course be opt-in under a feature flag. I don't think the majority of Service Catalog users actually need any of this.

I agree.

nilebox commented 6 years ago

We can prevent this with RBAC rules for the actual user. This is as yet unimplemented and should be implemented by #2061

No, you can't, if the design is as @kibbles-n-bytes described it (which is different from how it's written in #2037) - the user still needs to manually delete the ServiceInstance after deprovisioning succeeds. As I said above, I prefer @kibbles-n-bytes's design with the declarative flag that controls provisioning/deprovisioning separately from creation/deletion (i.e. the user can set and unset the flag at any time, any number of times); it's less hacky than deletionRequested and more flexible.

I would like to discuss this issue at the next SIG call (June 4) and not merge anything until we reach consensus on the design.

duglin commented 6 years ago

P.S. If we agree to make such a change, this whole hackery must of course be opt-in under a feature flag. I don't think the majority of Service Catalog users actually need any of this.

I agree.

If we solve this via some new flag then we don't need to do anything special, like setting a feature flag. If people want to use the normal delete flow, they will, and it should just work. If they want to use the flag, they will, and it'll just work. The opt-in is the explicit use of one of those two mechanisms. However, blocking the normal kube delete is something that I agree will need some kind of trigger. Whether that's done at a global, broker, instance, or binding level is still TBD, I guess, but I agree it should be opt-in until the time we decide it should be the default flow for deprovision/unbind.

MHBauer commented 6 years ago

https://github.com/kubernetes-incubator/service-catalog/issues/1942#issuecomment-391914626 Don't apimachinery defaulting or versioning handle this?

nilebox commented 6 years ago

@MHBauer no, defaults apply only to new objects. Versioning - yes, if you support multiple versions; i.e., you would have to declare v2 and duplicate all the types. Currently we only support one version at a time and keep it backward compatible.

MHBauer commented 6 years ago

v1beta2?

We cannot possibly be the first to want to add a field whose default is not the Go zero value.


nilebox commented 6 years ago

Yes, simultaneously supporting both v1beta1 and v1beta2 for a long time (enough time for all users to migrate to v1beta2) would work. But it also means a lot of extra boilerplate. Do we want that? I doubt it.

We can also make the field nullable (*bool) and treat nil as true, but that's ugly.
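i.e. something along these lines - a sketch with hypothetical names, where nil (any object written before the field existed) reads as true:

```go
package main

import "fmt"

// Hypothetical spec fragment: a nullable flag whose absence (nil) reads as
// true, so objects written before the field existed keep their behavior.
type ServiceInstanceSpec struct {
	ExistsExternally *bool `json:"existsExternally,omitempty"`
}

// existsExternally interprets nil as true for backward compatibility.
func (s *ServiceInstanceSpec) existsExternally() bool {
	return s.ExistsExternally == nil || *s.ExistsExternally
}

func main() {
	var pre ServiceInstanceSpec        // pre-upgrade object: field never set
	fmt.Println(pre.existsExternally()) // prints true
}
```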

A flag that is false by default could probably be called disabled - with the semantics that a disabled instance shouldn't exist externally.

nilebox commented 6 years ago

@kibbles-n-bytes what happens to existing bindings if we mark an instance with existsExternally: false? Currently, deprovisioning is blocked by existing bindings. Do you also want to add existsExternally to bindings? Can you proceed with deprovisioning if a binding exists but is marked with existsExternally: false? Our code could get very messy and entangled with such corner cases.

MHBauer commented 6 years ago

https://github.com/kubernetes-incubator/service-catalog/issues/1942#issuecomment-392232135

Do you also want to add existsExternally to bindings?

Both Instances and Bindings need this behavior, so both should get the new field.

nilebox commented 6 years ago

AFAIK bindings still lack the retry loop that I added to instances; i.e., the reconciliation loop gets stuck after any error HTTP status received from the broker. This should be fixed first, before adding this behavior, IMO.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/service-catalog/issues/1942#issuecomment-504704505):

> Rotten issues close after 30d of inactivity.
> Reopen the issue with `/reopen`.
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta).
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.