kubernetes-retired / service-catalog

Consume services in Kubernetes using the Open Service Broker API
https://svc-cat.io
Apache License 2.0

Stop blocking Instance/Binding spec updates if there is an operation in progress #1755

Closed · nilebox closed this issue 5 years ago

nilebox commented 6 years ago

TL;DR Let's remove the locking mechanism preventing spec updates while there is an operation in progress. Sending just the last user that updated the spec is consistent with the "kube way" and is good enough.


Currently we prevent any spec updates while there is a provisioning/update operation in progress: https://github.com/kubernetes-incubator/service-catalog/blob/b3de6eca6ff3ed6049d85156a493ceee1db69097/pkg/apis/servicecatalog/validation/instance.go#L256-L268
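
For reference, the blocked-update rule boils down to something like the following (a simplified Go sketch with trimmed stand-in types, not the verbatim validation code):

```go
// Simplified sketch of the update-blocking check linked above. The real code
// lives in pkg/apis/servicecatalog/validation/instance.go; the types below are
// trimmed stand-ins rather than the actual API structs.
package validation

import (
	"fmt"
	"reflect"
)

type ServiceInstanceSpec struct {
	ClusterServicePlanExternalName string
	Parameters                     map[string]interface{}
}

type ServiceInstanceStatus struct {
	// OperationInProgress stands in for the status fields the controller sets
	// while a provision/update/deprovision call is outstanding.
	OperationInProgress bool
}

type ServiceInstance struct {
	Spec   ServiceInstanceSpec
	Status ServiceInstanceStatus
}

// validateInstanceUpdateAllowed is the rule this issue proposes to remove:
// any spec change is rejected while an operation is still in flight.
func validateInstanceUpdateAllowed(newInst, oldInst *ServiceInstance) error {
	specChanged := !reflect.DeepEqual(newInst.Spec, oldInst.Spec)
	if specChanged && oldInst.Status.OperationInProgress {
		return fmt.Errorf("spec changes are not allowed while an operation is in progress")
	}
	return nil
}
```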

The main reason for that is addressing #1149, i.e. preventing concurrent updates so that every spec change can be tied to the user who made it and sent as a separate OSB request to the broker with that user's info attached.

While this requirement is understandable from the OSB brokers' point of view, it's inconsistent with the "kube way" and is actually impossible to fully support, especially once we address retries after failures (#1715).

Example:

  1. User A creates an instance with P1 parameters in the spec.
  2. Service Catalog sends a provisioning request with P1 as payload and user A attached to it.
  3. Provisioning fails with 400 Bad Request, because one of the required parameters is missing.
  4. User B goes and adds the missing parameter P2 to the spec. Now the spec parameters are P = P1 + P2.
  5. Service Catalog sends a provisioning request with P = P1 + P2 as payload and user B attached to it. In other words, we have lost the information about user A (see the header sketch below).
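
For context on why only one user survives: the OSB originating-identity profile attaches a single identity to each request, so whatever spec aggregation happened beforehand is invisible to the broker. A rough Go sketch of how such a header is built (the struct and the example URL are illustrative, not Service Catalog code):

```go
// Sketch of how a single user identity ends up on an OSB request.
// Whatever spec Service Catalog sends, the header can carry only one user,
// so an aggregated spec (P1+P2) is attributed entirely to user B.
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
)

// originatingIdentity mirrors the JSON shape described in the OSB
// "Originating Identity" profile for the kubernetes platform.
type originatingIdentity struct {
	Username string              `json:"username"`
	UID      string              `json:"uid,omitempty"`
	Groups   []string            `json:"groups,omitempty"`
	Extra    map[string][]string `json:"extra,omitempty"`
}

func setOriginatingIdentity(req *http.Request, id originatingIdentity) error {
	raw, err := json.Marshal(id)
	if err != nil {
		return err
	}
	value := "kubernetes " + base64.StdEncoding.EncodeToString(raw)
	req.Header.Set("X-Broker-API-Originating-Identity", value)
	return nil
}

func main() {
	// Hypothetical broker endpoint, used only to show where the header goes.
	req, _ := http.NewRequest(http.MethodPatch, "https://broker.example/v2/service_instances/abc", nil)
	// Only user B is attached, even though the payload includes P1 made by user A.
	_ = setOriginatingIdentity(req, originatingIdentity{Username: "user-b"})
	fmt.Println(req.Header.Get("X-Broker-API-Originating-Identity"))
}
```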

The other annoying part that we'll face after fixing #1149 is that users won't be able to fix the spec until Service Catalog stops retrying (which by default takes 1 week). The only quick way out of this situation would be deleting the instance and creating a new one, which leads to a very confusing UX.

Proposal: Stop preventing spec updates, and send the last user that mutated the spec as the one responsible for the entire spec.

/cc @pmorie @duglin @kibbles-n-bytes @staebler @jboyd01

kibbles-n-bytes commented 6 years ago

I agree. I am happy enough to just say that if you submit a spec for the resource, you are linked with that spec. And we haven't even been consistent with this behavior today; we currently allow spec edits after a failed update.

I think architecturally we'll have to be a bit careful to make sure we don't end up in tricky situations, but from a UX perspective I have few qualms.

duglin commented 6 years ago

After step 3, the failure, is there any reason we don't reset the instance to the previous state? Do we have that info in the Status?

ash2k commented 6 years ago

After step 3, the failure, is there any reason that we don't reset the instance to the previous state?

That would be inconsistent with every other Kubernetes resource out there. E.g. a Deployment object does not revert the change if it cannot pull the Docker image. The spec holds the desired state provided by the user; we don't know what they would want to do if this state cannot be realized right now.

ash2k commented 6 years ago

The current behaviour contradicts one of the Kubernetes design principles:

Functionality must be level-based, meaning the system must operate correctly given the desired state and the current/observed state, regardless of how many intermediate state updates may have been missed. Edge-triggered behavior must be just an optimization.

I.e. it should be OK to miss intermediate updates and only act on the current desired state, and hence intermediate updates should not be blocked.

duglin commented 6 years ago

Yeah, but that was also written w/o taking into account that some 3rd-party system has control over our state, or any of the other constraints we're living under. Not saying what we have now is perfect, or even what we want going forward, but it's where we landed given all of the factors involved.

nilebox commented 6 years ago

but it's where we landed given all of the factors involved

As @pmorie commented in the original issue https://github.com/kubernetes-incubator/service-catalog/issues/1149#issuecomment-325394233:

This probably won't bear out in the long-term as a valid approach but it is good enough to get us to an initial beta.

As far as I understand, the current blocking behaviour was never meant to be final. And once we add support for retries (a critical UX feature), keeping the blocking will be extremely annoying and painful for users.

Also, I don't think that the behaviour without blocking will break the OSB spec, even if it's different from Cloud Foundry.

nilebox commented 6 years ago

@duglin @kibbles-n-bytes I have a proposal that can make the blocking behaviour opt-in: instead of hardcoding it in the API server validation, we can implement an admission controller that will do the same. Then it's up to the vendor/admins to decide whether this admission controller should be turned on or not.
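
A very rough sketch of what such an opt-in check could look like as a validating admission webhook (the handler shape, path, and field names are assumptions for illustration, not existing Service Catalog code):

```go
// Hypothetical opt-in admission webhook that re-implements the blocking rule
// outside the API server. The AdmissionReview types below are trimmed
// stand-ins for the real ones.
package main

import (
	"encoding/json"
	"net/http"
)

type admissionReview struct {
	Request  *admissionRequest  `json:"request,omitempty"`
	Response *admissionResponse `json:"response,omitempty"`
}

type admissionRequest struct {
	UID       string          `json:"uid"`
	Object    json.RawMessage `json:"object"`
	OldObject json.RawMessage `json:"oldObject"`
}

type admissionResponse struct {
	UID     string         `json:"uid"`
	Allowed bool           `json:"allowed"`
	Status  *statusMessage `json:"status,omitempty"`
}

type statusMessage struct {
	Message string `json:"message,omitempty"`
}

// instanceStatus pulls out just the fields this policy needs from the stored object.
type instanceStatus struct {
	Status struct {
		AsyncOpInProgress bool   `json:"asyncOpInProgress"`
		CurrentOperation  string `json:"currentOperation"`
	} `json:"status"`
}

func validateInstanceUpdate(w http.ResponseWriter, r *http.Request) {
	var review admissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed AdmissionReview", http.StatusBadRequest)
		return
	}

	var old instanceStatus
	_ = json.Unmarshal(review.Request.OldObject, &old)

	resp := &admissionResponse{UID: review.Request.UID, Allowed: true}
	if old.Status.AsyncOpInProgress || old.Status.CurrentOperation != "" {
		// A real implementation would also have to let the controller's own
		// status updates through, otherwise it would deadlock itself.
		resp.Allowed = false
		resp.Status = &statusMessage{Message: "an operation is in progress; spec updates are blocked by cluster policy"}
	}

	review.Response = resp
	_ = json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate-serviceinstance", validateInstanceUpdate)
	// Admission webhooks must be served over TLS; certificate setup is omitted here.
	_ = http.ListenAndServeTLS(":8443", "tls.crt", "tls.key", nil)
}
```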

nilebox commented 6 years ago

Another option is to add a parameter to BrokerSpec that turns blocking on for brokers that rely heavily on this behaviour (i.e. perform fine-grained RBAC authorization on their side).
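
A hypothetical shape for that knob (the field name is invented here; no such field exists in the API):

```go
// Hypothetical opt-in flag on the broker resource; the field name is made up
// and does not exist in the v1beta1 API.
package v1beta1

type ClusterServiceBrokerSpec struct {
	// ... existing fields elided ...

	// BlockSpecUpdatesDuringOperation restores today's behaviour for instances
	// and bindings of this broker: reject spec updates while a provision,
	// update or deprovision operation is still in progress.
	BlockSpecUpdatesDuringOperation bool `json:"blockSpecUpdatesDuringOperation,omitempty"`
}
```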

Let's discuss these options at the next SIG call.

nilebox commented 6 years ago

@duglin you suggested creating an OSB issue to clarify this part of the spec. Can you create one? Thanks.

duglin commented 6 years ago

@nilebox I can but I can't remember which part of the spec people thought was ambiguous. Can you refresh my memory?

nilebox commented 6 years ago

@duglin somewhere around the user info header? https://github.com/openservicebrokerapi/servicebroker/blob/master/profile.md#originating-identity-header

To be clear: I personally don't think it's ambiguous. I think it's good that the spec is not specific about the platform's behavior there (i.e. whether every change users make to instance parameters must be sent separately or not).

duglin commented 6 years ago

I agree - I don't think it's ambiguous either, which is why I was asking for the bit that people thought needed more clarity. While, in general, I think you're right that the platform should probably be allowed to decide when to do things like batch up changes into one update() and then associate that with just one user, in this case I don't think that's what we're talking about.

If all of the edits to the spec were done by the same kube user then I think it might be OK to batch them, but when the edits are not all done by the same user and the identity header is used in such a way that the specific user associated with the action matters (e.g. for authentication purposes), then batching defeats the point of the header because the wrong person will be associated with all aspects of the change. And at that point it feels like we're violating (at least) the spirit of the spec.

kibbles-n-bytes commented 6 years ago

I think calling it "batching" is a bit misleading; that almost implies that Service Catalog would be combining all requests at will for some sort of optimization. That's not the spirit of this change; this is only really relevant for failure situations, as successful requests would go through so quickly that in effect they'd be un-"batch"able. The spirit of this change is explicitly that, in the case where a request is not succeeding, if users want to modify the spec, they must be aware that their user identity will be tied to all pending changes.

In that interpretation, it's really not a discussion at the OSB spec level at all; it's a discussion solely about Service Catalog and its model of parameter ownership. In effect, we are 75% of the way there already, and mostly just pretending that we're actually blocking spec changes. Parameters sourced from secrets can be updated willy-nilly (will the request even be tied to the correct user identity??), and after an update fails, we allow the user to edit the spec and trigger a new update, the entirety of which will be associated with that user's identity even if the previously failed changes were made by someone else.

duglin commented 6 years ago

I know we’re not batching in general, but that is the net result in this situation. Say someone modifies the plan and it fails because of an auth issue. Someone else then modifies something else, say a param, and triggers an update. If the 2nd person has more auth than the first person, then both updates will happen and be associated with the 2nd user even though the plan update should never have been allowed at all. Do I have that right?

nilebox commented 6 years ago

@duglin You're right about the potential "batching effect", but again, it's out of the scope of OSB.

For example, Service Catalog could just say that it doesn't handle granular authorization of individual parameters at all, and the user that updates a single parameter is responsible for all parameters changed since the last successful OSB operation. That would be valid platform behavior according to OSB, with a correct header, and it won't break brokers either. If Service Catalog users want more granular control, they will have to implement it with admission control or RBAC on the Kubernetes side.

As @kibbles-n-bytes said, that's what already happens in 75% of scenarios, and we can't fix that (well, theoretically we could start blocking Secret updates with an admission controller and roll back changes if the OSB request fails, but that would be too far from how Kubernetes resources are supposed to work).

duglin commented 6 years ago

@nilebox I'm not sure it's out of scope of OSB. My example wasn't about updating individual parameters; it was about updating parameters AND updating the plan as two separate actions by two different individuals with different rights.

This isn't about breaking brokers as much as it's about breaking the auth model of the platform that is in place today. With the proliferation of platforms (docker, cf, kube, etc...) we had to identify a common spot in the workflow that would allow us to perform additional business-logic checks on the actions without forcing us to write custom platform-specific code each time. And in fact, some platforms don't allow for the insertion of additional checks at all. So, the common (and guaranteed) spot to insert this logic was on the OSBAPI flow between the platform and the broker, by using what we call a "proxy broker". It's basically part of the platform and can do any additional checks/logic that are needed regardless of the platform being used and what checks the platform allows us to insert.

I'd like to back up the discussion a bit and understand why upon receiving a "400 bad request" we can't just rollback the spec to the previous (good) state since the data provided was clearly bad. We could set a condition to indicate why it was rolled back so the user can understand why/what happened. It just seems to me that allowing someone to update the spec (not to fix it but to make other changes) when we know the current state is bad, and will probably continue to be bad, is just making the situation worse.

I agree this isn't the Kube way, but Kube wasn't designed to handle the situation where a remote system needed to be checked-with before a spec change was finalized. So this locking mechanism was put in place to deal with it.

I guess one option (and this feels really funky) would be to make the OSBAPI call as part of an admission controller and then only update the spec if the OSBAPI call worked - otherwise reject the change. :-) If you think about it, this is pretty much the rollback idea, because that's the same net result, just w/o the synchronous nature of the flow.

nilebox commented 6 years ago

My example wasn't about updating individual parameters; it was about updating parameters AND updating the plan as two separate actions by two different individuals with different rights.

@duglin I think that's a more specific problem than the one we've discussed before, and there might be a specific solution for it. For example, block spec updates only if there is a plan change that hasn't been reconciled yet.

why upon receiving a "400 bad request" we can't just rollback the spec to the previous (good) state since the data provided was clearly bad

We technically can, but that's not how Kubernetes resources work - fixing a broken spec requires manual intervention from the user. Also, we can't roll back secret parameters, so again we are only talking about covering 25% of the problem (while the other 75% is still not covered).

It just seems to me that allowing someone to update the spec (not to fix it but to make other changes) when we know the current state is bad, and will probably continue to be bad, is just making the situation worse.

I disagree. Manual intervention is much better than a rollback with magic that competes with the user for updating the spec.

make the OSBAPI call as part of an admission controller and then only update the spec if the OSBAPI call worked - otherwise reject the change

That actually shows why rolling back the spec is bad: Kubernetes resources are reconciled asynchronously, and what you suggest with admission control makes them synchronous. Blocking spec updates is also an attempt to make them single-threaded and synchronous in some way.

I don't think rollback is an option to consider at all, but I would be happy to hear other people's opinions.

ash2k commented 6 years ago

It seems the cause of the issue is that the OSB protocol does not provide any means for the platform to consult the broker if the change is valid so that it could reject the request synchronously. There is a schema for validation, but there are no means to check whether the user is authorized. I was under the impression until recently that authorization is not a concern the OSB spec is trying to cover, and I think the user info section was added only recently to the spec (is this correct?). It just feels like this use case is not well supported at the spec level, and hence there is no good way to support it in Kubernetes using the current OSB spec.

I'd like to back up the discussion a bit and understand why upon receiving a "400 bad request" we can't just rollback the spec to the previous (good) state since the data provided was clearly bad.

In addition to what has been said above, here are two more things to think about:

  1. Rolling back would mean Service Catalog needs to store the previous "good" state. That might be OK, because it may also be needed if Service Catalog wants to support sending patches (or old+new state in any other form) to brokers.
  2. What do we do if there is no previous "good" state, i.e. it was a create request and it failed?

I agree this isn't the Kube way, but Kube wasn't designed to handle the situation where a remote system needed to be checked-with before a spec change was finalized. So this locking mechanism was put in place to deal with it.

Admission hooks are the mechanism to do this, but as I said above, AFAIK there is no way in the OSB spec to perform an authz check.

nilebox commented 6 years ago

On the implementation side, I propose that we remove the code blocking updates now (given that it doesn't solve what it was supposed to in most cases), and continue the discussion on the new design and possible changes in OSB and its implementation in Service Catalog.

duglin commented 6 years ago

@nilebox I don't know what the correct next step is yet, but I'd prefer if we discussed this on a Monday call (or some other call) and developed a design plan before we undo the current code that people spent quite a bit of time developing.

duglin commented 6 years ago

We're having some internal discussions around this issue and someone had a question: what happens if user A tried to delete an instance, but the delete failed at the broker for some reason (e.g. they don't have the authority to do so), and then user B tried to update the instance? Would the update be blocked due to the instance being marked for deletion? Can we get out of this situation w/o doing some kind of rollback?

nilebox commented 6 years ago

@duglin Deletion in Kubernetes is irreversible. Once an instance is marked with a DeletionTimestamp, you can't "undelete" it. Also, any spec changes are rejected (i.e. you can't roll back, and you can't let an authorized user touch the instance to fix authz either). The controller can only update the status and remove finalizers to proceed with deletion. This is definitely a problem worth discussing in the scope of authorization.

ash2k commented 6 years ago

Just to add to the above, in Kubernetes several authorization mechanisms are consulted before an object is marked for deletion (or is updated/created). Those mechanisms are built-in admission controllers, admission webhooks, and the configured authorization mechanism (RBAC by default). Kubernetes does not expect the backend for a controlled object to perform any authn/authz after the spec has been updated/deleted; it should happen before the change is made.

MHBauer commented 6 years ago

There is a sort of precedent here with Deployment rollouts and multiple ReplicaSets. I'm not sure what the analogy is exactly, or how Deployments and ReplicaSets work together, but rollout and undo work by switching between different versions of the spec (defined on the ReplicaSets), so there is a history of edits available to choose from.

nilebox commented 6 years ago

@MHBauer good point! The way rollbacks work in Deployments is that for every spec change (or rather a set of changes accumulated since the last rollout) a new ReplicaSet is created, and the previous ReplicaSet is retained with its pods scaled down to replicas: 0 instead of being deleted. The number of ReplicaSets to retain is configurable via https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#revision-history-limit

Another aspect is that deployments don't run rollbacks automatically hidden from the user, there is a very specific imperative command that is explicitly executed by the user:

kubectl rollout undo deployment/nginx-deployment

Also see https://kubernetes.io/cn/docs/concepts/workloads/controllers/deployment/#rollback-to

Rollback To: .spec.rollbackTo is an optional field with the configuration the Deployment should roll back to. Setting this field triggers a rollback, and this field will be cleared by the server after a rollback is done. Because this field will be cleared by the server, it should not be used declaratively. For example, you should not perform kubectl apply with a manifest with the .spec.rollbackTo field set.

And obviously, you can still update the Deployment spec even if there is a rollout in progress.


So in our case it would mean that:

  1. ServiceInstance would internally produce a ServiceInstanceSnapshot (I just made up the name) object for every OSB update request we send (a hypothetical sketch follows this list). This ServiceInstanceSnapshot must include all fields that can potentially be changed (plan, inline parameters, secret parameters). Sounds too hard to implement IMO.
  2. We would need to add an imperative command to kubectl (via plugin?) that would trigger the rollback. Or add a command to svcat to do that.
  3. We still shouldn't prevent spec updates if there is an update in progress. In other words, we still won't be able to track every single spec change separately.
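
For concreteness, the snapshot object from option 1 in the list above might look roughly like this (entirely hypothetical, as the made-up name suggests):

```go
// Hypothetical ServiceInstanceSnapshot; nothing like this exists in Service
// Catalog. The field set is only meant to show how much state an OSB update
// actually depends on.
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

type ServiceInstanceSnapshot struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec ServiceInstanceSnapshotSpec `json:"spec"`
}

type ServiceInstanceSnapshotSpec struct {
	// InstanceRef names the ServiceInstance this snapshot was taken from.
	InstanceRef string `json:"instanceRef"`
	// PlanName is the plan in effect when the OSB request was sent.
	PlanName string `json:"planName"`
	// Parameters are the inline parameters as sent to the broker.
	Parameters *runtime.RawExtension `json:"parameters,omitempty"`
	// ResolvedSecretParameters would have to copy secret-sourced values into
	// the snapshot, which is one reason the thread calls this too hard to do.
	ResolvedSecretParameters map[string][]byte `json:"resolvedSecretParameters,omitempty"`
}
```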

To be clear, I'm not a fan of this option; also, this feature looks like an opt-in extension to the standard behavior, i.e. if you never run the kubectl rollout undo command, you'll never use rollbacks.

nilebox commented 6 years ago

Also to add to the above: when you roll back a Deployment, the environment variables picked up through envFrom won't be rolled back, i.e. only the Deployment spec itself gets rolled back. In our case it means that we would only roll back the plan and inline parameters, and not secret parameters.

Not sure what we would achieve by implementing rollbacks with the same semantics as Deployments. Sounds like we would have the same problems as we currently have, just with more complexity.

It's probably better to build rollback semantics at a higher level than Service Catalog (keeping a history of both spec changes and secret parameters, and supporting imperative commands), if someone finds this useful.

MHBauer commented 6 years ago

I put this on the agenda for the 16th; we can discuss it there, and either decide or schedule another meeting.

Sounds like it maybe wasn't the best analogy in terms of total behavior.

Rephrasing, so I make sure I understand: we can't do an update to fix a broken spec, because the validation fails it. At the risk of shooting myself in the foot, it sounds like the validation might be too strict for this case. I think we're abusing validation enough as it is; I don't want to suggest more changes at the moment.

(This part kind of rambles) Thinking through multiple auth checks:

  1. both users have appropriate access

    • everything is fine.
  2. user 1 does, user 2 does not

    • user 1 is authorized to make the change and we are processing it; user 2 doesn't have access to make their change, so the broker rejects it, but only later. user 1 is irritated, and while they may be authorized to make user 2's change, they do not know what it is.
  3. user 1 does not, user 2 does

    • user 1 makes a change they are not allowed to, and our instance eventually goes into a failure state. user 2 wants to make a different change, and either
    1. sees the failure state and can't know what user 1 did to make it fail, or
    2. sees in-progress, thinks everything is okay, and makes their change, only to authorize something they didn't necessarily want to authorize.
  4. neither user has access

    • everything fails

(re-engage semi-coherent thought)

It seems like what we need to be doing is ensuring freshness (that user2 sees the updates user1 made, and can't make a change without including those) when changing specs, and make the users understand that they're authorizing the whole state.

At all times, I think we have the problem of user1 and user2 not necessarily knowing what change was made by the other, as they only want to authorize their own change, and by the time of the rejection, history is gone. Not sure what to do about this.

nilebox commented 6 years ago

It seems like what we need to be doing is ensuring freshness (that user2 sees the updates user1 made, and can't make a change without including those) when changing specs

This is guaranteed by design; otherwise the update will fail with a conflict error.

and make the users understand that they're authorizing the whole state

Indeed, this is the part we need to be explicit about in the docs to avoid any confusion.
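
The freshness guarantee comes from Kubernetes optimistic concurrency: every update carries the resourceVersion it was read at, and a stale write is rejected with a 409 Conflict. A sketch of the usual client-side pattern (the clientset import path and field names are assumed from the generated v1beta1 client of that era):

```go
// Sketch of the standard read-modify-write loop against the generated
// service-catalog clientset. If someone else updated the object in the
// meantime, the write fails with a Conflict (stale resourceVersion) and the
// loop re-reads the fresh object, so the caller always edits the latest spec.
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	// Import path assumed for the service-catalog client of that era.
	clientset "github.com/kubernetes-incubator/service-catalog/pkg/client/clientset_generated/clientset"
)

func updatePlan(client clientset.Interface, namespace, name, newPlan string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		instance, err := client.ServicecatalogV1beta1().ServiceInstances(namespace).Get(name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		instance.Spec.ClusterServicePlanExternalName = newPlan
		_, err = client.ServicecatalogV1beta1().ServiceInstances(namespace).Update(instance)
		return err // a Conflict here triggers another round of the loop
	})
}
```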

nilebox commented 6 years ago

@duglin if we want to elaborate on synchronous authorization via an admission controller, there is a new "authorization webhook" to be released in k8s 1.10: https://github.com/kubernetes/features/issues/516

duglin commented 6 years ago

admission controllers won't help in this case. In order for an AC to work, it would need to fully understand the business logic that the broker is going to apply while processing the update(). And since we don't even know which broker is going to be used until runtime, I don't see how we can do that. Or... if there were a way to call the broker's update() func in some kind of "check mode" that would ask whether the request would go through w/o actually doing it - but the OSB API doesn't define that.

I just don't see how to verify an incoming request w/o actually calling the broker to do it.

I suppose that could mean we actually invoke the broker.update() as part of an AC (or as part of the apiserver.put() after all of the k8s checks have been done) and only store the results once the call succeeded. Turns it into a sync flow though.

nilebox commented 6 years ago

@duglin you're right. What I meant is that the authorization webhook could be an integration point for an OSB authorization check request (TBD). As @ash2k pointed out above:

OSB protocol does not provide any means for the platform to consult the broker if the change is valid so that it could reject the request synchronously.

and

in Kubernetes several authorization mechanisms are consulted before an object is marked for deletion (or is updated/created). Those mechanisms are built-in admission controllers, admission webhooks, and the configured authorization mechanism (RBAC by default). Kubernetes does not expect the backend for a controlled object to perform any authn/authz after the spec has been updated/deleted; it should happen before the change is made.

Turns it into a sync flow though.

No, I didn't suggest sending an update request as part of an AC. I meant that we might need a separate authorization check request in OSB.

duglin commented 6 years ago

That might be an option for v3 - I'm just not sure how to do that in the v2.x timeframe.

duglin commented 6 years ago

and make the users understand that they're authorizing the whole state

Indeed, this is the part we need to be explicit about in the docs to avoid any confusion.

I'm not sure that fully covers it. To me, it's not just a matter of "approving" both user1's and user2's changes; it's associating the correct set of changes with each user. That model only works if, from the users' POV, user1 undid their changes and then user2 asked for both sets of changes. Granted, it's a fine line, because you could argue that as long as user2 understands the full set of changes then it's equivalent to them making them all, but that's really only true if user1 agrees to not have their ID associated with the first set of changes - which means rolling back their edits.

I'd like to separate out the concerns and deal with them independently:

1 - failures:

I wonder if it would help (or hurt :-) ) if we looked at a different use case and saw how we'd handle it, to see if it provides some insight into how we might deal with this... What if some action is taken on a svc-cat resource but the broker refuses to do it - for some reason - and will never do it? At what point should svc-cat give up and restore its state to match reality? This is probably more of a k8s question. I suspect the answer, in the normal k8s cases, is "never". It would be up to the user of the resource to issue a new put() to set the state appropriately (either to reality, or to some other state that the broker will accept). However, in most (all?) k8s cases you can ask for the current state - we can't ask that of the broker. We do, however, at least try to keep track of the broker's state in our Status resources (although we don't seem to save the old plan). So perhaps we either need to fully support keeping track of the broker's state in the Status section so the user can know what values should be used, or support an undo() func that does it for the user and thus avoids the locking aspect that prevents us from issuing a put() to fix things.

2 - blocking is not the kube way:

While I'm sympathetic to this, I don't think we can take an absolutely firm stand on it, because we're dealing with another system that doesn't operate in the same kind of "desired state" model that k8s does - the OSB API is more of a serialized/synchronized RPC system where each action must be associated with the correct user.

So, even if we did remove the blocking aspect and allowed subsequent changes, since user1 MUST still be associated with their changes (unless there is some kind of undo of them), we're faced with having some sort of queue of operations, which could create a long list of pending actions that are all based on one that will never be allowed to happen. That is probably worse than asking people to serialize their requests, because it presents those subsequent users with a lie about the current state of the world.

Also, this isn't about the state of the world not matching the desired state "yet"; it's about the desired state being invalid and something that should never have been allowed to be stored in the spec in the first place - there is a difference.

nilebox commented 6 years ago

we can't ask that of the broker

Well, that's a hole in the OSB spec. There's simply no way for Service Catalog to tell what the instance state is without a GET request if the update request has failed with a timeout or a 500 Internal Server Error.

it's about the desired state being invalid and something that should never have been allowed to be stored in the spec in the first place

There is no such thing as "invalid desired state" in k8s. If the initial request was valid (i.e. JSON schema validation passed, and all admission controllers let the request through) and the spec was updated, the desired state is considered valid. If there is an error reconciling it, it is up to the user (or automation around k8s) to apply new changes that fix the desired state.

nilebox commented 6 years ago

I feel like in order to replicate CloudFoundry mechanics and meet all requirements including authorization, we would have to get rid of etcd storage and controller-manager completely, and instead forward all requests directly to the brokers without persisting instances and bindings at all.

But again, that is not possible even in theory, since there are no GET endpoints for instances and bindings in the OSB API, which is a big mistake IMO.

duglin commented 6 years ago

There is no such thing as "invalid desired state" in k8s.

yup - exactly. But in our case we may have one.

I feel like in order to replicate CloudFoundry mechanics and meet all requirements ...instead forward all requests directly to the brokers without persisting instances and bindings at all

Well, we could store them in etcd once the broker has completed its processing - but that turns these operations into RPC ones. And I still can't figure out how bad that really would be.

nilebox commented 6 years ago

We haven't decided yet, but the proposed set of changes most people agreed with at today's call was:

  1. Get rid of locks completely, i.e. merge #1790
  2. Add support for an imperative rollback command for a nicer UX (if the user doesn’t want to take ownership of changes in the desired state that differ from the last applied state). Rollbacks are performed only for the spec (no secret parameters rollback).
  3. Maybe: an opt-in admission controller for locks if they are still needed for some SC use cases.

nilebox commented 6 years ago

Now that #1765 is merged, this issue has become critical for the non-happy path and needs to be fixed ASAP.

fejta-bot commented 5 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 5 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 5 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 5 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/service-catalog/issues/1755#issuecomment-504633059): /close