Intended concurrency breaking change in v2.12 ?

gberche-orange commented 6 years ago

I lately realized that 2.12 release included changes in #193 that appear to me as a concurrency-related breaking change. Sorry I had missed this change at that time. It is surprising that the 2.12 release does not mention this breaking change neither as a semantic version bump nor as an explicit entry in its release note.

Prior to 2.12, service brokers were protected against concurrent calls for a given resource by the platform:

The marketplace MUST ensure that service brokers do not receive requests for a service instance while an asynchronous operation is in progress.

Since 2.12, the spec specifies brokers can be invoked concurrently, and that the responsibility for detecting concurrency calls made by the platform now lies on the service brokers:

If a broker receives a request that it is not able to process due to other activity being done on that resource then the broker MUST reject the request with an HTTP 422 Unprocessable Entity.

We do have at Orange async brokers that would break if the platform would invoke them concurrently on a given service instance e.g.

receiving a deprovisionning request while an async service instance provisioning is in progress.
receiving a deprovisionning request while an async service instance update is in progress.

Like many brokers, our brokers strive to limit operational requirements and often choose to be stateless. Complying to concurrency control as suggested in #193, places an heavy burden on brokers. This would require to our brokers include some kind of centralized persistent store or lock system for tracking pending requests per service instance, and then reject concurrent requests explicitly.

@duglin I wonder whether this behavior change was intended in #193 or is rather only an undesired wording change side-effect (the PR discussions don't provide much relevant details).

Also, I'd like to know if platforms started leveraging this change and relaxed the concurrency control they were applying prior to 2.12.

@mattmcneeney, CloudFoundry seems to still apply concurrency controls as far as I read from the tests below. Could you please confirm my understanding, and that in the future CF won't start sending concurrent requests to brokers without notifying the community ?

https://github.com/cloudfoundry/cloud_controller_ng/blob/7c61ce619607d59c7df117217cdc80d9ffc58aa5/spec/unit/actions/services/service_instance_delete_spec.rb#L207-L223 https://github.com/cloudfoundry/cloud_controller_ng/blob/7c61ce619607d59c7df117217cdc80d9ffc58aa5/spec/unit/actions/services/service_key_create_spec.rb#L36-L37

Is there are use-cases for plaforms to not apply concurrency controls as in v2.11 and rather send concurrent requests to brokers for a given resource ?

If so, I'd suggest that the OSB specs adds a way for brokers to declaratively indicate to platforms whether or not they support concurrent calls, freeing them from responsibility to actually detect and reject concurrent calls. This could be for instance a new flag in the /catalog endpoint service response object or a new input query parameter set to true on concurrent invocations (similar to the accepts_incomplete param).

If not, if might be good to restore back concurrency control by platforms in the specs.

gberche-orange commented 6 years ago

Is there are use-cases for plaforms to not apply concurrency controls as in v2.11 and rather send concurrent requests to brokers for a given resource ?

Seems #196 suggests cancelling an async operation as the primary use-case for #193

That's part of the reason I worded #193 the way I did - I wanted to loosen it a bit. I wanted to allow for the platform to send a DELETE at any time in order to recover from a hung request. The broker is always free to reject the DELETE request, but this at least gives the platform a way to try to kick/stop things.

There might be other options for supporting cancelling of a pending operation that don't put concurrency detection burden on the service broker to reject them: e.g. introducing an explicit new cancel endpoint that a broker can easily implement by systematically returning a 422 (without having to check for actual concurrency occurence).

duglin commented 6 years ago

@gberche-orange very sorry for the late response.

If I'm remembering correctly, there are several reasons for the change...

the old text only blocked async concurrent requests, not sync one. Which is a bit odd. Its possible for a sync request to take a long time so why wouldn't the spec block all concurrent requests?
also, we can't assume the platform is single threaded. Meaning, while it has an outstanding sync request it could have another part of the platform issuing a second sync request.
no matter what burden we try to place on the Platform the brokers should be prepared to handle misbehaving platforms. Whether platforms don't adhere to the spec on purpose or by accident, a broker should protect itself and its data (instances, bindings) in the case of spec-failures. Yes, the broker owner can complain that the platform misbehaved, but that will not change the unhappy state of the customer who may have had their instances go into a bad state as a result. I doubt they would care whose fault it was.
if brokers can't handle multiple requests at a time (async or sync) then it should block them itself - see previous point - then it can handle it for async and sync at the same time.
we are seeing more cases where the definition of "platform" is growing beyond a single entity. For example, there are now cases where services and instances are shared across multiple platforms - trying to coordinate all requests in those cases so that they are single-threaded may not be possible.

There may have been others, but I think when we talked about this at a previous f2f we decided that the current state of the spec didn't really make sense - so rather than trying to support a broker model we opted for one that seemed more logical.

gberche-orange commented 6 years ago

Thanks @duglin for the update. Some comments below:

we can't assume the platform is single threaded. Meaning, while it has an outstanding sync request it could have another part of the platform issuing a second sync request.

The platform is assigning unique guids to service instances, and performing polling of async responses. Therefore, it most often maintains transactional state around service instances, and therefore can usually track concurrency of outgoing calls it makes.

we are seeing more cases where the definition of "platform" is growing beyond a single entity. For example, there are now cases where services and instances are shared across multiple platforms - trying to coordinate all requests in those cases so that they are single-threaded may not be possible.

Same, concurrency control does not imply single threaded interactions with the brokers. AFAIK, the CF CC isn't single threaded and currently performs concurrency controls through its database. For multi-platforms "marketplaces", I guess some kind of coordination is used by those platforms which is likely to support concurrency control as well.

no matter what burden we try to place on the Platform the brokers should be prepared to handle misbehaving platforms. [...]

That's fine to have brokers protect themselves for platforms not compliant to the specs. That's a different story of the specifications allowing platforms to make concurrent platforms and pushing the burden on service broker authors.

I'd therefore suggest to revisit this spec section so that desired furture use-cases (such as cancellation) gets supported while keeping the broker authors responsibility low.

duglin commented 6 years ago

The platform is assigning unique guids to service instances, and performing polling of async responses. Therefore, it most often maintains transactional state around service instances, and therefore can usually track concurrency of outgoing calls it makes.

Creating UUIDs doesn't require much (if any) synchronization though.

I guess some kind of coordination is used by those platforms which is likely to support concurrency control as well.

We can't assume that. For example, if the "platform" is actually made up of Docker, CF and Kube then there may not be any coordination involved at all w.r.t. interactions with the brokers - unless you force all interactions to go thru a single point of control. Which is possible, but I wouldn't want to mandate that at the spec level.

That's fine to have brokers protect themselves for platforms not compliant to the specs. That's a different story of the specifications allowing platforms to make concurrent platforms and pushing the burden on service broker authors.

I agree that there is a difference between what will happen in code and what the spec should say. And in this case I actually think the spec should allow more, not less. Given that brokers should protect themselves and that the spec only blocked concurrent async requests (not sync ones - which is really odd), combined with the idea that we can't mandate an implementation choice on platforms (meaning single point of control) I think the choice we made is correct.

I'm curious though... how do you deal with concurrent sync requests? The spec allows for that - even w/o this change.

Either way, I'll try to bring this up on one of our call, or the next f2f.

gberche-orange commented 6 years ago

Thanks for the additional details

I'm curious though... how do you deal with concurrent sync requests? The spec allows for that - even w/o this change.

My understanding is that CF only sends concurrent binding or unbinding request to a service instance, but never sends concurrent updates/deletes/deprovisions, and notifies the user that a request is in progress. See service instance delete and service key tests.

Our broker supports concurrent service binding creation requests which their distinct binding ids (i.e. don't need concurrency control, unlike other potential concurrent requests that the specs allows)

mattmcneeney commented 6 years ago

Apologies also for the delayed response @gberche-orange. You're right that CF does currently prevent concurrent calls to a service broker regarding a specific service instance. We don't have that for binding (yet) as we only support synchronous creation and deletion. Rest assured that this will be in place for the foreseeable future!

Whilst this is unlikely to change, I do agree with some of @duglin's thoughts above, including:

The definition of "Platform" is becoming vaguer, and we are actively looking into how service brokers can be consumed from multiple platforms and how service instances could be shared across those platforms. Ensuring no concurrent calls can be made in that world becomes very complex, and would likely lead to brokers having to be defensive just in case.
The spec. only ruled out concurrent async requests, so it's likely that many brokers are already being defensive in case of concurrent sync requests.

Is the broker you're talking about completely stateless? Is that why blocking concurrent requests is tricky?

gberche-orange commented 6 years ago

thanks @mattmcneeney for your reply and you confirmation about current CF behavior w.r.t. concurrency.

Ensuring no concurrent calls can be made in that world becomes very complex [...]

I understand from https://github.com/Peripli/specification#how-it-works that broker calls will go through a Service Manager core component which can easily control concurrency. Is there other architectures being considered beyond "Service Manager" I would have missed ?

The Service Manager [...] core component communicates with the registered brokers and acts as a platform per Open Service Broker specification for them. [...] When the Platform Instance makes a call to the service broker, for example to provision a service instance, the broker proxy accepts the call, forwards it to the Service Manager, which in turn forwards it to the real broker.

Is the broker you're talking about completely stateless? Is that why blocking concurrent requests is tricky?

Our broker requests service provisioning to a CI/CD system, and only waits for async completion of requests. The state is therefore managed within the CI/CD system for which we don't yet have an API to lookup a pending request for a given service instance. Detecting concurrent calls is therefore indeed tricky. See https://github.com/orange-cloudfoundry/cf-ops-automation-broker#overview for more details.

I suspect there will be other brokers in the community that would be affected when platforms start sending concurrent updates/deletes/deprovisions requests for a given service instance.

My suggestion is that OSB specs explicitly support for non-concurrent brokers, e.g.

specify that service instance provision/update/deprovision calls are serialized. When a cancel operation gets introduced in the future, only the cancel operation would be sent concurrently to other pending async operations.
brokers explicit mention in their catalog whether they support being called concurrently for a same service instance: concurrent: true. When a broker opts-out of concurrent calls, then platforms must serialize calls.

mattmcneeney commented 6 years ago

Thanks @gberche-orange

I understand from https://github.com/Peripli/specification#how-it-works that broker calls will go through a Service Manager core component which can easily control concurrency. Is there other architectures being considered beyond "Service Manager" I would have missed ?

I believe there are Service Brokers in the wild that use things like configuration parameters to allow for 'sharing' of service instances across platforms without the need for an intermediary like Service Manager. In this case, the Service Broker is simply registered into multiple platforms, and so there is no way to enforce serial requests across these. If the Service Manager specification enforces serial requests, then you're right that conforming components should protect these kind of brokers.

When a cancel operation gets introduced in the future, only the cancel operation would be sent concurrently to other pending async operations.

Wouldn't supporting a cancel operation require some state? Or would you have enough information in the request to cancel something in your CI/CD system?

brokers explicit mention in their catalog whether they support being called concurrently for a same service instance: concurrent: true. When a broker opts-out of concurrent calls, then platforms must serialize calls.

This could work, but only if platforms obey the flag, and only if there is something enforcing concurrency if a broker is being used in multiple platforms.

gberche-orange commented 6 years ago

Thanks @mattmcneeney

Wouldn't supporting a cancel operation require some state? Or would you have enough information in the request to cancel something in your CI/CD system?

In our current state, our broker would simply reject incoming cancel operations, expressing to platform it does not support cancel operations. The other update/unprovision calls would safely assume to come serialized.

duglin commented 6 years ago

At the last f2f meeting we discussed this and decided to keep it as-is and to close the issue.

openservicebrokerapi / servicebroker

Intended concurrency breaking change in v2.12 ? #467