openservicebrokerapi / servicebroker

Open Service Broker API Specification
https://openservicebrokerapi.org/
Apache License 2.0

Define a preferred async provision check #709

Closed ransombriggs closed 3 years ago

ransombriggs commented 4 years ago

What is the problem?

We were discussing the semantics of 200 during provisioning and binding and were unsure why create would be called a second time, since the :instance_id MUST be a globally unique non-empty string and we are using orphan deletion to mitigate creation issues. I then found the reason in the 202 Accepted definition:

This triggers the Platform to poll the Last Operation for Service Instances endpoint for operation status. Note that a re-sent PUT request MUST return a 202 Accepted, not a 200 OK, if the Service Instance is not yet fully provisioned.
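To make the quoted rule concrete, here is a minimal broker-side sketch of how the status code for a first or re-sent PUT might be chosen. The `Instance` record, the in-memory `store`, and the `provision_status` function are hypothetical names used only for illustration; the 409 branch for mismatched attributes follows my reading of the spec's provisioning response table and is not something stated in this thread.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Instance:
    # Hypothetical broker-side record of a Service Instance.
    params: dict
    provisioned: bool = False  # stays False while the async operation runs


def provision_status(store: Dict[str, Instance], instance_id: str, params: dict) -> int:
    """Return the HTTP status for PUT /v2/service_instances/:instance_id."""
    existing = store.get(instance_id)
    if existing is None:
        # First request: record the instance, start async work, ask the platform to poll.
        store[instance_id] = Instance(params=params)
        return 202
    if existing.params != params:
        # Same instance_id with different attributes (assumed to map to 409 Conflict).
        return 409
    # Re-sent PUT with identical attributes: 202 until fully provisioned, then 200.
    return 200 if existing.provisioned else 202
```

In this sketch a re-sent PUT never starts a second provisioning operation; it only reports where the existing one stands.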

From the way this reads, there appear to be two methods for checking whether something is successfully provisioned or bound: either poll last_operation until success, or call PUT until it returns 200. Our provider implementation will support both for interoperability, but we are wondering which one we should prefer for our platform implementation.
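For reference, the first of the two methods (polling last_operation) could look roughly like the platform-side sketch below. The function name `wait_for_provision`, the injected `get_last_operation` callable, and the interval and retry limits are assumptions made for illustration; the state strings "in progress", "succeeded", and "failed" are the ones the spec defines for the last operation response.

```python
import time
from typing import Callable


def wait_for_provision(
    get_last_operation: Callable[[], dict],
    interval_seconds: float = 5.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the broker reports a terminal state; True means succeeded."""
    for _ in range(max_polls):
        # get_last_operation is a hypothetical wrapper around
        # GET /v2/service_instances/:instance_id/last_operation.
        state = get_last_operation().get("state")
        if state == "succeeded":
            return True
        if state == "failed":
            return False
        # "in progress" (or anything unrecognized): wait and poll again.
        sleep(interval_seconds)
    raise TimeoutError("provisioning did not reach a terminal state")
```

Passing the HTTP call in as a callable keeps the loop independent of any particular client library and makes it easy to exercise with canned responses.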

Who does this affect?

platform authors

Do you have any proposed solutions?

The specification description should say whether last_operation or PUT until 200 is the preferred way to detect if something is fully provisioned.

ransombriggs commented 4 years ago

After further discussion on our teams, we realized that we never retry on failure and instead clean up orphans, but some platforms could retry instead of doing orphan removal. A best practice for platforms that re-use instance_id would be good as well. Additionally, an explanation of why a synchronous provision that returns a 201 may get a second PUT request would be helpful for providers.

gberche-orange commented 4 years ago

From the way this reads, there appear to be two methods for checking whether something is successfully provisioned or bound: either poll last_operation until success, or call PUT until it returns 200. Our provider implementation will support both for interoperability, but we are wondering which one we should prefer for our platform implementation.

I understand that the requirement to support a duplicate PUT returning a 200 is there to account for OSB client platforms that use eventual consistency (instead of atomic transactional consistency) and may therefore make multiple provisioning calls in parallel (namely the K8S svcat, see https://github.com/kubernetes-sigs/service-catalog/issues/1639).

It is possible that brokers have optimized for the canonical case and only support K8S svcat duplicate calls with some performance impact (additional calls and response latency). This is how I dealt with it in my OSB facade; see the section "Handle race conditions (including for K8S dups)" of the implementation notes.

Therefore, if client platforms have the choice, I would suggest avoiding the duplicate calls and instead relying on the sync or async responses to infer the success or failure of a provisioning request.
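As a rough illustration of that suggestion, a platform could decide what to do from the status code of its single PUT rather than re-sending it. The function name `next_step` and the returned labels are hypothetical, and treating every other status as a failure is a simplification for this sketch.

```python
def next_step(status_code: int) -> str:
    """Map the response to the initial provisioning PUT onto the platform's next action."""
    if status_code == 201:
        # Synchronous provision completed as a result of this request.
        return "done"
    if status_code == 202:
        # Asynchronous provision accepted: poll last_operation, do not re-send the PUT.
        return "poll_last_operation"
    if status_code == 200:
        # Instance already existed and is fully provisioned.
        return "done"
    # Simplification: anything else is treated as a failed provision.
    return "failed"
```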

rsampaio commented 4 years ago

#561 has some context on why 200 OK was added to the spec. I believe the spec should be cleaned up to remove ambiguous methods of verifying the completion of a provisioning request. I created PR #717, which removes mentions of status 200 and adds guidance for platforms relying on eventual consistency, and I would appreciate feedback.

rsampaio commented 3 years ago

fixed by #717