"last_operation" behavior after having returned "succeeded"/"failed" once?

kibbles-n-bytes commented 6 years ago

The spec says that when polling the last_operation endpoint for async operations:

A response with "state": "succeeded" or "state": "failed" MUST cause the platform to cease polling.

Does this mean the broker only guarantees returning a terminal state response once? If the platform has an internal issue and is unable to process that response after having hit the last_operation endpoint once, is it out of luck?

If so, what response should the platform expect? For everything but a 400, the platform is told to interpret it as a situation to keep polling, and our description of 400 does not encompass this situation. This means it could potentially be a long time before the platform realizes something is wrong, and thus a long time of the user being billed for a resource before the platform attempts orphan mitigation.

Personally, I feel like: 1) Brokers should be encouraged to keep the operation around for a reasonable amount of time even after returning a "succeeded"/"failed" response, in case it hasn't been fully processed. 2) We should include in the spec some behavior expectation for when the broker has removed an operation so that platforms wouldn't keep polling. 3) We should potentially give operations a broker-defined expiration date, or a way for platforms to confirm to the broker that the operation can be safely removed.

pmorie commented 6 years ago

I don't think this is a realistic "MUST" - there's no way to make a guarantee that after the broker returns one of these results that the platform will be able to stop polling. Say for example that there's a catastrophic outage in the platform that results in the result of the operation being lost before it is persisted. Clearly you have to be able to retry the endpoint if something like that happens.

duglin commented 6 years ago

I agree that the broker should hold on to the result for more than just one GET response.

For the rest, it seems to me that "410 Gone" should be interpreted as a terminal condition and the platform should treat it as such. So in the case of a provision/bind, it should do orphan mitigation on that resource and deprovision/unbind it. Continuing to poll seems odd since a "410 Gone" means that its never coming back - so it shouldn't be returned on a temporary burp.

duglin commented 6 years ago

CF will investigate how it deals with platform failure before writing to the DB.

jmrodri commented 6 years ago

Our ansible broker maintains the lastoperation state for a while. As long as the operation field is given to the last_operation call we will return the state.

mattmcneeney commented 6 years ago

In CF, if after hitting the last_operation endpoint we fail to save that the operation has finished to the database (when "state": "succeeded" or "state": "failed"), then polling continues.

Therefore CF expects that brokers should hold onto the result of the last operation for a service instance after it has returned a succeeded or failed state.

mattmcneeney commented 6 years ago

Hey @kibbles-n-bytes should we try to revive this issue?

I think your first suggestion sounds sensible and could be a good addition to the spec:

Brokers should be encouraged to keep the operation around for a reasonable amount of time even after returning a "succeeded"/"failed" response, in case it hasn't been fully processed.

jmrodri commented 6 years ago

hrm that would suck for us, ours goes away now :) it used to hang around in storage for a while.

mattmcneeney commented 6 years ago

If this is only a SHOULD, then you SHOULD be OK though @jmrodri, right? 🙂

kibbles-n-bytes commented 6 years ago

@mattmcneeney Yeah, we should get some resolution on this I think!

@jmrodri How quickly does the operation go away for you? Does it disappear immediately after the first "succeeded" GET?

jmrodri commented 6 years ago

@mattmcneeney yep, SHOULD will take care of it :)

@kibbles-n-bytes this only causes an issue for our deprovision case, we recently attached the job states to the instances so when the deprovision succeeds we return "succeeded" then the instance and job state go away. So a subsequent deprovision call will return 410 GONE. So maybe we're not in horrible shape.

mattmcneeney commented 6 years ago

Sounds like we are in agreement and should open a PR! @kibbles-n-bytes since you had the best ideas for this, I've assigned this issue to you 🙂

openservicebrokerapi / servicebroker

"last_operation" behavior after having returned "succeeded"/"failed" once? #359