Closed kibbles-n-bytes closed 5 years ago
I don't think this is a realistic "MUST" - there's no way to make a guarantee that after the broker returns one of these results that the platform will be able to stop polling. Say for example that there's a catastrophic outage in the platform that results in the result of the operation being lost before it is persisted. Clearly you have to be able to retry the endpoint if something like that happens.
I agree that the broker should hold on to the result for more than just one GET response.
For the rest, it seems to me that "410 Gone" should be interpreted as a terminal condition and the platform should treat it as such. So in the case of a provision/bind, it should do orphan mitigation on that resource and deprovision/unbind it. Continuing to poll seems odd since a "410 Gone" means that its never coming back - so it shouldn't be returned on a temporary burp.
CF will investigate how it deals with platform failure before writing to the DB.
Our ansible broker maintains the lastoperation state for a while. As long as the operation field is given to the last_operation call we will return the state.
In CF, if after hitting the last_operation
endpoint we fail to save that the operation has finished to the database (when "state": "succeeded"
or "state": "failed"
), then polling continues.
Therefore CF expects that brokers should hold onto the result of the last operation for a service instance after it has returned a succeeded or failed state.
Hey @kibbles-n-bytes should we try to revive this issue?
I think your first suggestion sounds sensible and could be a good addition to the spec:
- Brokers should be encouraged to keep the operation around for a reasonable amount of time even after returning a "succeeded"/"failed" response, in case it hasn't been fully processed.
hrm that would suck for us, ours goes away now :) it used to hang around in storage for a while.
If this is only a SHOULD, then you SHOULD be OK though @jmrodri, right? 🙂
@mattmcneeney Yeah, we should get some resolution on this I think!
@jmrodri How quickly does the operation go away for you? Does it disappear immediately after the first "succeeded" GET?
@mattmcneeney yep, SHOULD will take care of it :)
@kibbles-n-bytes this only causes an issue for our deprovision case, we recently attached the job states to the instances so when the deprovision succeeds we return "succeeded" then the instance and job state go away. So a subsequent deprovision call will return 410 GONE. So maybe we're not in horrible shape.
Sounds like we are in agreement and should open a PR! @kibbles-n-bytes since you had the best ideas for this, I've assigned this issue to you 🙂
The spec says that when polling the
last_operation
endpoint for async operations:Does this mean the broker only guarantees returning a terminal state response once? If the platform has an internal issue and is unable to process that response after having hit the
last_operation
endpoint once, is it out of luck?If so, what response should the platform expect? For everything but a 400, the platform is told to interpret it as a situation to keep polling, and our description of 400 does not encompass this situation. This means it could potentially be a long time before the platform realizes something is wrong, and thus a long time of the user being billed for a resource before the platform attempts orphan mitigation.
Personally, I feel like: 1) Brokers should be encouraged to keep the operation around for a reasonable amount of time even after returning a "succeeded"/"failed" response, in case it hasn't been fully processed. 2) We should include in the spec some behavior expectation for when the broker has removed an operation so that platforms wouldn't keep polling. 3) We should potentially give operations a broker-defined expiration date, or a way for platforms to confirm to the broker that the operation can be safely removed.