kube-object-storage / lib-bucket-provisioner

Library for the dynamic provisioning of object store buckets to be used by object store providers.
Apache License 2.0

Redesign: Stable and Idempotent Sync Ops #168

Closed: copejon closed this issue 4 years ago

copejon commented 5 years ago

Background

Today, the library reacts to an OBC (ObjectBucketClaim) add event by calling Provision()/Grant() and creating a CM (ConfigMap), a Secret, and an OB (ObjectBucket). If at any point during this process an error occurs, the library calls Delete().

This creates a situation where Provision() or Grant() may return success, but API server failures cause the library to adopt a scorched-earth policy and completely destroy all artifacts, including the new bucket, authentication credentials, and API resources.

Problem 1

This is extremely inefficient. Once every provisioning artifact is gone, it all has to be recreated. Since most errors are persistent, this cycle repeats with exponential backoff, spamming logs, API calls, and object store operations.

Problem 2

It is prone to instability. If cleanup fails, the sync function exits and leaves an irreconcilably dirty environment. The next sync starts off assuming a clean environment and either

a) errors out due to collisions on statically named resources, or
b) avoids collisions by using generated names, spamming the object store with artifacts that are orphaned immediately.

The conclusion is that when the library works, it's great. When it doesn't, it actively hinders debugging by thrashing artifact creation, spamming logs, and likely never reconciling.

The Proposal

The library must be more nuanced in how it reconciles the Actual State of the World (ASotW) with the Desired State of the World (DSotW). Do away with cleanup-on-error entirely. Preserve backwards compatibility.

Instead, every sync operation should first assess the ASotW, determine what is unfulfilled, and proceed to selectively reconcile the situation.

For example, suppose this abbreviated OBC is created:

kind: ObjectBucketClaim
metadata:
  name: my-claim
spec:
  storageClassName: bucket-class
  bucketName: my-bucket

The library calls Provision(), which succeeds and returns an OB. At any point after this, an error may occur. Instead of immediately cleaning up and exiting, the library must preserve state for the next sync. This is a three-fold operation:

  1. State must be stored in memory and be safely accessible between syncs. Golang's sync.Map is used internally by the k8s external-provisioner library and provides a means of persisting sync state. Since the data we want to store is encapsulated in the OB, the OB is a good candidate to store in the map. The key would be the queue key, i.e. the string value of the OBC's "namespace/name", as it's always the same between sync ops (see the sketch after this list).

  2. All API resources should be created, as far as is reasonable. If Provision() returns an error, then nothing should be posted to the API server. If Provision() succeeds, then an error creating one resource should not prevent the creation of the others.

  3. OBC and OB status must reflect what's going on. Status.Phase should reflect the currently observed state. Most likely, an error will cause the OBC to transition from Pending, to Error, to Bound (assuming the error can be reconciled). Status.Conditions must be used to communicate the specific errors that are occurring. This should be a purely reporting operation and not be used to determine ASotW. For why this is the case, see the long-running k8s issue on the subject.
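
A minimal sketch of point 1, assuming the library's v1alpha1.ObjectBucket type (the import path is inferred from the repo layout) and using illustrative helper names that are not part of the library:

package provisioner

import (
    "fmt"
    "sync"

    "github.com/kube-object-storage/lib-bucket-provisioner/pkg/apis/objectbucket.io/v1alpha1"
)

// obCache persists the in-memory OB returned by Provision()/Grant() across
// sync attempts. It is keyed by the OBC's "namespace/name" queue key.
var obCache sync.Map

func obcKey(namespace, name string) string {
    return fmt.Sprintf("%s/%s", namespace, name)
}

// cacheOB is called immediately after Provision()/Grant() succeeds, before any
// API call that could fail.
func cacheOB(namespace, name string, ob *v1alpha1.ObjectBucket) {
    obCache.Store(obcKey(namespace, name), ob)
}

// loadOB retrieves the OB cached by a previous, partially failed sync.
func loadOB(namespace, name string) (*v1alpha1.ObjectBucket, bool) {
    v, ok := obCache.Load(obcKey(namespace, name))
    if !ok {
        return nil, false
    }
    return v.(*v1alpha1.ObjectBucket), true
}

// clearOB is called once a sync fully succeeds or the OBC is deleted.
func clearOB(namespace, name string) {
    obCache.Delete(obcKey(namespace, name))
}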

Related to #151

Critical questions, suggestions, and ideas welcome @guymguym @dannyzaken @nimrod-becker @liewegas @travisn @jeffvance @erinboyd

travisn commented 5 years ago

Makes sense to move in this direction instead of attempting to remove all the resources upon failure. The general assumptions I would expect are:

copejon commented 5 years ago

@travisn Bingo! This is pretty much exactly what I'm envisioning. As an added benefit, logging should be dramatically cleaner as well, which lends itself to easier debugging.

If at all possible, the only state that needs to be kept is informational only (e.g. on the OBC or OB status). Otherwise, the operator only needs to reconcile the difference between desired and actual state to create the individual resources such as object users, secrets, buckets, etc.

When I say state, I'm talking about the endpoint and authentication data. There are a couple of cases where we'd want to persist this data in memory between sync calls. If the sync returns before the OB is created, the endpoint and auth data is lost. If the OB is created but the Secret is not, the auth data is lost (because the OB API resource does not store auth data).

When Provision() or Grant() returns, the immediate operation should be to write the information to the map before doing anything that may generate an error.

Once a sync succeeds, the final step would be to delete the kv from the map.
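
Roughly, that ordering inside a single sync could look like the sketch below. It reuses the cache helpers sketched earlier; createSecret, createConfigMap, and createOB are hypothetical placeholders, not library functions:

// syncProvision sketches the roll-forward flow: cache the OB before any
// fallible API write, create whatever is still missing, and only drop the
// cached entry after a fully successful sync.
func syncProvision(obc *v1alpha1.ObjectBucketClaim, provision func() (*v1alpha1.ObjectBucket, error)) error {
    ob, cached := loadOB(obc.Namespace, obc.Name)
    if !cached {
        var err error
        if ob, err = provision(); err != nil {
            return err // nothing was created, so there is nothing to clean up
        }
        cacheOB(obc.Namespace, obc.Name, ob) // persist before anything that can fail
    }

    // One resource failing to create should not block the others.
    var errs []error
    if err := createSecret(obc, ob); err != nil {
        errs = append(errs, err)
    }
    if err := createConfigMap(obc, ob); err != nil {
        errs = append(errs, err)
    }
    if err := createOB(obc, ob); err != nil {
        errs = append(errs, err)
    }
    if len(errs) > 0 {
        return fmt.Errorf("sync incomplete, will retry: %v", errs)
    }

    clearOB(obc.Namespace, obc.Name) // fully reconciled; drop the cached state
    return nil
}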

guymguym commented 5 years ago

Hey @copejon Sounds like a positive direction. It sounds like this is the default behavior that any controller should adopt - a roll-forward approach (rather than roll-back and retry). I don't know exactly why there are questions about how to implement it, because it just sounds like the provisioner should get the Secret, CM, and OB and then create/update/noop according to the expected state.
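
For illustration, that get-then-create/noop shape for one of the resources might look roughly like this (client-go 0.18+ signatures; ensureSecret is a hypothetical helper, not the library's code):

package provisioner

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// ensureSecret creates the credentials Secret only if it does not already
// exist; an existing Secret is treated as a no-op rather than an error.
func ensureSecret(ctx context.Context, c kubernetes.Interface, desired *corev1.Secret) error {
    _, err := c.CoreV1().Secrets(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
    if err == nil {
        return nil // already exists: noop (or compare and update if drift matters)
    }
    if !apierrors.IsNotFound(err) {
        return err // transient API error; surface it and let the next sync retry
    }
    _, err = c.CoreV1().Secrets(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
    return err
}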

dannyzaken commented 5 years ago

Why must the state be stored in a map? Isn't it enough to just get the resources (OB, Secret, CM) and continue from where the previous reconciliation left off?

nimrod-becker commented 5 years ago

Agree with Danny. Unless we are missing something here, it's better to discover the state than to save it and handle updates (and potential collisions between the actual state and saved state). What's preventing us from doing so?

copejon commented 5 years ago

@dannyzaken Fair point - I could have been a lot more clear in that paragraph.

I agree, ASotW should be assessed from the API server. Whether those objects were created or not is what we want to know at the beginning of each sync, and the API server is the ASotW.

The map wouldn't indicate ASotW, but it could persist bucket-related data that should be in those API objects. For example, if a Secret failed to create in the previous iteration, at the beginning of the next sync we'd have no way of retrieving the access keys. IMO, persisting them in memory until the operation succeeds or the OBC is deleted is the safest way to do so.

dannyzaken commented 5 years ago

@copejon I agree that in some cases (like failing to store the secret) the provisioner will not have enough information to continue. But these cases should be handled anyway to cover any crashes of the provisioner, where persisting the state in memory is not sufficient.

if we agree that these cases must be covered, then why not use the same implementation for all cases?

copejon commented 5 years ago

if we agree that these cases must be covered, then why not use the same implementation for all cases?

Are we talking about persisting the temporary data on disk then? I'm 100% for a solution that can survive provisioner crashes - I'm only concerned that coupling that implementation with the one proposed here will prolong development.

dannyzaken commented 5 years ago

I don't think that persisting to disk is a good direction to take, but I think that the provisioner should still be able to recover from crashes at any point.

Is all other state information (other than the access keys) stored in the OB? If so, then besides the Secret, recovery should be easy, shouldn't it?

copejon commented 5 years ago

I don't think that persisting to disk is a good direction to take, but I think that the provisioner should still be able to recover from crashes at any point.

Yeah, agree. But we'd need some stable, persistent location to write to. My opinion is that you can't fully trust the API server to be that persistent store. Connectivity becomes an issue when you have a lot of concurrent calls to the server and every so often, it just fails to connect.

I still think this is a separate, larger issue and we should focus first on stabilizing the runtime.

Is all other state information (other than the access keys), stored in the OB? If so, then beside the secret, recovery should be easy, isn't it?

The in-memory OB that Provision() returns contains the auth and endpoint data. When the OB is posted to the API server, the auth data is omitted.

dannyzaken commented 5 years ago

If failures to write to the API server are common enough, then using sync.Map as temporary storage is a good solution.

As for crashes, I guess that if any data is missing, the provisioner can always fall back to the current solution (delete and retry).

copejon commented 5 years ago

That's a fair stopgap, I think.

jeffvance commented 4 years ago

Suppose the provisioner fails right after the Provision method returns success (and thus the provisioner has created the bucket and associated policy), before the lib has even written to the syncMap. In this case, once the provisioner is re-run, the lib will see OBC add events and would normally call Provision. If we continue down this path, then provisioners need to be idempotent, and this issue, to me, boils down to shrinking a window. Am I seeing it correctly?

copejon commented 4 years ago

If we wanted to close the failure window completely and have total idempotency, we'd need to be able to detect that the previous Provision() succeeded and that the bucket data was lost. One hitch here is that we now have a ready bucket in the object store, but we lost its name.

Even if bucketName is given in the OBC, it still may be that the name would have collided with another bucket, so we can't safely assume that this particular OBC owns that bucket.

If generateBucketName is given, though, then the OBC YAML can be created multiple times to generate multiple buckets, all with the same prefix. We wouldn't be able to differentiate which bucket belongs to which instance of the OBC.

Lastly, we'd need some new interface method - Recover()? - to ask provisioners to retrieve this data for us.

Given that we can't guarantee which OBC owns which bucket after a failure at a precise step in the process, I think that for now we work towards shrinking the window down to the gap between Provision() and the data-persistence operation.
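
Purely as a thought experiment, the Recover() hook floated above might look something like the following; it is not part of the library's existing provisioner interface:

// RecoverableProvisioner is a hypothetical, optional extension that would let
// the library ask a provisioner to re-fetch the bucket and credential data for
// an OBC whose earlier Provision() succeeded but whose results were lost
// before they could be persisted.
type RecoverableProvisioner interface {
    // Recover returns a fully populated OB (endpoint, auth, etc.) for the
    // bucket belonging to the given OBC, or an error if ownership cannot be
    // established.
    Recover(obc *v1alpha1.ObjectBucketClaim) (*v1alpha1.ObjectBucket, error)
}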

jeffvance commented 4 years ago

Re sync.Map from this article:

The results show the sync.Map implementation is systematically ~20% slower than the mutex solution. This should be expected, given the use of interface{} and the extra internal bookkeeping. I'd likely choose to use the mutex-based solution as library code should provide as little overhead as possible.
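
For comparison, the mutex-guarded alternative is small and avoids the interface{} conversions. A sketch, with illustrative names and the same assumed v1alpha1 import as the earlier sketch:

import (
    "sync"

    "github.com/kube-object-storage/lib-bucket-provisioner/pkg/apis/objectbucket.io/v1alpha1"
)

// obMap is a mutex-guarded "namespace/name" -> OB map, an alternative to
// sync.Map for the cache discussed in this thread.
type obMap struct {
    mu sync.RWMutex
    m  map[string]*v1alpha1.ObjectBucket
}

func newOBMap() *obMap {
    return &obMap{m: make(map[string]*v1alpha1.ObjectBucket)}
}

func (c *obMap) store(key string, ob *v1alpha1.ObjectBucket) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.m[key] = ob
}

func (c *obMap) load(key string) (*v1alpha1.ObjectBucket, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    ob, ok := c.m[key]
    return ob, ok
}

func (c *obMap) delete(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.m, key)
}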

copejon commented 4 years ago

Closing. Offline discussion has shifted towards a more CDI-like implementation.