servicebinding / spec

Specification for binding services to k8s workloads
https://servicebinding.io
Apache License 2.0

Need to specify what happens when resources are removed #170

Closed by arthurdm 2 years ago

arthurdm commented 3 years ago

The spec needs to explicitly call out what happens when:

  1. a bound service is no longer available (i.e. the resource was deleted)
  2. the bound application has been deleted
  3. the ServiceBinding CR has been deleted

For 1 and 2, what if the removal was a temporary pod restart?

scothis commented 3 years ago
  1. a bound service is no longer available (i.e. the resource was deleted)

The ServiceBinding should report the error on its Ready condition and remove the injected binding from the application workload.

  2. the bound application has been deleted

The ServiceBinding should report an error on its Ready condition. There's nothing to "unbind".
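For illustration, a minimal sketch of how a controller-runtime-based implementation might surface such an error on the Ready condition (the helper name and reason strings here are hypothetical; the spec does not mandate specific reasons):

    import (
        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // reportBindingError flips the Ready condition to False when the bound
    // service or workload can no longer be resolved. Reason/message are
    // illustrative only; the spec just requires the error to be reported.
    func reportBindingError(conditions *[]metav1.Condition, reason string, err error) {
        meta.SetStatusCondition(conditions, metav1.Condition{
            Type:    "Ready",
            Status:  metav1.ConditionFalse,
            Reason:  reason, // e.g. "ServiceNotFound" or "WorkloadNotFound" (hypothetical)
            Message: err.Error(),
        })
    }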

  3. the ServiceBinding CR has been deleted

The injected binding should be removed from the application workload. Implementations will likely use a finalizer to manage this state.
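A rough sketch of that finalizer flow using controller-runtime's controllerutil helpers, assuming a typical kubebuilder-style reconciler with an embedded client.Client; the finalizer name, the ServiceBinding type, and removeInjectedBinding are placeholders, not spec requirements:

    import (
        "context"

        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    )

    const bindingFinalizer = "servicebinding.example.com/finalizer" // hypothetical name

    func (r *ServiceBindingReconciler) reconcileFinalizer(ctx context.Context, binding *ServiceBinding) (ctrl.Result, error) {
        if binding.DeletionTimestamp.IsZero() {
            // Not being deleted: ensure the finalizer is present so we get a
            // chance to clean up the workload before the CR disappears.
            if controllerutil.AddFinalizer(binding, bindingFinalizer) {
                return ctrl.Result{}, r.Update(ctx, binding)
            }
            return ctrl.Result{}, nil
        }
        // Being deleted: remove the injected volume/env from the workload,
        // then drop the finalizer so deletion can complete.
        if controllerutil.ContainsFinalizer(binding, bindingFinalizer) {
            if err := r.removeInjectedBinding(ctx, binding); err != nil { // hypothetical helper
                return ctrl.Result{}, err
            }
            controllerutil.RemoveFinalizer(binding, bindingFinalizer)
            return ctrl.Result{}, r.Update(ctx, binding)
        }
        return ctrl.Result{}, nil
    }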

For 1 and 2, what if the removal was a temporary pod restart?

The joy of continuous, level-based reconciliation is that the system will heal itself. Hopefully the system doesn't thrash in the interim.


In general, if a ServiceBinding is created and then removed, the application workload should be identical before the binding was created and after it was removed. The one exception to this I've found in practice is the SERVICE_BINDING_ROOT env var. Since other tools may also depend on that value, we should leave it in place.
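For example, the unbind step might strip the env vars it injected but deliberately leave SERVICE_BINDING_ROOT alone. A sketch only; how an implementation tracks which vars it injected is up to the implementation:

    import corev1 "k8s.io/api/core/v1"

    // removeInjectedEnv removes the env vars a binding previously added to a
    // container, but keeps SERVICE_BINDING_ROOT since other tools may rely on it.
    func removeInjectedEnv(container *corev1.Container, injected map[string]bool) {
        kept := container.Env[:0]
        for _, e := range container.Env {
            if injected[e.Name] && e.Name != "SERVICE_BINDING_ROOT" {
                continue // removed: this one was ours
            }
            kept = append(kept, e)
        }
        container.Env = kept
    }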

pedjak commented 3 years ago

We should just be careful to avoid spec requirements that are impossible or hard to implement, or that, when implemented, degrade performance and/or drive up resource utilization.

Detecting changes on services and/or applications requires some sort of watching/webhooks, and on real production (large) clusters this has a huge impact on performance and resource utilization, because we would need, for example, to watch all Deployments/Secrets on the cluster even though the majority of them are not bound to any application. Our implementation tried that strategy, but it hit us hard on large real clusters; see https://github.com/redhat-developer/service-binding-operator/pull/903#issuecomment-803969568 for the graphs. The way out was to stop those watches.

After rethinking, we could not identify a real use case in which service bindings would not be removed together with the service/application.

baijum commented 3 years ago

Detecting changes on services and/or application requires some sort of watching/webhooks

I think the spec should make recommendations for the correctness of the behavior, but I don't see a need for any mandates here (SHOULD vs. MUST). So an implementation can choose to ignore the recommendation and still be compliant with the spec. (Ref. the last paragraph of the Notational Conventions section.)

BTW, an implementation can reconcile using a timer without watching resources. But the spec need not talk about implementation details.

If an implementation chooses to use watches, caching resources will be helpful. Please take a look at this IBM project: https://github.com/IBM/controller-filtered-cache (I think the same feature is available in controller-runtime)
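As a sketch of what that looks like with a recent controller-runtime (v0.15+; the cache options API has changed across releases), the cache can be restricted per object type, e.g. to Secrets carrying a binding-related label. The label key here is hypothetical:

    import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/labels"
        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/cache"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    func newManager() (ctrl.Manager, error) {
        // Only cache (and therefore watch) Secrets that carry the label,
        // instead of every Secret in the cluster.
        return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            Cache: cache.Options{
                ByObject: map[client.Object]cache.ByObject{
                    &corev1.Secret{}: {
                        Label: labels.SelectorFromSet(labels.Set{"example.com/service-binding": "true"}), // hypothetical label
                    },
                },
            },
        })
    }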

baijum commented 3 years ago

If the operator implementation uses controller-runtime (kubebuilder/operator-sdk uses it), there is a SyncPeriod option.

From the docs:

    // SyncPeriod determines the minimum frequency at which watched resources are
    // reconciled. A lower period will correct entropy more quickly, but reduce
    // responsiveness to change if there are many watched resources. Change this
    // value only if you know what you are doing. Defaults to 10 hours if unset.
    // there will be a 10 percent jitter between the SyncPeriod of all controllers
    // so that all controllers will not send list requests simultaneously.
    //
    // This applies to all controllers.
    //
    // A period sync happens for two reasons:
    // 1. To insure against a bug in the controller that causes an object to not
    // be requeued, when it otherwise should be requeued.
    // 2. To insure against an unknown bug in controller-runtime, or its dependencies,
    // that causes an object to not be requeued, when it otherwise should be
    // requeued, or to be removed from the queue, when it otherwise should not
    // be removed.
    //
    // If you want
    // 1. to insure against missed watch events, or
    // 2. to poll services that cannot be watched,
    // then we recommend that, instead of changing the default period, the
    // controller requeue, with a constant duration `t`, whenever the controller
    // is "done" with an object, and would otherwise not requeue it, i.e., we
    // recommend the `Reconcile` function return `reconcile.Result{RequeueAfter: t}`,
    // instead of `reconcile.Result{}`.

This provides yet another option for those who do not want to watch secondary resources.
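Concretely, a reconciler that polls instead of watching secondary resources can requeue itself on a fixed interval, as the quoted docs suggest. A sketch, with an arbitrary 5-minute period and a placeholder reconciler type:

    import (
        "context"
        "time"

        ctrl "sigs.k8s.io/controller-runtime"
    )

    func (r *ServiceBindingReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
        // ... fetch the ServiceBinding, re-read the bound service and workload,
        // and re-project the binding if anything changed ...

        // Poll again later instead of relying on watches of secondary resources.
        return ctrl.Result{RequeueAfter: 5 * time.Minute}, nil
    }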

pedjak commented 3 years ago

SyncPeriod determines the minimum frequency at which watched resources are reconciled. A lower period will correct entropy more quickly, but reduce responsiveness to change if there are many watched resources. Change this value only if you know what you are doing. Defaults to 10 hours if unset.

and by default it is set to 10 hours. This cannot be used for detecting changes on services/applications. If we lower the period, the pressure on the event queue will go up significantly on real clusters with a lot of workloads, and at the same time response times will go up.

If an implementation chooses to use watches, caching resources will be helpful

That is also default behavior by controller-runtime.

baijum commented 3 years ago

If an implementation chooses to use watches, caching resources will be helpful

That is also default behavior by controller-runtime.

I don't think filtered caching based on selectors can be enabled by default, because it requires input from the operator author. See this example usage (looks like a Red Hat project): https://github.com/open-cluster-management/multicluster-observability-operator/blob/02c4714980535234fa28e0ec872407203afd815d/main.go#L34

baijum commented 3 years ago

For 1 and 2, what if the removal was a temporary pod restart?

The joy of continuous, level-based reconciliation is that the system will heal itself.

+1

It's important that the spec make recommendations that account for voluntary and involuntary disruptions: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/

scothis commented 3 years ago

It's important that the spec make recommendations that account for voluntary and involuntary disruptions

@baijum can you expand on this?