shipwright-io / build

Shipwright - a framework for building container images on Kubernetes
https://shipwright.io
Apache License 2.0

Review the reconciliation logic to prevent system overload #143

Open SaschaSchwarze0 opened 4 years ago

SaschaSchwarze0 commented 4 years ago

We just had the situation on our development cluster that two build custom resources in the system were already defining the service account as an object. Due to a mistake during deployment, an old build operator was expecting a string there.

The result is that the reconciliation happens endlessly. And this is just one example; another reason for endless reconciliation is a bad reference to credentials. To prevent a system overload from these reconciliations, we should do two things:

  1. Apply a delay time when reconciling, see discussion at https://github.com/redhat-developer/build/pull/109#issuecomment-614079616
  2. Investigate whether we can stop the reconciliation process if the user does not fix the root cause within a certain time (maybe one hour) and put the custom resource into some "permanently failed" state

qu1queee commented 4 years ago

This is not a bug; it works as designed. It would be good to have numbers on the potential performance degradation when reconciles never stop. Adding this to #174 for a short discussion.

sbose78 commented 4 years ago

Agreed, this constant reconciliation paradigm does feel a little chatty, though in general it isn't expensive. Nevertheless, it would be good to see what the resource footprint is.

qu1queee commented 4 years ago

@SaschaSchwarze0 do you know if we have any internal results around this? Or are these metrics (system load under many repeated reconciles) something we could ask Emily or someone similar to gather for us?

SaschaSchwarze0 commented 4 years ago

@qu1queee no, I do not have results. But agree, would be interesting to see the difference between a performance run on a clean system vs one where 1000 (just a random number) build runs are reconciling because of some failure.

otaviof commented 4 years ago

> The result is that the reconciliation happens endlessly. And this is just one example; another reason for endless reconciliation is a bad reference to credentials. To prevent a system overload from these reconciliations, we should do two things:
>
> 1. Apply a delay time when reconciling, see discussion at [#109 (comment)](https://github.com/redhat-developer/build/pull/109#issuecomment-614079616)

Good approach! This will ease the pressure on the API server when failed attempts are requeued.

> 2. Investigate whether we can stop the reconciliation process if the user does not fix the root cause within a certain time (maybe one hour) and put the custom resource into some "permanently failed" state

An example of a permanently failed state can be taken from service-binding-operator:

```go
// NoRequeue returns error without requeue flag.
func NoRequeue(err error) (reconcile.Result, error) {
    return reconcile.Result{}, err
}
```

Additionally, we should define the different result scenarios as dedicated functions, to inform the Kubernetes API-Server how to proceed, and re-use this behavior throughout the operator.

As a practical example, please consider the methods defined here.