redpanda-data / redpanda-operator


Calculate Redpanda CRD status conditions based on StatefulSet/Deployment status and unblock HelmReleases on job completion #227

Closed andrewstucki closed 2 months ago

andrewstucki commented 2 months ago

This commit does a few things:

  1. It calculates the Ready status condition of Redpanda objects based on statefulset and deployment status.
  2. It removes the wait during helm applies. NOTE: this means we no longer know whether the post-upgrade hooks have run when the HelmRelease says it's done.
  3. It removes most of the superfluous RetryAfter/Retry usage in the reconciler since the For/Owns watch setups will retrigger a reconciliation on a change in any dependent resource.
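To make item 1 concrete, here is a minimal sketch of how readiness could be derived from a workload's status. It is not the code in this commit: `workloadStatus` and `workloadReady` are hypothetical names, and the struct is a trimmed-down stand-in for the generation and replica-count fields found on `apps/v1` StatefulSets and Deployments, so the example runs without any Kubernetes dependencies.

```go
package main

import "fmt"

// workloadStatus is a hypothetical, simplified stand-in for the fields of a
// StatefulSet or Deployment that matter when deciding readiness.
type workloadStatus struct {
	Name               string
	Generation         int64 // metadata.generation
	ObservedGeneration int64 // status.observedGeneration
	DesiredReplicas    int32 // spec.replicas
	UpdatedReplicas    int32 // status.updatedReplicas
	ReadyReplicas      int32 // status.readyReplicas
}

// workloadReady reports whether the workload's controller has observed the
// latest spec and every desired replica is both updated and ready. The
// returned string explains the first check that failed, if any.
func workloadReady(w workloadStatus) (bool, string) {
	switch {
	case w.ObservedGeneration < w.Generation:
		return false, fmt.Sprintf("%s: latest spec not yet observed", w.Name)
	case w.UpdatedReplicas < w.DesiredReplicas:
		return false, fmt.Sprintf("%s: %d/%d replicas updated", w.Name, w.UpdatedReplicas, w.DesiredReplicas)
	case w.ReadyReplicas < w.DesiredReplicas:
		return false, fmt.Sprintf("%s: %d/%d replicas ready", w.Name, w.ReadyReplicas, w.DesiredReplicas)
	}
	return true, fmt.Sprintf("%s: all %d replicas ready", w.Name, w.DesiredReplicas)
}

func main() {
	// A three-broker StatefulSet where one pod has not become ready yet.
	sts := workloadStatus{
		Name:               "redpanda",
		Generation:         2,
		ObservedGeneration: 2,
		DesiredReplicas:    3,
		UpdatedReplicas:    3,
		ReadyReplicas:      2,
	}
	ready, msg := workloadReady(sts)
	fmt.Println(ready, msg)
}
```

Checking the observed generation first avoids reporting a stale "ready" from the previous rollout before the workload controller has reacted to a spec change.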

Number 3 is a bit of hygiene, but 1 and 2 are necessary because of the way we currently block on any sort of helm operation. After reviewing the internal flux controller code, it looks like once a helm operation is kicked off, the only ways to cancel it before it succeeds are (a) for the operation to time out (after 15 minutes in our case), or (b) to forcibly restart the operator pod.

We currently have two issues that are caused by this behavior:

  1. If an upgrade takes longer than 15 minutes due to slowness in our job containers (typically when a cluster is under load?) we won't ever succeed.
  2. If someone makes a bad cluster configuration change that breaks one of the pods (say, by setting the resource allocations to 40TB of RAM), leaving it either unschedulable or in a restart loop, then the helm upgrade waits on an operation that will never complete, so any subsequent changes to the CRD are blocked for the full 15 minute timeout.

By removing the wait operations in the HelmRelease, we just fire and forget the upgrades, which is actually a lot more in line with what we will eventually want to do when we remove flux altogether: just apply resources and check statuses to back-propagate a status onto the Redpanda CRD. As a result, both 1 and 2 are solved, but we now need a better gauge of when an installation is actually complete; namely, we should make sure that the pods it's creating are actually marked as "Ready".
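As a rough illustration of the "check statuses and back-propagate" idea, the sketch below folds the readiness of several dependent workloads into a single Ready condition. The `condition` struct is a hypothetical stand-in for `metav1.Condition`, and the reason strings (`NoWorkloads`, `WorkloadNotReady`, `AllWorkloadsReady`) are made up for this example rather than taken from the operator.

```go
package main

import "fmt"

// condition is a hypothetical, simplified stand-in for metav1.Condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// component is one dependent workload (for example the broker StatefulSet or
// the console Deployment) together with its current readiness.
type component struct {
	Name  string
	Ready bool
}

// readyCondition reports Ready=True only when at least one workload exists
// and every workload is ready. Otherwise it names the first blocker.
func readyCondition(components []component) condition {
	if len(components) == 0 {
		return condition{
			Type:    "Ready",
			Status:  "False",
			Reason:  "NoWorkloads", // assumed reason name
			Message: "no dependent workloads found yet",
		}
	}
	for _, c := range components {
		if !c.Ready {
			return condition{
				Type:    "Ready",
				Status:  "False",
				Reason:  "WorkloadNotReady", // assumed reason name
				Message: c.Name + " is not ready",
			}
		}
	}
	return condition{
		Type:    "Ready",
		Status:  "True",
		Reason:  "AllWorkloadsReady", // assumed reason name
		Message: fmt.Sprintf("%d workloads ready", len(components)),
	}
}

func main() {
	cond := readyCondition([]component{
		{Name: "statefulset/redpanda", Ready: true},
		{Name: "deployment/redpanda-console", Ready: false},
	})
	fmt.Printf("%s=%s (%s): %s\n", cond.Type, cond.Status, cond.Reason, cond.Message)
}
```

Because a function like this never blocks, a bad change simply leaves the condition at False until a later reconciliation sees healthy pods, instead of holding up subsequent changes for the length of a timeout.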

Fixes both K8S-324/K8S-323/https://github.com/redpanda-data/redpanda-operator/issues/196 and K8S-341/https://github.com/redpanda-data/redpanda-operator/issues/217