redpanda-data / redpanda-operator


Fix decommission process with Redpanda status #239

Closed andrewstucki closed 2 months ago

andrewstucki commented 2 months ago

So, my previous changes to unblock the Helm chart installation process broke our e2e decommissioning test by introducing a race condition into our test assertions. With the decommission controller, there was a small window in which decommissioning had not yet happened, so the nodes were technically unhealthy but had not yet been marked as "unhealthy" within the StatefulSet.

What that meant is that the Redpanda CRD we were creating in our test was getting marked as "ready" for a split second when it really wasn't, and we were prematurely triggering a second scale-down operation.

This fixes the test by setting a status condition when the controller detects that the StatefulSet for our Redpanda cluster needs to decommission dead nodes. As a result, the CRD is never marked as "ready" until decommissioning has fully finished.

RafalKorepta commented 2 months ago

New flaky test --- FAIL: kuttl/harness/centralized-configuration-drift (388.05s)

RafalKorepta commented 2 months ago

https://buildkite.com/redpanda/redpanda-operator/builds/2628#0192017b-8a93-47bf-9a84-7f18a53f86b3/352-1021

2024-09-17 20:21:01 UTC | case.go:399: failed in step 3-delete-redpandas
2024-09-17 20:21:01 UTC | case.go:401: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline

Redpanda deletion doesn't work ATM. As kuttl artifacts are not generated, it's hard to say what the operator logs look like.

andrewstucki commented 2 months ago

I'm wondering if it actually has something to do with the test timeout. I just switched up some of the code to call the health endpoint directly rather than going through the watchClusterHealth helper, because if a node hasn't been marked as down yet (even though we know it needs to be decommissioned, since the replica count doesn't match the number of nodes we know about), this function can block for up to 60 seconds: https://github.com/redpanda-data/redpanda-operator/blob/4182ff503ff70a565016b08c714867df81c76d62/src/go/k8s/internal/controller/redpanda/redpanda_decommission_controller.go#L659