andrewstucki closed this issue 2 months ago
New flaky test:

```
--- FAIL: kuttl/harness/centralized-configuration-drift (388.05s)
2024-09-17 20:21:01 UTC | case.go:399: failed in step 3-delete-redpandas
2024-09-17 20:21:01 UTC | case.go:401: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
```
Redpanda deletion doesn't work at the moment. Since the kuttl artifacts aren't generated, it's hard to say what the operator logs look like.
I'm wondering if it actually has something to do with the test timeout. I just switched some of the code to call the health endpoint directly rather than going through the watchClusterHealth helper, because if a node hasn't been marked as down yet (even though we already know it needs to be decommissioned, since the replica count doesn't match the number of nodes we know about), that function blocks for up to 60 seconds: https://github.com/redpanda-data/redpanda-operator/blob/4182ff503ff70a565016b08c714867df81c76d62/src/go/k8s/internal/controller/redpanda/redpanda_decommission_controller.go#L659
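Roughly, the idea is to make a single bounded request against the admin API's cluster health overview instead of waiting inside a watch loop. Here is a minimal sketch in Go, assuming the `/v1/cluster/health_overview` admin endpoint and a plain HTTP client; the `getHealthOverview` helper and the response field names are illustrative, not the operator's actual client wrapper:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// healthOverview loosely mirrors the /v1/cluster/health_overview payload.
// Field names here are assumptions for illustration.
type healthOverview struct {
	IsHealthy bool  `json:"is_healthy"`
	AllNodes  []int `json:"all_nodes"`
	NodesDown []int `json:"nodes_down"`
}

// getHealthOverview performs a single request with a short timeout rather
// than blocking until the cluster reports a node as down.
func getHealthOverview(ctx context.Context, adminURL string) (*healthOverview, error) {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, adminURL+"/v1/cluster/health_overview", nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var overview healthOverview
	if err := json.NewDecoder(resp.Body).Decode(&overview); err != nil {
		return nil, err
	}
	return &overview, nil
}

func main() {
	overview, err := getHealthOverview(context.Background(), "http://redpanda-0.redpanda:9644")
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	// The caller can now decide what to do with a node that is known to need
	// decommissioning even if it hasn't been reported as down yet.
	fmt.Printf("healthy=%v nodes=%v down=%v\n", overview.IsHealthy, overview.AllNodes, overview.NodesDown)
}
```

The point is that the decision about an unhealthy-but-not-yet-down node moves to the caller instead of blocking for up to 60 seconds inside the helper.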
So, my previous changes to unblock the Helm chart installation process broke our e2e decommissioning test by introducing a race condition into our test assertions. With the decommission controller, there was a small window in which decommissioning had not yet happened, so the nodes were technically unhealthy, yet they were not yet marked as "unhealthy" in the statefulset.
That meant the Redpanda CRD we were creating in our test was getting marked as "ready" for a split second when it really wasn't, and we were prematurely triggering a second scale-down operation. This fixes the test by setting a status condition whenever the controller detects that the statefulset for our Redpanda cluster still needs to decommission dead nodes. As a result, the CRD is never marked as "ready" until decommissioning has fully finished.