Closed rzetelskik closed 1 month ago
@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.
Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.
/cc zimnx tnozicka
@rzetelskik: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-parallel-clusterip | cf03afe | link | true | `/test e2e-gke-parallel-clusterip` |

Full PR test history. Your PR dashboard.
Cluster provisioning failed.
/retest
TLS test flake, can't possibly be related? https://github.com/scylladb/scylla-operator/issues/2096#issuecomment-2425801644
/retest
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rzetelskik, tnozicka, zimnx
The full list of commands accepted by this bot can be found here.
The pull request process is described here
/test images
/retest
Description of your changes: Currently, when the manager controller fails to save the manager's cluster ID in the ScyllaCluster's status after registering the cluster with the manager, the cluster is deleted and recreated. Because update conflicts are common, this often causes many unnecessary recreation attempts. To make reconciliation more robust, this PR changes that behaviour: instead of relying on the ID from the status, it uses labels from the manager state. A cluster is created with a label holding the owner's UID, which lets us maintain and recognize the cluster's identity without relying on the status of our API resources. In turn, a cluster is only deleted when its owner UID label does not match the UID of the current owner, in order to avoid name collisions.
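The label-based identity check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the label key `ownerUIDLabel`, the `managerCluster` type, and the `decide` function are all hypothetical names chosen for the example.

```go
package main

import "fmt"

// ownerUIDLabel is a hypothetical label key; the key used by the operator may differ.
const ownerUIDLabel = "example.scylladb.com/owner-uid"

// managerCluster is a simplified, hypothetical view of a cluster registered in manager.
type managerCluster struct {
	ID     string
	Name   string
	Labels map[string]string
}

type decision int

const (
	createCluster decision = iota
	keepCluster
	deleteCluster
)

// decide picks what to do with the manager cluster found under the owner's name.
// A cluster is only deleted when its owner UID label does not match the current
// owner, i.e. when a different object registered a cluster with the same name.
func decide(existing *managerCluster, ownerUID string) decision {
	if existing == nil {
		return createCluster
	}
	// Reading from a nil map is safe in Go and yields the empty string,
	// so a cluster without labels is treated as not owned by us.
	if existing.Labels[ownerUIDLabel] != ownerUID {
		return deleteCluster
	}
	return keepCluster
}

func main() {
	c := &managerCluster{
		ID:     "some-manager-id",
		Name:   "scylla/basic",
		Labels: map[string]string{ownerUIDLabel: "uid-1"},
	}
	fmt.Println(decide(nil, "uid-1") == createCluster)
	fmt.Println(decide(c, "uid-1") == keepCluster)
	fmt.Println(decide(c, "uid-2") == deleteCluster)
}
```

Note that the decision never consults the ScyllaCluster status, so a failed status update cannot trigger a delete-and-recreate cycle in this sketch.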
The labels are also extended with a managed hash label to align the cluster update logic with changes recently introduced in https://github.com/scylladb/scylla-operator/pull/2142.
The logic for creating "actions" is modified to produce a single cluster-related action at a time and then requeue, so that any further actions are only scheduled on the next iteration. The reasoning is to avoid errors in task actions when a cluster action is required first, e.g. when the auth token needs to be updated.
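A rough sketch of the "one cluster action, then requeue" flow is shown below. The `action` and `clusterState` types and the `planActions` function are hypothetical and heavily simplified; the actual controller code in this PR may be structured differently.

```go
package main

import "fmt"

// action is a hypothetical stand-in for a unit of work to run against manager.
type action struct {
	Name string
}

// clusterState is a hypothetical summary of what the reconciler observed.
type clusterState struct {
	Registered  bool // the cluster exists in manager under the current owner
	NeedsUpdate bool // e.g. the auth token has to be updated first
}

// planActions returns at most one cluster-related action and asks for a requeue
// when one is needed. Task actions are only returned once no cluster action is
// pending, so they never run against a stale cluster.
func planActions(s clusterState, taskActions []action) (actions []action, requeue bool) {
	switch {
	case !s.Registered:
		return []action{{Name: "create-cluster"}}, true
	case s.NeedsUpdate:
		return []action{{Name: "update-cluster"}}, true
	default:
		return taskActions, false
	}
}

func main() {
	tasks := []action{{Name: "update-repair-task"}}

	actions, requeue := planActions(clusterState{Registered: true, NeedsUpdate: true}, tasks)
	fmt.Println(actions, requeue)

	actions, requeue = planActions(clusterState{Registered: true}, tasks)
	fmt.Println(actions, requeue)
}
```

The trade-off in this sketch is one extra reconciliation pass whenever a cluster action is pending, in exchange for task actions always seeing an up-to-date cluster.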
Additionally, the manager state computed in each reconciliation loop is reduced to a single cluster: cluster names in manager are unique, so propagating additional clusters to the state is redundant.
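Reducing the state to a single cluster could look roughly like the lookup below. The `managerCluster` type and `findCluster` helper are hypothetical, and the sketch assumes the list of clusters has already been fetched from manager.

```go
package main

import "fmt"

// managerCluster is a simplified, hypothetical view of a cluster registered in manager.
type managerCluster struct {
	ID   string
	Name string
}

// findCluster returns the cluster with the given name, or nil if there is none.
// It relies on cluster names being unique in manager, so the first match is the
// only match and the remaining clusters never need to enter the state.
func findCluster(clusters []managerCluster, name string) *managerCluster {
	for i := range clusters {
		if clusters[i].Name == name {
			return &clusters[i]
		}
	}
	return nil
}

func main() {
	clusters := []managerCluster{
		{ID: "id-a", Name: "scylla/basic"},
		{ID: "id-b", Name: "other/cluster"},
	}
	if c := findCluster(clusters, "scylla/basic"); c != nil {
		fmt.Println(c.ID)
	}
	fmt.Println(findCluster(clusters, "missing/cluster") == nil)
}
```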
Unit tests are also extended to cover these scenarios and unified for consistency.
Which issue is resolved by this Pull Request: Resolves #1902
/kind bug
/priority important-soon
/cc