Use Scylla Manager cluster labels for cluster reconciliation

rzetelskik commented 1 month ago

Description of your changes: Currently, when the manager controller fails to save the manager's cluster ID in the ScyllaCluster's status after registering the cluster with the manager, the cluster is deleted and recreated. As update conflicts are not a rare occurrence, this often causes many unnecessary recreation attempts. To make the reconciliation more robust, this PR changes this behaviour: instead of relying on the ID from the status, it uses labels from the manager state. A cluster is created with a label holding the owner's UID, which allows us to maintain and recognize the cluster's identity without relying on the status of our API resources. In turn, a cluster is only deleted when its owner UID label does not match the UID of the current owner, in order to avoid name collisions.
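For illustration only, the ownership check described above could be sketched roughly as follows. This is not the code from this PR: the label key, the `ManagerCluster` type, and the `decide` helper are hypothetical names, and treating a missing label as a mismatch is an assumption.

```go
package main

// ownerUIDLabel is a hypothetical label key; the key actually used by the operator may differ.
const ownerUIDLabel = "example.scylladb.com/owner-uid"

// ManagerCluster is a simplified, hypothetical stand-in for a cluster entry in the manager state.
type ManagerCluster struct {
	ID     string
	Name   string
	Labels map[string]string
}

// Decision describes what the controller should do with the cluster registered under a given name.
type Decision int

const (
	Create Decision = iota // nothing is registered under this name yet
	Keep                   // the registered cluster belongs to the current owner
	Delete                 // name collision: the registered cluster belongs to another owner
)

// decide derives the cluster's identity from its labels rather than from an ID
// stored in the owner's status. A missing label is treated as a mismatch (assumption).
func decide(existing *ManagerCluster, ownerUID string) Decision {
	if existing == nil {
		return Create
	}
	if existing.Labels[ownerUIDLabel] == ownerUID {
		return Keep
	}
	return Delete
}
```

With this shape, a failed status update no longer matters for identity: on the next iteration the controller finds the cluster by name, sees its own UID in the label, and keeps it.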

The labels are also extended with a managed hash label to align the cluster update logic with changes recently introduced in https://github.com/scylladb/scylla-operator/pull/2142.
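As a rough sketch of how a managed hash label can drive updates (an illustration under assumptions, not the implementation from this PR or from #2142): the label key, the `ClusterSpec` fields, and the choice of SHA-256 over a JSON encoding are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// managedHashLabel is a hypothetical label key.
const managedHashLabel = "example.scylladb.com/managed-hash"

// ClusterSpec is a hypothetical subset of the fields the controller manages.
type ClusterSpec struct {
	Name      string `json:"name"`
	Host      string `json:"host"`
	AuthToken string `json:"authToken"`
}

// managedHash returns a deterministic digest of the managed fields.
func managedHash(spec ClusterSpec) (string, error) {
	data, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

// needsUpdate reports whether the hash stored in the existing labels
// differs from the hash of the desired spec.
func needsUpdate(desired ClusterSpec, existingLabels map[string]string) (bool, error) {
	h, err := managedHash(desired)
	if err != nil {
		return false, err
	}
	return existingLabels[managedHashLabel] != h, nil
}
```

Comparing a single hash avoids field-by-field comparison against the manager state and ignores fields the controller does not manage.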

The logic related to creating "actions" is modified to produce only one cluster-related action at a time and then requeue, so that any further actions are scheduled only on the next iteration. The reasoning is to avoid errors in task actions when a cluster action is required first, e.g. when the auth token needs to be updated.
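The "one cluster action, then requeue" flow can be pictured with the following minimal sketch. The `Action` type and the `planActions` helper are hypothetical and only model the ordering, not the controller's real types.

```go
package main

// Action is a hypothetical, simplified description of a pending manager operation.
type Action struct {
	Kind string // e.g. "update-cluster", "update-task"
}

// planActions returns the actions to run in this iteration and whether to requeue.
// If any cluster-level action is pending, only the first one is run and the
// object is requeued; task actions are scheduled only once no cluster action remains.
func planActions(clusterActions, taskActions []Action) (run []Action, requeue bool) {
	if len(clusterActions) > 0 {
		return clusterActions[:1], true
	}
	return taskActions, false
}
```

This way a task action is never attempted against a cluster whose registration (for example its auth token) is still out of date.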

Additionally, the manager state computed in each reconciliation loop is reduced to only one cluster, since cluster names in manager are unique and propagating additional clusters to the state is redundant.

Unit tests are also extended to cover these scenarios and unified for consistency.

Which issue is resolved by this Pull Request: Resolves #1902

/kind bug
/priority important-soon
/cc

scylla-operator-bot[bot] commented 1 month ago

@rzetelskik: GitHub didn't allow me to request PR reviews from the following users: rzetelskik.

Note that only scylladb members and repo collaborators can review this PR, and authors cannot review their own PRs.


rzetelskik commented 1 month ago

/cc zimnx tnozicka

rzetelskik commented 1 month ago

@rzetelskik: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-parallel-clusterip | cf03afe | link | true | `/test e2e-gke-parallel-clusterip` |

Full PR test history. Your PR dashboard.

cluster provisioning failed
/retest

rzetelskik commented 1 month ago

@rzetelskik: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-parallel-clusterip | cf03afe | link | true | `/test e2e-gke-parallel-clusterip` |

Full PR test history. Your PR dashboard.

tls test flake, can't possibly be related? https://github.com/scylladb/scylla-operator/issues/2096#issuecomment-2425801644
/retest

scylla-operator-bot[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rzetelskik, tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/scylladb/scylla-operator/blob/master/OWNERS)~~ [tnozicka,zimnx]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

rzetelskik commented 1 month ago

@rzetelskik: The following test failed, say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gke-parallel-clusterip | cf03afe | link | true | `/test e2e-gke-parallel-clusterip` |

Full PR test history. Your PR dashboard.

/test images
/retest

scylladb / scylla-operator

Use Scylla Manager cluster labels for cluster reconciliation #2156