scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

Manager controller recreates clusters when manager cluster ID is missing from status #1902

Open rzetelskik opened 5 months ago

rzetelskik commented 5 months ago

What happened?

Currently, the manager cluster ID is saved in the ScyllaCluster's status on cluster creation. If the controller fails to update the ScyllaCluster's status, if the ID is lost, or if an older generation of the object is reconciled, the controller deletes the existing cluster from the manager state and creates it again.

The issue and its root cause are similar to https://github.com/scylladb/scylla-operator/issues/1752.

This not only adds superfluous work, but may also introduce incorrect behavior, e.g. around task retention.
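To make the failure mode concrete, here is a minimal sketch of the behavior described above. It is not the operator's actual code: `FakeManager`, `ManagerCluster`, and `SyncNaive` are hypothetical names, and the manager is modeled as an in-memory map. The assumption is that, when the ID from status is unknown, the controller removes any cluster registered under the same name and registers a new one.

```go
package main

import "strconv"

// ManagerCluster is a hypothetical, simplified record of a cluster in manager state.
type ManagerCluster struct {
	ID   string
	Name string
}

// FakeManager is a hypothetical in-memory stand-in for the manager API.
// It counts calls so the superfluous work is visible.
type FakeManager struct {
	clusters map[string]ManagerCluster // keyed by manager cluster ID
	nextID   int
	Creates  int
	Deletes  int
}

func NewFakeManager() *FakeManager {
	return &FakeManager{clusters: map[string]ManagerCluster{}}
}

// Create registers a cluster and returns its newly generated ID.
func (m *FakeManager) Create(name string) string {
	m.nextID++
	id := "id-" + strconv.Itoa(m.nextID)
	m.clusters[id] = ManagerCluster{ID: id, Name: name}
	m.Creates++
	return id
}

// Delete removes a cluster (and, in a real manager, everything attached to it).
func (m *FakeManager) Delete(id string) {
	delete(m.clusters, id)
	m.Deletes++
}

// SyncNaive returns the manager cluster ID that should be stored in status.
// statusID is whatever ID the reconciled object carries; it is empty when the
// status update failed, the ID was lost, or a stale object is being reconciled.
func SyncNaive(m *FakeManager, name, statusID string) string {
	if _, ok := m.clusters[statusID]; ok {
		return statusID // ID is known and valid: nothing to do
	}
	// ID is missing: any cluster with the same name is treated as foreign,
	// so it is deleted and registered again under a new ID.
	for id, c := range m.clusters {
		if c.Name == name {
			m.Delete(id)
		}
	}
	return m.Create(name)
}
```

In this sketch, calling `SyncNaive(m, "ns/cluster", "")` twice in a row (as would happen if the first status update never landed) produces two different IDs, one delete, and two creates.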

/priority important-soon /assign

What did you expect to happen?

Clusters in manager state should not be deleted once they've been created successfully.

How can we reproduce it (as minimally and precisely as possible)?

n/a

Scylla Operator version

master

Kubernetes platform name and version

n/a

Please attach the must-gather archive.

n/a

Anything else we need to know?

Unfortunately, without the ID we currently have no reliable way of telling whether a cluster existing in manager state corresponds to a Kubernetes object. This should be easy to fix with https://github.com/scylladb/scylla-manager/issues/3219, since we'll be able to save metadata in manager state and so "reclaim" the cluster despite not having its ID.
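A rough sketch of what "reclaiming" could look like, assuming the manager lets us attach key/value metadata to a cluster. Everything here is hypothetical: the `Labels` field, the `ownerUIDLabel` key, and the in-memory `FakeManager` are illustrations only, not the manager's or the operator's real API.

```go
package main

import "strconv"

// ownerUIDLabel is a hypothetical metadata key linking a manager cluster
// to the UID of the Kubernetes object that owns it.
const ownerUIDLabel = "example.invalid/owner-uid"

// ManagerCluster is a hypothetical, simplified record of a cluster in manager state.
type ManagerCluster struct {
	ID     string
	Name   string
	Labels map[string]string
}

// FakeManager is a hypothetical in-memory stand-in for the manager API.
type FakeManager struct {
	clusters map[string]ManagerCluster // keyed by manager cluster ID
	nextID   int
	Creates  int
}

func NewFakeManager() *FakeManager {
	return &FakeManager{clusters: map[string]ManagerCluster{}}
}

// Create registers a cluster together with its metadata and returns its ID.
func (m *FakeManager) Create(name string, labels map[string]string) string {
	m.nextID++
	id := "id-" + strconv.Itoa(m.nextID)
	m.clusters[id] = ManagerCluster{ID: id, Name: name, Labels: labels}
	m.Creates++
	return id
}

// SyncWithReclaim returns the manager cluster ID that should be stored in status.
// It never deletes: a missing or unknown statusID leads to a metadata lookup first.
func SyncWithReclaim(m *FakeManager, name, ownerUID, statusID string) string {
	if _, ok := m.clusters[statusID]; ok {
		return statusID // ID is known and valid
	}
	// ID is missing: look for a cluster already owned by this object.
	for id, c := range m.clusters {
		if c.Labels[ownerUIDLabel] == ownerUID {
			return id // reclaim the existing cluster instead of recreating it
		}
	}
	// Nothing to reclaim: register the cluster and record ownership.
	return m.Create(name, map[string]string{ownerUIDLabel: ownerUID})
}
```

With this shape, reconciling the same object twice without an ID in status returns the same manager cluster ID and performs a single create, while an object with a different UID gets its own cluster.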

scylla-operator-bot[bot] commented 2 months ago

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

You can:

/lifecycle stale

scylla-operator-bot[bot] commented 1 month ago

/lifecycle rotten

rzetelskik commented 1 month ago

/remove-lifecycle rotten /triage accepted