Timebox the reconciler cycle to be able to fix deployment via the CRD rather than by direct update

andrey-dubnik commented 2 years ago

What is missing?

Currently, when a change to the K8ssandraCluster configuration breaks the deployment (e.g. an excessive CPU request prevents a pod from starting and leaves it in Pending), the operator does not apply a subsequently updated K8ssandraCluster definition that contains the fix. To recover, we have to edit the StatefulSet (STS) definition directly and reduce the requested CPU so that the pod can start; only after that does the operator reconcile the new configuration.

Why do we need it?

There are situations where a pod won't start, and we would like to be able to fix that via the K8ssandraCluster CRD rather than by editing the STS directly.

Environment

K8ssandra Operator version: v1.2.1

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: K8OP-218

burmanm commented 2 years ago

For context, cass-operator has a forceRackUpgrade parameter that allows fixing this type of issue. Perhaps we should look into something similar (but not that exact approach, since with it the operator modifies the spec).

jsanda commented 2 years ago

@andrey-dubnik Since you mentioned having to update the STS, I assume you are referring to Cassandra pods. In this scenario, k8ssandra-operator should apply the change to the underlying CassandraDatacenter. cass-operator, however, will not apply the changes to the STS until all the Cassandra pods are in the ready state. That's been the behavior in cass-operator for as long as I have been involved with the project. With that said, I am not a fan of it and think we should consider changing it. @burmanm wdyt?

k8ssandra / k8ssandra-operator

Timebox the reconciler cycle to be able to fix deployment via the CRD rather than by direct update #742