Open jsanda opened 2 years ago
Ready itself is set to true when initial startup has finished, or when waking up from Stopped. It's not changed if some other operation is in progress, but this method explains pretty well all the transitions after which one could consider things "done":
https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/reconcile_racks.go#L2276
However, I'd be careful with doing this type of function in k8ssandra-operator, as cass-operator updates might not follow some restrictions invented here.
And I don't think pods are something this should follow. This issue is missing one thing: why. If the DC has been started, what operations would not be possible? Within one DC, cass-operator takes care of the issues, so which operations could be done incorrectly in a multi-DC configuration?
And even more, if there are a lot of nodes, we should assume the down-count to be >0, and all the operations we do should still work. If we have to assume all the nodes are always up before Cassandra operations work, we might want to fix Cassandra itself, since that's no longer a very HA solution.
Background
The `cassandra` package has a `DatacenterReady` function that is used in various places. Here it is:

The `DatacenterReady` condition can be a bit misleading because it can be true when Cassandra pods are not ready. In fact, there can be no Cassandra pods at all that are ready (as determined by the cassandra container's readiness probe) while the condition is true. We often see this happen with single-node clusters in automated tests. The `DatacenterReady` condition really means that all Cassandra pods have, at some point, been started and reached the ready state.

I believe `DatacenterReady` is false in two scenarios (burmanm please correct me if I am wrong). It is false from when the CassandraDatacenter is first created until all pods become ready. It is also set to false when the CassandraDatacenter is stopped. That's it.

The `CassandraOperatorProgress` property is set to `ProgressUpdating` in several scenarios when cass-operator has work to do. It is set to `ProgressReady` when cass-operator gets through its whole reconciliation process and doesn't have any more work to do.

What's needed
As mentioned previously, the `DatacenterReady` function is used in several places. In some cases, that is exactly what is needed. For example, when the operator is provisioning a new K8ssandraCluster that consists of two CassandraDatacenters (dc1 and dc2), it makes sense to call `DatacenterReady` for dc1 before creating dc2.

In other situations, e.g., applying schema changes, `DatacenterReady` may not be adequate. `DatacenterReady` only checks a single CassandraDatacenter, and in the case of schema changes we need to consider all DCs. Currently the operator applies schema changes for Stargate and Reaper. It does so only after all CassandraDatacenters are ready according to the `DatacenterReady` function.

We may want a `CassandraReady` function that checks all CassandraDatacenters. In addition to checking the status of the CassandraDatacenters, we might also want to check all of the Cassandra pods and make sure that each of the `cassandra` containers is ready. I'm not sure if there are situations where `DatacenterReady` could return true and a `cassandra` container isn't ready. This needs to be investigated. If it is possible, we definitely want to check the pods.

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: K8OP-115
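To make the `CassandraReady` proposal concrete, here is a minimal sketch of what such a check could look like. It is not the operator's code: `Datacenter`, `Pod`, and their fields are hypothetical stand-ins for the real CassandraDatacenter and pod objects, and the per-DC check assumes readiness means the `DatacenterReady` condition is true and the progress is `ProgressReady`. The pod names are made up for illustration.

```go
package main

import "fmt"

// ProgressState is a hypothetical stand-in for the CassandraOperatorProgress value.
type ProgressState string

const (
	ProgressUpdating ProgressState = "Updating"
	ProgressReady    ProgressState = "Ready"
)

// Pod is a simplified, hypothetical view of a Cassandra pod.
type Pod struct {
	Name                    string
	CassandraContainerReady bool // result of the cassandra container's readiness probe
}

// Datacenter is a simplified, hypothetical view of a CassandraDatacenter.
type Datacenter struct {
	Name           string
	ReadyCondition bool          // the DatacenterReady condition
	Progress       ProgressState // the CassandraOperatorProgress property
	Pods           []Pod
}

// datacenterReady is the assumed per-DC check: the condition is true and
// cass-operator reports no remaining work.
func datacenterReady(dc Datacenter) bool {
	return dc.ReadyCondition && dc.Progress == ProgressReady
}

// CassandraReady checks every datacenter and, for datacenters that pass the
// per-DC check, every pod's cassandra container. It returns whether everything
// is ready plus the names of whatever is not.
func CassandraReady(dcs []Datacenter) (bool, []string) {
	var notReady []string
	for _, dc := range dcs {
		if !datacenterReady(dc) {
			notReady = append(notReady, dc.Name)
			continue
		}
		for _, pod := range dc.Pods {
			if !pod.CassandraContainerReady {
				notReady = append(notReady, dc.Name+"/"+pod.Name)
			}
		}
	}
	return len(notReady) == 0, notReady
}

func main() {
	dcs := []Datacenter{
		// dc1 passes the per-DC check, but its only pod is not ready
		// (the single-node case described in the Background section).
		{Name: "dc1", ReadyCondition: true, Progress: ProgressReady,
			Pods: []Pod{{Name: "dc1-pod-0", CassandraContainerReady: false}}},
		{Name: "dc2", ReadyCondition: true, Progress: ProgressReady,
			Pods: []Pod{{Name: "dc2-pod-0", CassandraContainerReady: true}}},
	}
	ok, blocked := CassandraReady(dcs)
	fmt.Println(ok, blocked) // prints: false [dc1/dc1-pod-0]
}
```

Returning the list of blocking datacenters and pods, rather than a bare boolean, is a design choice in this sketch so a caller could log why schema changes are being held back. Whether a strict all-pods-ready requirement is appropriate at all is the open question raised in the comments.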