k8ssandra / cass-operator

The DataStax Kubernetes Operator for Apache Cassandra
https://docs.datastax.com/en/cass-operator/doc/cass-operator/cassOperatorGettingStarted.html
Apache License 2.0

many properties in CR are fragile, easy to result in broken cluster #518

Open hoyhbx opened 1 year ago

hoyhbx commented 1 year ago

What is missing?

Hello cass-operator developers,

We found that many properties in the CR can easily drive the cluster into a broken state if they are not handled carefully. For example, specifying a bad value for a property under spec.podTemplateSpec leaves the cluster with at least one replica unavailable. Other examples include spec.nodeSelector and spec.tolerations.

A concrete example is submitting the following CR, where the container entry has a name but no image:

spec:
  podTemplateSpec:
    spec:
      containers:
      - name: container-name
...

cass-operator updates the StatefulSet, which triggers a rolling update, but the new Pod cannot be created; it is rejected by the StatefulSet controller with the error message:

StatefulSet cluster1-test-cluster-default-sts failed error: Pod "cluster1-test-cluster-default-sts-2" is invalid: [spec.containers[0].image: Required value

This can cause severe consequences in production. We believe these are misoperation vulnerabilities, where the operator fails to reject a misoperation from the user. The operator uses controller-gen to automatically generate validations for many properties, but these static validations fall short for more complicated constraints; for example, rejecting an invalid nodeSelector requires knowledge of which nodes are available in the Kubernetes cluster.
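To make the gap concrete (an illustrative sketch only, not the actual generated cass-operator schema): controller-gen can emit static OpenAPI checks such as type, minimum, or enum constraints, but a rule like "the nodeSelector must match at least one schedulable node" depends on live cluster state and cannot be expressed in the CRD schema at all.

# Illustrative OpenAPI validation snippet of the kind controller-gen generates
# from kubebuilder markers (not copied from the real cass-operator CRD):
properties:
  size:
    type: integer
    minimum: 1                      # static check: expressible in the schema
  serverType:
    type: string
    enum: ["cassandra", "dse"]      # static check: expressible in the schema
  nodeSelector:
    type: object
    additionalProperties:
      type: string
    # "must match a node that actually exists and is schedulable" cannot be
    # encoded here; it would require querying the cluster at admission time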

We want to open this issue to discuss what you think the best practice for handling this is, or what functionality Kubernetes should provide to make this validation easier. Is there a way to prevent the bad operation from happening in the first place, or a way for cass-operator to automatically recognize that the StatefulSet is stuck and perform an automatic recovery? If you know of any practical code fixes for this issue, we are also happy to send a PR for them.
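One possible direction for rejecting such specs up front is a validating admission webhook that checks dynamic constraints (existing nodes, required container fields) before the change reaches the StatefulSet controller. A minimal sketch, assuming a webhook service named cass-operator-webhook-service in the cass-operator namespace and a /validate-cassandradatacenter path; these names are placeholders, not the operator's actual configuration:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: cassandradatacenter-validation        # placeholder name
webhooks:
- name: validate.cassandradatacenters.cassandra.datastax.com   # placeholder
  rules:
  - apiGroups: ["cassandra.datastax.com"]
    apiVersions: ["v1beta1"]
    operations: ["CREATE", "UPDATE"]
    resources: ["cassandradatacenters"]
  clientConfig:
    service:
      name: cass-operator-webhook-service     # assumed service name
      namespace: cass-operator                # assumed namespace
      path: /validate-cassandradatacenter     # assumed path
  admissionReviewVersions: ["v1"]
  sideEffects: None
  failurePolicy: Fail   # reject the bad CR instead of letting the rollout get stuck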

Why is this needed?

The operator is vulnerable to misoperations, which can cause severe consequences in production.


burmanm commented 1 year ago

I'll try to address some points here. First of all, we don't encourage using PodTemplateSpec, as there are a lot of things (as you noticed) that can break. Not to mention, we don't even use all the fields from PodTemplateSpec, so the behavior is often unspecified.

The other thing is, if PodTemplateSpec has missing validations, then those fixes should probably be done in the Kubernetes project and not here. We should not actively try to fight against Kubernetes' validations, as that might make things incompatible with future versions.

If even the StatefulSet controller accepts something it can't actually deploy, then it has an issue as well. Fixing infrastructure-layer issues here is really not the best option; the fixes should be done in the correct layer, which would benefit multiple projects.

As for degrading: yes, it would take down a single node in this case, but I don't think that drives the cluster itself into a broken state. When running large distributed applications, it's not uncommon to see a single node down. Not to mention, there's no difficulty in healing from this situation: just by fixing the CassandraDatacenterSpec you would be back to normal operation.
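For example (a sketch; the image value below is a placeholder), re-applying the spec with the missing field supplied, or removing the offending container entry, lets the stuck rolling update proceed:

spec:
  podTemplateSpec:
    spec:
      containers:
      - name: container-name
        image: example.org/placeholder:tag   # supplying the missing image unblocks the StatefulSet rollout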

cass-operator itself can't roll back the CassandraDatacenterSpec; the operator should stay away from modifying the CR (we've actively refactored features that did this), since some other application, such as Ansible, ArgoCD, or another orchestration tool, might want to maintain the state of that CR. The user would notice from the status of the CassandraDatacenter that deploying the changes has failed, so this shouldn't go unnoticed.
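Roughly, the failure surfaces in the resource status along these lines (illustrative, not verbatim output; field and condition names vary by cass-operator version):

status:
  cassandraOperatorProgress: Updating   # stays in Updating while the rollout is stuck
  conditions:
  - type: Ready
    status: "False"                     # does not return to Ready until the spec is fixed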

I think one of the issues you're referring to is https://github.com/kubernetes/kubernetes/issues/67250. If that were fixed, then this issue would be fixed as well, other than that our status might still be wrong in that case. Sadly, I think this type of behavior is broken in Kubernetes itself and could use some help.

olim7t commented 1 year ago

@burmanm is there anything we want to follow up on, or should this be closed?

> we don't encourage using PodTemplateSpec, as there's a lot of things - as you noticed - that could break things. Not to mention we don't even use all the fields from PodTemplateSpec, so the behavior is unspecified often.

It would be nice to have those explanations in the PodTemplateSpec comment, with a list of which fields are used.