Open Sunidhi-Gaonkar1 opened 1 year ago
There should be some sort of indication from Kubernetes what caused the termination of a pod. We have no special handling for killing a pod (especially with index 0). The only reason we would kill a pod is if the Cassandra container itself becomes stuck (as in, loses readiness).
Other than that, it would take a rolling restart or a decommission to start deleting pods. But in those cases, the order would be different.
I recommend checking the logs of containers to identify if there's a reason why Cassandra is failing, or if Kubernetes has reasons to delete the pod otherwise (like rescheduling).
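As a rough sketch of what to look for on the Kubernetes side: the pod status records why a container last terminated under `status.containerStatuses[].lastState.terminated`. The helper below (its name and the idea of saving the pod JSON to a file are my own assumptions, not part of cass-operator) pulls those fields out of the output of `kubectl get pod <pod> -o json`.

```python
import json


def last_terminations(pod):
    """Return (container, reason, exit_code) for every container in the pod
    that has a recorded previous termination.

    `pod` is the parsed JSON of a single Pod object.
    """
    results = []
    status = pod.get("status", {})
    for container_status in status.get("containerStatuses", []):
        terminated = container_status.get("lastState", {}).get("terminated")
        if terminated:
            results.append(
                (
                    container_status.get("name"),
                    terminated.get("reason"),
                    terminated.get("exitCode"),
                )
            )
    return results


def load_pod(path):
    """Load a Pod object previously saved with `kubectl get pod ... -o json`."""
    with open(path) as handle:
        return json.load(handle)
```

Note that this only covers container restarts inside the same pod object. If the pod itself is deleted and recreated, the new pod starts with an empty `lastState`, and the events and the operator logs are the better place to look.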
Thank you for the pointer! The Cassandra logs show no error explaining why the container is failing. I am attaching the logs for your reference: cassandra-container-logs.txt
The cassandra container's log could tell whether the /drain endpoint was called (it's a shutdown hook for the pod) or whether the shutdown came from another source. If it's the shutdown hook, then something should indicate why it was shut down.
Did cass-operator logs have any indications? If it did kill the pod, it should log why.
I checked the cass-operator logs and found this error for the 0th index pod:
2023-08-23T11:18:34.200Z INFO client::callNodeMgmtEndpoint {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba"}
2023-08-23T11:18:34.200Z DEBUG events Starting Cassandra for pod cassandra-default-sts-0 {"type": "Normal", "object": {"kind":"CassandraDatacenter","namespace":"k8ssandra","name":"cassandra","uid":"21cca51b-d70f-47a3-8086-0ef3416cf6a4","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"2114359"}, "reason": "StartingCassandra"}
2023-08-23T11:18:34.215Z INFO Failed to start pod cassandra-default-sts-0, deleting it {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba", "reason": "StartingCassandra", "eventType": "Warning"}
2023-08-23T11:18:34.215Z ERROR controllers.CassandraDatacenter calculateReconciliationActions returned an error {"cassandradatacenter": "k8ssandra/cassandra", "requestNamespace": "k8ssandra", "requestName": "cassandra", "loopID": "94017e6a-a0ad-4314-a6e3-add6a7b32302", "error": "Post \"http://10.254.22.169:8080/api/v0/lifecycle/start\": dial tcp 10.254.22.169:8080: connect: connection refused"}
github.com/k8ssandra/cass-operator/controllers/cassandra.(*CassandraDatacenterReconciler).Reconcile
/workspace/controllers/cassandra/cassandradatacenter_controller.go:145
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234
2023-08-23T11:18:34.215Z INFO Post "http://10.254.22.169:8080/api/v0/lifecycle/start": dial tcp 10.254.22.169:8080: connect: connection refused {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba", "reason": "ReconcileFailed", "eventType": "Warning"}
2023-08-23T11:18:34.215Z INFO controllers.CassandraDatacenter Reconcile loop completed {"cassandradatacenter": "k8ssandra/cassandra", "requestNamespace": "k8ssandra", "requestName": "cassandra", "loopID": "94017e6a-a0ad-4314-a6e3-add6a7b32302", "duration": 0.018262743}
2023-08-23T11:18:34.215Z ERROR Reconciler error {"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"cassandra","namespace":"k8ssandra"}, "namespace": "k8ssandra", "name": "cassandra", "reconcileID": "cb182f6e-a759-4ad2-811b-1e9ec3812cba", "error": "Post \"http://10.254.22.169:8080/api/v0/lifecycle/start\": dial tcp 10.254.22.169:8080: connect: connection refused"}
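For a longer operator log, the interesting entries can be filtered out with a small script. This is only a sketch: it assumes the layout seen in the lines above (timestamp, level, free-text message, optional trailing JSON object), and the function names are hypothetical. The sample lines are abbreviated copies of the entries above, with most JSON fields dropped.

```python
import json

LEVELS = {"DEBUG", "INFO", "WARN", "ERROR"}


def parse_line(line):
    """Split one log line into (timestamp, level, message, fields).

    Returns None for lines that do not start with "<timestamp> <LEVEL> ...",
    such as the stack-trace lines.
    """
    parts = line.strip().split(None, 2)
    if len(parts) < 3 or parts[1] not in LEVELS:
        return None
    timestamp, level, rest = parts
    fields = {}
    start = rest.find('{"')
    if start != -1:
        try:
            fields = json.loads(rest[start:])
            rest = rest[:start].rstrip()
        except json.JSONDecodeError:
            pass  # no valid JSON suffix; keep the whole text as the message
    return timestamp, level, rest, fields


def find_problems(lines):
    """Keep entries logged at ERROR level or carrying eventType "Warning"."""
    problems = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        _, level, _, fields = parsed
        if level == "ERROR" or fields.get("eventType") == "Warning":
            problems.append(parsed)
    return problems


# Abbreviated sample lines based on the log excerpt in this thread.
SAMPLE = [
    '2023-08-23T11:18:34.200Z DEBUG events Starting Cassandra for pod cassandra-default-sts-0 {"type": "Normal", "reason": "StartingCassandra"}',
    '2023-08-23T11:18:34.215Z INFO Failed to start pod cassandra-default-sts-0, deleting it {"reason": "StartingCassandra", "eventType": "Warning"}',
    r'2023-08-23T11:18:34.215Z ERROR Reconciler error {"error": "Post \"http://10.254.22.169:8080/api/v0/lifecycle/start\": dial tcp 10.254.22.169:8080: connect: connection refused"}',
]

for timestamp, level, message, fields in find_problems(SAMPLE):
    print(timestamp, level, message, fields.get("error", fields.get("reason", "")))
```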
There is the reason:
2023-08-23T11:18:34.215Z INFO Post "http://10.254.22.169:8080/api/v0/lifecycle/start": dial tcp 10.254.22.169:8080: connect: connection refused
The management-api could not be contacted for some reason. The cassandra container logs might tell something, since they include the management-api logs; the server-system-logger container shows the logs of Cassandra itself.
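To check whether anything is listening on that port at all, a plain TCP connect is enough. A minimal sketch, assuming it is run from somewhere that can reach pod IPs (for example another pod in the cluster); the function name is made up for illustration, and the address and port in the usage note are the ones from the log line above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    A refused connection, a timeout, or an unreachable host all
    surface as OSError and are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("10.254.22.169", 8080)` returning False while the pod is up would suggest that the management-api process has not started listening yet (or has exited), which matches the "connection refused" the operator reports.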
Hi Team, we are working on deploying cass-operator (v1.14.0) and a CassandraDatacenter on Power architecture. The operator deploys successfully, but when we deploy the CassandraDatacenter, the pod with index 0 terminates repeatedly while the pods with index 1 and 2 run fine. We installed the operator using the Helm chart.
Any pointers regarding this will be helpful, Thank you.