Closed · adejanovski closed this issue 5 months ago
Are you sure the pods are actually getting restarted correctly? The logs indicate the event: Deleting stuck pod: dogfood-dc2-default-sts-1. Reason: Pod got stuck after Cassandra container terminated.
And this isn't a very fast operation: that kill reason requires the -sts-1 pod's cassandra container to have been terminated for 10 minutes.
What is preventing the pod from restarting once the cassandra container has died? One of the containers was still alive after the cassandra container was killed; was it medusa or busybox (jmx-credentials)?
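To pin down which container kept the pod alive, a client-go snippet like the following can dump the per-container state of the stuck pod. This is only a diagnostic sketch, not cass-operator code; the namespace (`k8ssandra`) and the kubeconfig location are assumptions, and only the pod name is taken from the logs above.

```go
// Diagnostic sketch: print which containers in the stuck pod are still running
// and how long ago the cassandra container terminated.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pod, err := client.CoreV1().Pods("k8ssandra"). // namespace is an assumption
		Get(context.TODO(), "dogfood-dc2-default-sts-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	for _, cs := range pod.Status.ContainerStatuses {
		switch {
		case cs.State.Running != nil:
			fmt.Printf("%s: still running since %s\n", cs.Name, cs.State.Running.StartedAt)
		case cs.State.Terminated != nil:
			dead := time.Since(cs.State.Terminated.FinishedAt.Time)
			fmt.Printf("%s: terminated %s ago (exit code %d)\n", cs.Name, dead, cs.State.Terminated.ExitCode)
		default:
			fmt.Printf("%s: waiting\n", cs.Name)
		}
	}
}
```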
Are you sure the pods are actually getting restarted correctly?
What do you mean by that? Everything starts with a rolling restart where -sts-2 gets restarted, but followed too quickly by -sts-1. I can assure you that only a few seconds have passed between these restarts.
What is preventing the pod from restarting once cassandra container has died? One of the containers is still alive after cassandra container was killed, was it medusa or busybox (jmx-credentials) ?
Could be medusa indeed; it is deployed on this cluster.
What do you mean by that? Everything starts with a rolling restart where -sts-2 gets restarted, but followed too quickly by -sts-1. I can assure you that only a few seconds have passed between these restarts.
That's not what the logs you pasted say. They don't say anything about restarting -sts-1; it's not the rolling restart process that caused -sts-1 to be restarted in this case.
It is triggering this code for -sts-1: https://github.com/k8ssandra/cass-operator/blob/fd79c991396ec80546786e28a8a8697e21cd886d/pkg/reconciliation/reconcile_racks.go#L1284
And that means Kubernetes has reported that -sts-1 has had its cassandra container dead for 10 minutes. The actual rolling restart logs another line, which is not what your logs are pointing at (indicating either that the pod was never restarted by cass-operator, or that the log is not the entire log but a snippet telling an incomplete story).
"Restarting Cassandra for pod %s", pod.Name is the event it creates when the rolling restart process is triggered. But we only see that for -sts-2 in the logs; -sts-1 and -sts-0 were never part of that process in that log.
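One way to confirm which of the two paths acted on each pod is to look at the events cass-operator recorded against the StatefulSet pods. The snippet below is only a diagnostic sketch, not operator code; the namespace is an assumption, and only the two message texts quoted in this thread ("Restarting Cassandra for pod ..." for the rolling restart path, "Deleting stuck pod ..." for the stuck-pod path) are taken from the discussion above.

```go
// Diagnostic sketch: list events attached to the dogfood-dc2 pods so the
// rolling-restart path can be told apart from the stuck-pod deletion path.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	events, err := client.CoreV1().Events("k8ssandra").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	for _, ev := range events.Items {
		// Only look at events attached to the dogfood-dc2 StatefulSet pods.
		if !strings.HasPrefix(ev.InvolvedObject.Name, "dogfood-dc2-default-sts-") {
			continue
		}
		fmt.Printf("%s  %s  %s: %s\n",
			ev.LastTimestamp.Format("2006-01-02 15:04:05"),
			ev.InvolvedObject.Name, ev.Reason, ev.Message)
	}
}
```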
What happened? After requesting a rolling restart on a datacenter with 3 Cassandra nodes, cass-operator restarts the -sts-2 pod and, sometimes a few seconds later, -sts-1 gets terminated by cass-operator, making two replicas unavailable in the rack and lowering availability.
Did you expect to see something different? cass-operator should delay pod restarts to avoid being overly sensitive, and take other down nodes into account to evaluate what can safely be done or not.
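As a rough illustration of the requested behaviour, a guard like the sketch below could postpone stuck-pod deletion while any other pod in the rack is unhealthy. This is only a sketch of the suggestion, not how cass-operator currently behaves; the container name "cassandra" and the helper names are assumptions.

```go
// Sketch of the guard the reporter is asking for; NOT current cass-operator behaviour.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cassandraReady reports whether the pod's "cassandra" container is ready.
func cassandraReady(pod corev1.Pod) bool {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == "cassandra" {
			return cs.Ready
		}
	}
	return false
}

// safeToDeleteStuckPod returns true only when every other pod in the rack still has
// a ready cassandra container, so deleting the stuck pod never takes down a second replica.
func safeToDeleteStuckPod(stuck corev1.Pod, rackPods []corev1.Pod) bool {
	for _, p := range rackPods {
		if p.Name == stuck.Name {
			continue
		}
		if !cassandraReady(p) {
			// Another node is already down or restarting (e.g. -sts-2 during the
			// rolling restart), so postpone deleting the stuck pod.
			return false
		}
	}
	return true
}

func main() {
	mk := func(name string, ready bool) corev1.Pod {
		return corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Status: corev1.PodStatus{ContainerStatuses: []corev1.ContainerStatus{
				{Name: "cassandra", Ready: ready},
			}},
		}
	}
	// -sts-2 is restarting (not ready) and -sts-1 is stuck: deletion should wait.
	pods := []corev1.Pod{
		mk("dogfood-dc2-default-sts-0", true),
		mk("dogfood-dc2-default-sts-1", false),
		mk("dogfood-dc2-default-sts-2", false),
	}
	fmt.Println(safeToDeleteStuckPod(pods[1], pods)) // false
}
```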
How to reproduce it (as minimally and precisely as possible): Request a rolling restart on a cluster. This doesn't happen every time though.
Environment
* Cass Operator version: v1.12.0
* Kubernetes version information: `kubectl version`
* Kubernetes cluster kind: `GKE`
Manifests:
Anything else we need to know?:
Issue is synchronized with this Jira Task by Unito. friendlyId: K8SSAND-1698. Priority: Medium.