k8ssandra / cass-operator

The DataStax Kubernetes Operator for Apache Cassandra
https://docs.datastax.com/en/cass-operator/doc/cass-operator/cassOperatorGettingStarted.html
Apache License 2.0

K8SSAND-1698 ⁃ cass-operator can stop several nodes at the same time during a rolling restart #382

Closed adejanovski closed 5 months ago

adejanovski commented 2 years ago

What happened? After requesting a rolling restart on a datacenter with 3 Cassandra nodes, cass-operator restarts the -sts-2 pod and sometimes a few seconds later -sts-1 gets terminated by cass-operator, making two replicas unavailable in the rack and lowering availability.

Did you expect to see something different? cass-operator should delay pod restarts so that it does not react too eagerly, and it should take other down nodes into account when evaluating what can safely be done.

How to reproduce it (as minimally and precisely as possible): Request a rolling restart on a cluster. This doesn't happen every time, though.

Environment

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  annotations:
    k8ssandra.io/resource-hash: A8dzZIjuvAAoVvW4lKyLYxNPuFWlDc4xdbx8F4+0IB4=
  creationTimestamp: '2022-02-17T14:09:22Z'
  finalizers:
    - finalizer.cassandra.datastax.com
  generation: 101
  labels:
    app.kubernetes.io/component: cassandra
    app.kubernetes.io/created-by: k8ssandracluster-controller
    app.kubernetes.io/name: k8ssandra-operator
    app.kubernetes.io/part-of: k8ssandra
    k8ssandra.io/cluster-name: dogfood
    k8ssandra.io/cluster-namespace: k8ssandra-operator
  name: dc2
  namespace: k8ssandra-operator
  resourceVersion: '416324174'
  uid: b2cd8c7c-fbcf-45c6-be36-1943cc3f1f91
  selfLink: >-
    /apis/cassandra.datastax.com/v1beta1/namespaces/k8ssandra-operator/cassandradatacenters/dc2
status:
  cassandraOperatorProgress: Ready
  conditions:
    - lastTransitionTime: '2022-05-17T15:37:15Z'
      message: ''
      reason: ''
      status: 'False'
      type: Stopped
    - lastTransitionTime: '2022-02-17T14:26:55Z'
      message: ''
      reason: ''
      status: 'False'
      type: ReplacingNodes
    - lastTransitionTime: '2022-07-22T07:29:39Z'
      message: ''
      reason: ''
      status: 'False'
      type: Updating
    - lastTransitionTime: '2022-07-26T08:34:53Z'
      message: ''
      reason: ''
      status: 'False'
      type: RollingRestart
    - lastTransitionTime: '2022-05-17T15:44:07Z'
      message: ''
      reason: ''
      status: 'False'
      type: Resuming
    - lastTransitionTime: '2022-02-17T14:26:55Z'
      message: ''
      reason: ''
      status: 'False'
      type: ScalingDown
    - lastTransitionTime: '2022-02-17T14:26:55Z'
      message: ''
      reason: ''
      status: 'True'
      type: Valid
    - lastTransitionTime: '2022-02-17T14:26:56Z'
      message: ''
      reason: ''
      status: 'True'
      type: Initialized
    - lastTransitionTime: '2022-05-17T15:44:08Z'
      message: ''
      reason: ''
      status: 'True'
      type: Ready
    - lastTransitionTime: '2022-07-21T12:00:19Z'
      message: ''
      reason: ''
      status: 'True'
      type: Healthy
  lastRollingRestart: '2022-07-26T08:29:59Z'
  lastServerNodeStarted: '2022-07-26T08:34:13Z'
  nodeStatuses:
    dogfood-dc2-default-sts-0:
      hostID: 6adc7220-4067-4bb4-9612-71c0fc0b52c8
    dogfood-dc2-default-sts-1:
      hostID: 7ea10675-aead-44a1-990f-281b17e24e13
    dogfood-dc2-default-sts-2:
      hostID: 555fbf43-a7a8-44ed-9799-da1108f5f782
  observedGeneration: 99
  quietPeriod: '2022-07-26T14:51:18Z'
  superUserUpserted: '2022-07-26T14:51:13Z'
  usersUpserted: '2022-07-26T14:51:13Z'
spec:
  additionalServiceConfig:
    additionalSeedService: {}
    allpodsService: {}
    dcService: {}
    nodePortService: {}
    seedService: {}
  clusterName: dogfood
  config:
    cassandra-env-sh:
      additional-jvm-opts:
        - '-Dcassandra.allow_alter_rf_during_range_movement=true'
        - '-Dcassandra.system_distributed_replication=dc1:3,dc2:3'
        - '-Dcom.sun.management.jmxremote.authenticate=true'
    cassandra-yaml:
      authenticator: PasswordAuthenticator
      authorizer: CassandraAuthorizer
      num_tokens: 16
      role_manager: CassandraRoleManager
    jvm-server-options:
      initial_heap_size: 524288000
      max_heap_size: 524288000
  configBuilderResources: {}
  managementApiAuth: {}
  podTemplateSpec:
    metadata: {}
    spec:
      containers:
        - env:
            - name: LOCAL_JMX
              value: 'no'
            - name: METRIC_FILTERS
              value: >-
                deny:org.apache.cassandra.metrics.Table
                deny:org.apache.cassandra.metrics.table
                allow:org.apache.cassandra.metrics.table.live_ss_table_count
                allow:org.apache.cassandra.metrics.Table.LiveSSTableCount
                allow:org.apache.cassandra.metrics.table.live_disk_space_used
                allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
                allow:org.apache.cassandra.metrics.Table.Pending
                allow:org.apache.cassandra.metrics.Table.Memtable
                allow:org.apache.cassandra.metrics.Table.Compaction
                allow:org.apache.cassandra.metrics.table.read
                allow:org.apache.cassandra.metrics.table.write
                allow:org.apache.cassandra.metrics.table.range
                allow:org.apache.cassandra.metrics.table.coordinator
                allow:org.apache.cassandra.metrics.table.dropped_mutations
            - name: MANAGEMENT_API_HEAP_SIZE
              value: '67108864'
          name: cassandra
          resources: {}
        - env:
            - name: MEDUSA_MODE
              value: GRPC
            - name: MEDUSA_TMP_DIR
              value: /var/lib/cassandra
            - name: CQL_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-reaper-secret
            - name: CQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-reaper-secret
          image: docker.io/k8ssandra/medusa:0.13.4
          imagePullPolicy: IfNotPresent
          name: medusa
          ports:
            - containerPort: 50051
              name: grpc
              protocol: TCP
          resources:
            limits:
              memory: 8Gi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/cassandra
              name: server-config
            - mountPath: /var/lib/cassandra
              name: server-data
            - mountPath: /etc/medusa
              name: dogfood-medusa
            - mountPath: /etc/podinfo
              name: podinfo
            - mountPath: /etc/medusa-secrets
              name: medusa-bucket-key
      initContainers:
        - args:
            - /bin/sh
            - '-c'
            - >-
              echo "$SUPERUSER_JMX_USERNAME $SUPERUSER_JMX_PASSWORD" >>
              /config/jmxremote.password && echo "$REAPER_JMX_USERNAME
              $REAPER_JMX_PASSWORD" >> /config/jmxremote.password
          env:
            - name: SUPERUSER_JMX_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-superuser-secret
            - name: SUPERUSER_JMX_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-superuser-secret
            - name: REAPER_JMX_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-reaper-jmx-secret
            - name: REAPER_JMX_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-reaper-jmx-secret
          image: docker.io/library/busybox:1.34.1
          imagePullPolicy: IfNotPresent
          name: jmx-credentials
          resources: {}
          volumeMounts:
            - mountPath: /config
              name: server-config
        - name: server-config-init
          resources: {}
        - env:
            - name: MEDUSA_MODE
              value: RESTORE
            - name: MEDUSA_TMP_DIR
              value: /var/lib/cassandra
            - name: CQL_USERNAME
              valueFrom:
                secretKeyRef:
                  key: username
                  name: dogfood-reaper-secret
            - name: CQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: dogfood-reaper-secret
            - name: BACKUP_NAME
              value: medusa-backup-20220517-1
            - name: RESTORE_KEY
              value: 61be3cb6-f8d3-47c1-a5e2-169823c0f9f2
          image: docker.io/k8ssandra/medusa:0.13.4
          imagePullPolicy: IfNotPresent
          name: medusa-restore
          resources:
            limits:
              memory: 8Gi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/cassandra
              name: server-config
            - mountPath: /var/lib/cassandra
              name: server-data
            - mountPath: /etc/medusa
              name: dogfood-medusa
            - mountPath: /etc/podinfo
              name: podinfo
            - mountPath: /etc/medusa-secrets
              name: medusa-bucket-key
      volumes:
        - configMap:
            name: dogfood-medusa
          name: dogfood-medusa
        - name: medusa-bucket-key
          secret:
            secretName: medusa-bucket-key
        - downwardAPI:
            items:
              - fieldRef:
                  fieldPath: metadata.labels
                path: labels
          name: podinfo
  resources:
    requests:
      memory: 2Gi
  serverType: cassandra
  serverVersion: 4.0.3
  size: 3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: standard
  superuserSecretName: dogfood-superuser-secret
  systemLoggerResources: {}
  tolerations:
    - effect: NoSchedule
      key: k8ssandra-version
      operator: Equal
      value: 2.x
  users:
    - secretName: dogfood-reaper-secret
      superuser: true
    - secretName: dogfood-reaper-secret
      superuser: true
1.6588242052190342e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    Restarting Cassandra for pod dogfood-dc2-default-sts-2  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "reason": "RestartingCassandra", "eventType": "Normal"}
1.6588242052191195e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    calling Management API drain node - POST /api/v0/ops/node/drain {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "pod": "dogfood-dc2-default-sts-2"}
1.6588242052191548e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    client::callNodeMgmtEndpoint    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator"}
1.6588242052195036e+09  DEBUG   events  Normal  {"object": {"kind":"CassandraDatacenter","namespace":"k8ssandra-operator","name":"dc2","uid":"b2cd8c7c-fbcf-45c6-be36-1943cc3f1f91","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"416000148"}, "reason": "RestartingCassandra", "message": "Restarting Cassandra for pod dogfood-dc2-default-sts-2"}
1.6588242161249676e+09  INFO    controllers.CassandraDatacenter Reconcile loop completed    {"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "79913910-38ef-4b25-810f-12005d1bbd31", "duration": 10.956039281}
1.6588242161250792e+09  INFO    controllers.CassandraDatacenter ======== handler::Reconcile has been called {"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "9a0d915f-d4e3-4d1e-a17e-b4ed92333406"}
1.6588242161251044e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    handler::CreateReconciliationContext    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator"}
1.6588242161255727e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    handler::calculateReconciliationActions {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161256244e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_services::ReconcileHeadlessServices   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161262634e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_endpoints::CheckAdditionalSeedEndpoints   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161262882e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::calculateRackInformation   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161263132e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconciliationContext::reconcileAllRacks    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161263268e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::listPods   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161268623e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    requesting Cassandra metadata endpoints from Node Management API    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "pod": "dogfood-dc2-default-sts-2"}
1.6588242161268892e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    client::callNodeMgmtEndpoint    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator"}
1.658824216134607e+09   INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckConfigSecret  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161346457e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckRackCreation  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161346512e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::getStatefulSetForRack  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216134793e+09   INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckRackLabels    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161350152e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckSuperuserSecretCreation   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161350894e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckInternodeCredentialCreation   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161351364e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    starting CheckRackForceUpgrade()    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161351483e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckRackScale {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216135153e+09   INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckPodsReady {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242161351576e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::findStartedNotReadyNodes   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162215562e+09  INFO    controllers.CassandraDatacenter Reconcile loop completed    {"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "9a0d915f-d4e3-4d1e-a17e-b4ed92333406", "duration": 0.096493093}
1.658824216222452e+09   INFO    controllers.CassandraDatacenter ======== handler::Reconcile has been called {"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "7eb8f2e1-d802-4dc8-a81e-5fff8d72fae8"}
1.6588242162224867e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    handler::CreateReconciliationContext    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator"}
1.6588242162232192e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    handler::calculateReconciliationActions {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162232454e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_services::ReconcileHeadlessServices   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216223856e+09   INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_endpoints::CheckAdditionalSeedEndpoints   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162238789e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::calculateRackInformation   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162238858e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconciliationContext::reconcileAllRacks    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162238944e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::listPods   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216224577e+09   INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    requesting Cassandra metadata endpoints from Node Management API    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "pod": "dogfood-dc2-default-sts-2"}
1.6588242162246015e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    client::callNodeMgmtEndpoint    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator"}
1.6588242162314386e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckConfigSecret  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.658824216231472e+09   INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckRackCreation  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162314782e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::getStatefulSetForRack  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162325387e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckRackLabels    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162330978e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckSuperuserSecretCreation   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162331574e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckInternodeCredentialCreation   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336335e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    starting CheckRackForceUpgrade()    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336566e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckRackScale {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336626e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::CheckPodsReady {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336743e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::findStartedNotReadyNodes   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336845e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    reconcile_racks::deleteStuckNodes   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162336988e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    Deleting stuck pod: dogfood-dc2-default-sts-1. Reason: Pod got stuck after Cassandra container terminated   {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "namespace": "k8ssandra-operator", "datacenterName": "dc2", "clusterName": "dogfood"}
1.6588242162337089e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    Pod got stuck after Cassandra container terminated  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "reason": "DeletingStuckPod", "eventType": "Warning"}
1.6588242162366488e+09  DEBUG   events  Warning {"object": {"kind":"CassandraDatacenter","namespace":"k8ssandra-operator","name":"dc2","uid":"b2cd8c7c-fbcf-45c6-be36-1943cc3f1f91","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"416000148"}, "reason": "DeletingStuckPod", "message": "Pod got stuck after Cassandra container terminated"}
1.6588242162792397e+09  ERROR   controllers.CassandraDatacenter calculateReconciliationActions returned an error    {"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "7eb8f2e1-d802-4dc8-a81e-5fff8d72fae8", "error": "pods \"dogfood-dc2-default-sts-1\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
1.6588242163423305e+09  INFO    controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    pods "dogfood-dc2-default-sts-1" not found  {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "reason": "ReconcileFailed", "eventType": "Warning"}
1.6588242163423748e+09  INFO    controllers.CassandraDatacenter Reconcile loop completed    {"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "7eb8f2e1-d802-4dc8-a81e-5fff8d72fae8", "duration": 0.11994769}
1.6588242163424113e+09  ERROR   controllers.CassandraDatacenter.cassandradatacenter_controller.controller.cassandradatacenter-controller    Reconciler error    {"reconciler group": "cassandra.datastax.com", "reconciler kind": "CassandraDatacenter", "name": "dc2", "namespace": "k8ssandra-operator", "error": "pods \"dogfood-dc2-default-sts-1\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227
1.6588242163424983e+09  INFO    controllers.CassandraDatacenter ======== handler::Reconcile has been called {"cassandradatacenter": "k8ssandra-operator/dc2", "requestNamespace": "k8ssandra-operator", "requestName": "dc2", "loopID": "f7d49596-dd9e-4c88-bf95-1b268da56449"}

Anything else we need to know?:

Issue is synchronized with this Jira Task by Unito (friendlyId: K8SSAND-1698, priority: Medium)

burmanm commented 2 years ago

Are you sure the pods are actually getting restarted correctly? The logs indicate the event: Deleting stuck pod: dogfood-dc2-default-sts-1. Reason: Pod got stuck after Cassandra container terminated.

And this isn't a very fast operation: that kill reason requires the -sts-1 pod's cassandra container to have been terminated for 10 minutes.

What is preventing the pod from restarting once cassandra container has died? One of the containers is still alive after cassandra container was killed, was it medusa or busybox (jmx-credentials) ?

adejanovski commented 2 years ago

> Are you sure the pods are actually getting restarted correctly?

What do you mean by that? Everything starts with a rolling restart where -sts-2 gets restarted, but followed too quickly by -sts-1. I can assure you that only a few seconds have passed between these restarts.

> What is preventing the pod from restarting once cassandra container has died? One of the containers is still alive after cassandra container was killed, was it medusa or busybox (jmx-credentials) ?

Could be medusa indeed, it is deployed on this cluster.

burmanm commented 2 years ago

> What do you mean by that? Everything starts with a rolling restart where -sts-2 gets restarted, but followed too quickly by -sts-1. I can assure you that only a few seconds have passed between these restarts.

That's not what the logs you pasted say. They do not say anything about restarting -sts-1; it's not the rolling restart process that caused -sts-1 to be restarted in this case.

It is triggering this code for -sts-1: https://github.com/k8ssandra/cass-operator/blob/fd79c991396ec80546786e28a8a8697e21cd886d/pkg/reconciliation/reconcile_racks.go#L1284

And that means Kubernetes has reported that the cassandra container of -sts-1 has been dead for 10 minutes. The actual rolling restart logs a different line, which does not appear for that pod in your logs (indicating either that the pod was never restarted by cass-operator, or that what you pasted is not the entire log but a snippet that tells an incomplete story).

"Restarting Cassandra for pod %s", pod.Name is the event it creates when the rolling restart process is triggered for a pod. But we only see that for -sts-2 in the logs; -sts-1 and -sts-0 were never part of that process in that log.