k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

On AWS, Cassandra doesn't come up after restoration - "A node with address already exists" #1325

Open c3-clement opened 6 months ago

c3-clement commented 6 months ago

What happened?

On AWS (EKS and an S3 bucket), the Cassandra StatefulSet sometimes does not come up after restoration. On a K8ssandra cluster with 1 DC of 3 nodes, 2 of the nodes do not pass their readiness probes. (screenshot attached)

On the failed pods, the following error message shows up in the logs:

java.lang.RuntimeException: A node with address /172.0.238.69:7000 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:749)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:1024)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:874)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:819)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:418)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:759)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:893)

I have observed this issue many times in our AWS environments and have never seen it in our GCP and Azure environments. It seems to be AWS-specific.
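
For anyone hitting the same symptom, a quick way to spot the stuck pods and confirm the endpoint-collision error above (a minimal sketch using the namespace, pod, and container names from this cluster; adjust them for your own deployment):

# List the Cassandra pods of the affected StatefulSet.
kubectl get pods -n eksdoucy | grep k8scass-cs-001-k8scass-001-default-sts
# The system log is tailed by the server-system-logger sidecar.
kubectl logs -n eksdoucy k8scass-cs-001-k8scass-001-default-sts-1 \
  -c server-system-logger | grep "already exists, cancelling join"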

Did you expect to see something different?

I expect the Cassandra StatefulSet to be up and running after restoration.

How to reproduce it (as minimally and precisely as possible):

  1. Setup S3 Bucket and EKS cluster
  2. Create k8ssandra cluster with 3 nodes and medusa enabled
  3. Do a backup using MedusaBackupJob
  4. Delete k8ssandra cluster (make sure pods and pvc are gone)
  5. Re-create the k8ssandra cluster
  6. Create a MedusaRestoreJob

The issue does not happen consistently; I would say it occurs on roughly 1 out of 5 attempts.
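
For reference, a minimal sketch of the backup (step 3) and restore (step 6) resources, applied with kubectl. The job names are illustrative; the datacenter and backup names are taken from the manifests attached below.

# Step 3: trigger a backup of the datacenter (illustrative job name).
kubectl apply -n eksdoucy -f - <<'EOF'
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupJob
metadata:
  name: k8scass-backup-example
spec:
  cassandraDatacenter: k8scass-001
  backupType: differential
EOF

# Step 6: after deleting and re-creating the K8ssandraCluster (steps 4-5),
# restore from the MedusaBackup produced above (illustrative job name).
kubectl apply -n eksdoucy -f - <<'EOF'
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaRestoreJob
metadata:
  name: k8scass-restore-example
spec:
  cassandraDatacenter: k8scass-001
  backup: k8scass-cs-001-k8scass-001-1716473456
EOF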

Environment

k8ssandra.yaml:

apiVersion: v1
items:
- apiVersion: k8ssandra.io/v1alpha1
  kind: K8ssandraCluster
  metadata:
    annotations:
      k8ssandra.io/initial-system-replication: '{"k8scass-001":3}'
    creationTimestamp: "2024-05-23T14:14:03Z"
    finalizers:
    - k8ssandracluster.k8ssandra.io/finalizer
    generation: 1
    labels:
      app: c3aiops
      ops.c3.ai/parent-resource: eksdoucy-c3cassandra-k8scass-cs-001
      role: c3aiops-C3Cassandra
    name: k8scass-cs-001
    namespace: eksdoucy
    resourceVersion: "19776927"
    uid: 73e05f81-753b-4cfb-b225-a502e118f121
  spec:
    auth: true
    cassandra:
      datacenters:
      - config:
          cassandraYaml:
            audit_logging_options:
              enabled: true
            commitlog_directory: /c3/cassandra/data/commitlog
            data_file_directories:
            - /c3/cassandra/data/data
            saved_caches_directory: /c3/cassandra/data/saved_caches
          jvmOptions:
            gc: G1GC
            heapSize: 512M
        containers:
        - args:
          - -c
          - tail -n+1 -F /c3/cassandra/logs/system.log
          command:
          - /bin/sh
          image: ci-registry.c3iot.io/bundled-cassandra-jre-bcfips:4.0.12-202404122213
          imagePullPolicy: Always
          name: server-system-logger
          resources: {}
          volumeMounts:
          - mountPath: /c3/cassandra
            name: server-data
        - args:
          - mgmtapi
          command:
          - /docker-entrypoint.sh
          name: cassandra
          resources: {}
          volumeMounts:
          - mountPath: /c3/cassandra
            name: server-data
        - env:
          - name: MEDUSA_TMP_DIR
            value: /c3/cassandra/medusa/tmp
          name: medusa
          resources: {}
          volumeMounts:
          - mountPath: /c3/cassandra
            name: server-data
        initContainers:
        - command:
          - /usr/local/bin/entrypoint
          image: ci-registry.c3iot.io/bundled-cassandra-jre-bcfips:4.0.12-202404122213
          imagePullPolicy: Always
          name: server-config-init
          resources: {}
        - env:
          - name: MEDUSA_TMP_DIR
            value: /c3/cassandra/medusa/tmp
          name: medusa-restore
          resources: {}
          volumeMounts:
          - mountPath: /c3/cassandra
            name: server-data
        jmxInitContainerImage:
          name: busybox
          registry: docker.io
          tag: 1.34.1
        metadata:
          name: k8scass-001
          pods:
            annotations:
              kubectl.kubernetes.io/last-applied-configuration: |
                {"apiVersion":"ops.c3.ai/v1","kind":"C3Cassandra","metadata":{"annotations":{},"labels":{"c3__app-0":"0c30","c3__id":"c3cassTest"},"name":"k8scass-cs-001","namespace":"eksdoucy"},"spec":{"config":{"auth":true,"cassandra":{"datacenters":[{"config":{"jvmOptions":{"gc":"G1GC","heapSize":"512M"}},"jmxInitContainerImage":{"name":"busybox","registry":"docker.io","tag":"1.34.1"},"metadata":{"name":"k8scass-001"},"perNodeConfigInitContainerImage":"mikefarah/yq:4","size":3,"stargate":{"allowStargateOnDataNodes":false,"containerImage":{"registry":"docker.io","repository":"stargateio","tag":"v1.0.45"},"heapSize":"256M","serviceAccount":"default","size":1},"stopped":false,"storageConfig":{"cassandraDataVolumeClaimSpec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"30Gi"}}}}}],"jmxInitContainerImage":{"name":"busybox","registry":"docker.io","tag":"1.34.1"},"perNodeConfigInitContainerImage":"mikefarah/yq:4","podSecurityContext":{"fsGroup":1001,"runAsGroup":1001,"runAsUser":1001},"serverType":"cassandra"}},"deleteChildResources":true,"resources":{"limits":{"cpu":2,"memory":"4Gi"},"requests":{"cpu":2,"memory":"4Gi"}},"restore":{"enabled":true,"restoreFrom":"2024-05-23T14:11:57Z"},"serverVersion":"4.0.13"}}
            labels:
              c3__app-0: 0c30
              c3__id: c3cassTest
          services:
            additionalSeedService: {}
            allPodsService: {}
            dcService: {}
            nodePortService: {}
            seedService: {}
        perNodeConfigInitContainerImage: mikefarah/yq:4
        perNodeConfigMapRef: {}
        podSecurityContext:
          fsGroup: 999
          runAsGroup: 999
          runAsUser: 999
        serviceAccount: c3cassandra
        size: 3
        stargate:
          allowStargateOnDataNodes: false
          containerImage:
            name: stargate-v1
            registry: ci-registry.c3iot.io
            repository: c3.ai
            tag: 1.0.79-r3-202405071212
          heapSize: 256M
          metadata:
            commonLabels:
              c3__app-0: 0c30
              c3__id: c3cassTest
            pods: {}
            service: {}
          secretsProvider: internal
          serviceAccount: c3cassandra
          size: 1
        stopped: false
        storageConfig:
          cassandraDataVolumeClaimSpec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 30Gi
      jmxInitContainerImage:
        name: busybox
        registry: docker.io
        tag: 1.34.1
      metadata:
        pods: {}
        services:
          additionalSeedService: {}
          allPodsService: {}
          dcService: {}
          nodePortService: {}
          seedService: {}
      perNodeConfigInitContainerImage: mikefarah/yq:4
      podSecurityContext:
        fsGroup: 1001
        runAsGroup: 1001
        runAsUser: 1001
      resources:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "2"
          memory: 4Gi
      serverType: cassandra
      serverVersion: 4.0.13
      softPodAntiAffinity: true
      superuserSecretRef: {}
      telemetry:
        mcac:
          enabled: true
          metricFilters:
          - deny:org.apache.cassandra.metrics.Table
          - deny:org.apache.cassandra.metrics.table
          - allow:org.apache.cassandra.metrics.table.live_ss_table_count
          - allow:org.apache.cassandra.metrics.Table.LiveSSTableCount
          - allow:org.apache.cassandra.metrics.table.live_disk_space_used
          - allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
          - allow:org.apache.cassandra.metrics.Table.Pending
          - allow:org.apache.cassandra.metrics.Table.Memtable
          - allow:org.apache.cassandra.metrics.Table.Compaction
          - allow:org.apache.cassandra.metrics.table.read
          - allow:org.apache.cassandra.metrics.table.write
          - allow:org.apache.cassandra.metrics.table.range
          - allow:org.apache.cassandra.metrics.table.coordinator
          - allow:org.apache.cassandra.metrics.table.dropped_mutations
          - allow:org.apache.cassandra.metrics.Table.TombstoneScannedHistogram
          - allow:org.apache.cassandra.metrics.table.tombstone_scanned_histogram
        prometheus:
          enabled: true
    medusa:
      cassandraUserSecretRef: {}
      certificatesSecretRef: {}
      containerImage:
        name: c3-cassandra-medusa-fips
        pullPolicy: Always
        registry: ci-registry.c3iot.io
        repository: c3.ai
        tag: 0.20.1-202404301417
      medusaConfigurationRef: {}
      storageProperties:
        bucketName: 850572970568--eksdoucy
        concurrentTransfers: 0
        credentialsType: role-based
        maxBackupAge: 7
        maxBackupCount: 0
        multiPartUploadThreshold: 104857600
        prefix: c3cassandra-backup
        region: us-east-1
        storageProvider: s3
        storageSecretRef: {}
        transferMaxBandwidth: 50MB/s
    reaper:
      ServiceAccountName: c3cassandra
      autoScheduling:
        enabled: true
        initialDelayPeriod: PT15S
        percentUnrepairedThreshold: 10
        periodBetweenPolls: PT10M
        repairType: AUTO
        scheduleSpreadPeriod: PT6H
        timeBeforeFirstSchedule: PT5M
      cassandraUserSecretRef:
        name: k8scass-cs-001-superuser
      containerImage:
        name: cassandra-reaper-jre-bcfips
        registry: ci-registry.c3iot.io
        repository: c3.ai
        tag: 3.5.0-202403142052
      deploymentMode: PER_DC
      heapSize: 2Gi
      httpManagement:
        enabled: false
      initContainerImage:
        name: cassandra-reaper-jre-bcfips
        registry: ci-registry.c3iot.io
        repository: c3.ai
        tag: 3.5.0-202403142052
      jmxUserSecretRef: {}
      keyspace: reaper_db
      metadata:
        annotations:
          kubectl.kubernetes.io/last-applied-configuration: |
            {"apiVersion":"ops.c3.ai/v1","kind":"C3Cassandra","metadata":{"annotations":{},"labels":{"c3__app-0":"0c30","c3__id":"c3cassTest"},"name":"k8scass-cs-001","namespace":"eksdoucy"},"spec":{"config":{"auth":true,"cassandra":{"datacenters":[{"config":{"jvmOptions":{"gc":"G1GC","heapSize":"512M"}},"jmxInitContainerImage":{"name":"busybox","registry":"docker.io","tag":"1.34.1"},"metadata":{"name":"k8scass-001"},"perNodeConfigInitContainerImage":"mikefarah/yq:4","size":3,"stargate":{"allowStargateOnDataNodes":false,"containerImage":{"registry":"docker.io","repository":"stargateio","tag":"v1.0.45"},"heapSize":"256M","serviceAccount":"default","size":1},"stopped":false,"storageConfig":{"cassandraDataVolumeClaimSpec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"30Gi"}}}}}],"jmxInitContainerImage":{"name":"busybox","registry":"docker.io","tag":"1.34.1"},"perNodeConfigInitContainerImage":"mikefarah/yq:4","podSecurityContext":{"fsGroup":1001,"runAsGroup":1001,"runAsUser":1001},"serverType":"cassandra"}},"deleteChildResources":true,"resources":{"limits":{"cpu":2,"memory":"4Gi"},"requests":{"cpu":2,"memory":"4Gi"}},"restore":{"enabled":true,"restoreFrom":"2024-05-23T14:11:57Z"},"serverVersion":"4.0.13"}}
        pods:
          labels:
            c3__app-0: 0c30
            c3__id: c3cassTest
        service: {}
      secretsProvider: internal
    secretsProvider: internal
  status:
    conditions:
    - lastTransitionTime: "2024-05-23T14:18:59Z"
      status: "True"
      type: CassandraInitialized
    datacenters:
      k8scass-001:
        cassandra:
          cassandraOperatorProgress: Updating
          conditions:
          - lastTransitionTime: "2024-05-23T14:16:05Z"
            message: ""
            reason: ""
            status: "True"
            type: Healthy
          - lastTransitionTime: "2024-05-23T14:19:34Z"
            message: ""
            reason: ""
            status: "False"
            type: Stopped
          - lastTransitionTime: "2024-05-23T14:16:34Z"
            message: ""
            reason: ""
            status: "False"
            type: Ready
          - lastTransitionTime: "2024-05-23T14:19:24Z"
            message: ""
            reason: ""
            status: "False"
            type: Updating
          - lastTransitionTime: "2024-05-23T14:18:39Z"
            message: ""
            reason: ""
            status: "False"
            type: ReplacingNodes
          - lastTransitionTime: "2024-05-23T14:18:39Z"
            message: ""
            reason: ""
            status: "False"
            type: RollingRestart
          - lastTransitionTime: "2024-05-23T14:19:34Z"
            message: ""
            reason: ""
            status: "True"
            type: Resuming
          - lastTransitionTime: "2024-05-23T14:18:39Z"
            message: ""
            reason: ""
            status: "False"
            type: ScalingDown
          - lastTransitionTime: "2024-05-23T14:18:39Z"
            message: ""
            reason: ""
            status: "True"
            type: Valid
          - lastTransitionTime: "2024-05-23T14:18:59Z"
            message: ""
            reason: ""
            status: "True"
            type: Initialized
          datacenterName: ""
          lastServerNodeStarted: "2024-05-23T14:21:21Z"
          nodeStatuses:
            k8scass-cs-001-k8scass-001-default-sts-0:
              hostID: d90060dc-32fd-4861-9d2e-56883976179f
          observedGeneration: 3
          quietPeriod: "2024-05-23T14:19:30Z"
    error: None
kind: List
metadata:
  resourceVersion: ""

medusarestorejob.yaml:

apiVersion: v1
items:
- apiVersion: medusa.k8ssandra.io/v1alpha1
  kind: MedusaRestoreJob
  metadata:
    creationTimestamp: "2024-05-23T14:16:03Z"
    generation: 1
    labels:
      app: c3aiops
      ops.c3.ai/c3cassandra: k8scass-cs-001
      ops.c3.ai/parent-resource: eksdoucy-c3cassandrarestore-k8scass-cs-001-nxr57
      role: c3aiops-C3CassandraRestore
    name: k8scass-cs-001-nxr57-k8scass-001
    namespace: eksdoucy
    ownerReferences:
    - apiVersion: cassandra.datastax.com/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: CassandraDatacenter
      name: k8scass-001
      uid: 2654ce97-3241-419e-ac31-7a7047962a04
    resourceVersion: "19775662"
    uid: 7a4d0465-5dba-4bd0-a17f-e1641ea95c6f
  spec:
    backup: k8scass-cs-001-k8scass-001-1716473456
    cassandraDatacenter: k8scass-001
  status:
    datacenterStopped: "2024-05-23T14:19:04Z"
    restoreKey: 119ab032-1914-4ab4-9e97-461f2f1c8a2e
    restoreMapping:
      host_map:
        k8scass-cs-001-k8scass-001-default-sts-0:
          seed: false
          source:
          - k8scass-cs-001-k8scass-001-default-sts-0
        k8scass-cs-001-k8scass-001-default-sts-1:
          seed: false
          source:
          - k8scass-cs-001-k8scass-001-default-sts-1
        k8scass-cs-001-k8scass-001-default-sts-2:
          seed: false
          source:
          - k8scass-cs-001-k8scass-001-default-sts-2
      in_place: true
    restorePrepared: true
    startTime: "2024-05-23T14:16:18Z"
kind: List
metadata:
  resourceVersion: ""

medusabackup.yaml:

apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackup
metadata:
  creationTimestamp: "2024-05-23T14:11:57Z"
  generation: 1
  name: k8scass-cs-001-k8scass-001-1716473456
  namespace: eksdoucy
  resourceVersion: "19772110"
  uid: 6acba111-7afd-4c18-bb13-cea9ae9835be
spec:
  backupType: differential
  cassandraDatacenter: k8scass-001
status:
  finishTime: "2024-05-23T14:11:56Z"
  finishedNodes: 3
  nodes:
  - datacenter: k8scass-001
    host: k8scass-cs-001-k8scass-001-default-sts-0
    rack: default
    tokens:
    - -1378304883689096200
    - -2449805500029129000
    - -360453478946669200
    - -3890073729931712500
    - -4734450377422874000
    - -5825747488115086000
    - -6921789388350025000
    - -8700423689173612000
    - 1532974334891261200
    - 2268890027602154800
    - 3348262835049756700
    - 457218453665681600
    - 4591782840730348500
    - 5559147083808884000
    - 7149151636322782000
    - 8768174932019826000
  - datacenter: k8scass-001
    host: k8scass-cs-001-k8scass-001-default-sts-2
    rack: default
    tokens:
    - -1005742008666699400
    - -2068094408337051000
    - -2730225914639292400
    - -3394734616816550400
    - -4221581073554663400
    - -5464019388234796000
    - -6521576072924844000
    - -7188506821845786000
    - -8335830530201462000
    - 132384530723195550
    - 1964300259678278100
    - 2678595724075939300
    - 3795834174947636700
    - 6867557899637203000
    - 8500038863641467000
    - 9133022687114601000
  - datacenter: k8scass-001
    host: k8scass-cs-001-k8scass-001-default-sts-1
    rack: default
    tokens:
    - -1831824468126929000
    - -3090245059270582000
    - -5211939959272817000
    - -6269316476201775000
    - -7619518302776453000
    - -767483478353865600
    - -8045804804057583000
    - -9080105450732987000
    - 1052520822626080900
    - 2929585152139067000
    - 4098540075318081000
    - 5147663875647272000
    - 6083577048438788000
    - 6499214708010427000
    - 7710895054682409000
    - 8108217907031122000
  startTime: "2024-05-23T14:11:11Z"
  status: SUCCESS
  totalFiles: 1023
  totalNodes: 3
  totalSize: 1.30 MB

Anything else we need to know?:

k8scass-cs-001-k8scass-001-default-sts-1 medusa-restore logs: medusa-restore-logs.txt

k8scass-cs-001-k8scass-001-default-sts-1 server-system-logger logs: server-system-logger.txt

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: K8OP-22

c3-clement commented 6 months ago

Note that after the issue happens, I can delete the K8ssandra cluster, re-create it, and re-create a Medusa restore job. The restoration will then succeed and the 3 Cassandra pods will come up.

This makes me think that something is going wrong in the restoration rather than in the backup.
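
For clarity, a sketch of that workaround using the names from this issue, assuming the K8ssandraCluster and MedusaRestoreJob manifests are saved locally as k8ssandra.yaml and medusarestorejob.yaml:

# Delete the cluster and wait until its pods and PVCs are gone.
kubectl delete k8ssandracluster k8scass-cs-001 -n eksdoucy
kubectl get pods,pvc -n eksdoucy | grep k8scass-cs-001
# Re-create the cluster, then re-create the restore job.
kubectl apply -f k8ssandra.yaml
kubectl apply -f medusarestorejob.yaml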

rzvoncek commented 5 months ago

Hello @c3-clement! I've spent some time trying to reproduce this on an AWS EKS cluster, but with MinIO instead of S3. I'm sorry to conclude that I did not manage to reproduce the issue.

I only saw two things that were a bit odd. First, after the restore, the new nodes attempted to gossip with one or two of the old ones. This can be explained by the operator restoring the system.peers table. Second, I did see two extra conditions on my datacenter (the oldest ones), but I can't imagine how those would impact the restore.
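
One way to check whether a restored node still carries stale peer entries (a sketch using standard nodetool/cqlsh calls; the superuser credentials come from the k8scass-cs-001-superuser secret referenced in the manifest above, and nodetool may additionally need JMX credentials if JMX auth is enabled):

# Ring view as seen by a restored node.
kubectl exec -n eksdoucy k8scass-cs-001-k8scass-001-default-sts-0 -c cassandra -- nodetool status
# system.peers should only contain the current pod IPs; a leftover old IP there
# would match the "A node with address ... already exists" collision.
CASS_USER=$(kubectl get secret k8scass-cs-001-superuser -n eksdoucy -o jsonpath='{.data.username}' | base64 -d)
CASS_PASS=$(kubectl get secret k8scass-cs-001-superuser -n eksdoucy -o jsonpath='{.data.password}' | base64 -d)
kubectl exec -n eksdoucy k8scass-cs-001-k8scass-001-default-sts-0 -c cassandra -- \
  cqlsh -u "$CASS_USER" -p "$CASS_PASS" -e "SELECT peer, host_id FROM system.peers;"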