scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

Cluster creation failing #1794

Closed by christopher-wong 5 months ago

christopher-wong commented 5 months ago

What happened?

I'm unable to create a new ScyllaDB cluster using the operator. The first pod in a 3-node cluster starts successfully, but the second pod never becomes healthy.

scylla-manager-agent {"L":"INFO","T":"2024-03-06T21:55:36.191Z","N":"wait","M":"Waiting for network connection","sleep":"2s","error":"dial tcp 0.0.0.0:10000: connect: connection refused","_trace_id":"nGuL9sDsSIG_DxDpE54-DA"}

scylla E0306 21:56:38.878254       1 sidecar/probes.go:145] "healthz probe: can't connect to Scylla API" err="Get \"http://localhost/system/uptime_ms\": dial tcp [::1]:10000: connect: connection refused" Service="fs-platform/scylladb-normal-ot-nrmotfsvs-1"

pod-2-scylla.log pod-2-agent.log

During startup, pod 2 also logs the following:

Warning  FailedMount       11m (x5 over 11m)  kubelet  MountVolume.SetUp failed for volume "scylladb-serving-certs" : secret "scylladb-local-serving-certs" not found

Looking at the operator logs, I also see a number of errors:

E0306 21:43:40.546570       1 scyllacluster/controller.go:263] syncing key 'fs-platform/scylladb' failed: [can't sync agent token: can't get agent token: can't get secret fs-platform/scylla-agent-config: secret "scylla-agent-config" not found, can't sync certificates: [can't make certificate "scylladb-local-serving-certs": can't create certificate: can't sign certificate for "": certificate requires either CommonName, IPAddresses or DNSNames to be set, secret "fs-platform/scylladb-local-user-admin" doesn't exist or is not own by this object]]

operator.log

What did you expect to happen?

I expected the operator to successfully create a new 3-node ScyllaDB cluster.

How can we reproduce it (as minimally and precisely as possible)?

Scylla Operator version

1.11.2 (deployed via Helm)

Kubernetes platform name and version

Client Version: v1.29.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.26.7+rke2r1

Kubernetes platform info: Rancher, on-prem

Please attach the must-gather archive.

scylla-operator-must-gather-flgzpfjxxzht.zip

Anything else we need to know?

apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  labels:
    controller-tools.k8s.io: "1.0"
  name: scylladb
  namespace: fs-platform
spec:
  version: 5.2.9
  agentVersion: 3.1.2
  developerMode: true
  datacenter:
    name: normal-ot
    racks:
      - name: nrmotfsvs
        scyllaConfig: "scylla-config"
        scyllaAgentConfig: "scylla-agent-config"
        members: 3
        storage:
          capacity: 8Gi
          # storageClassName: xfs-local-path
        resources:
          requests:
            cpu: "2"
            memory: "8Gi"
          limits:
            cpu: "2"
            memory: "8Gi"
        volumes:
          - name: coredumpfs
            hostPath:
              path: /tmp/coredumps
        volumeMounts:
          - mountPath: /tmp/coredumps
            name: coredumpfs

christopher-wong commented 5 months ago

This seems to have been resolved by raising the fs.aio-max-nr sysctl for the Scylla pods, set through the sysctls field of the ScyllaCluster spec:

spec:
  sysctls:
  - "fs.aio-max-nr=2097152"
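For anyone hitting the same symptom, a quick way to check whether the kernel's asynchronous I/O limit is plausibly the bottleneck is to compare the limit with current usage on the affected node. The following is a minimal sketch, assuming a Linux node where /proc/sys/fs/aio-max-nr and /proc/sys/fs/aio-nr are readable; the function name aio_headroom and its proc_root parameter are hypothetical and only serve the illustration. The thread does not confirm that AIO exhaustion was the root cause, so treat a low result as a hint rather than a diagnosis.

```python
from pathlib import Path


def aio_headroom(proc_root: str = "/proc/sys/fs") -> dict:
    """Read the kernel AIO limit and current usage, and compute what is left.

    proc_root is a hypothetical parameter added so the function can be
    pointed at a different directory (for example, in a test).
    """
    root = Path(proc_root)
    # aio-max-nr: maximum number of concurrent AIO requests allowed.
    max_nr = int((root / "aio-max-nr").read_text().strip())
    # aio-nr: number of AIO requests currently reserved.
    in_use = int((root / "aio-nr").read_text().strip())
    return {"aio_max_nr": max_nr, "aio_nr": in_use, "free": max_nr - in_use}


if __name__ == "__main__":
    info = aio_headroom()
    print(info)
    if info["free"] <= 0:
        print("No AIO slots left; consider raising fs.aio-max-nr.")
```

Running this before and after applying the sysctls change should show aio_max_nr move to the configured value (2097152 in the snippet above) if the setting took effect.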