k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

Superuser secret is not getting updated with a new secret reference #566

Open andrey-dubnik opened 2 years ago

andrey-dubnik commented 2 years ago

What happened?

We would like to change the superuser secret reference from the default dev-westeurope-01-superuser to cassandra-superuser (the password will remain the same; we just want to use a manually created secret).

We added the block below, which works for new builds:

cassandra:          
    superuserSecretRef: 
        name: cassandra-superuser

What we found is that this block does not change the existing secret reference:

1.6551330905527017e+09  DEBUG   events  Warning {"object": {"kind":"CassandraDatacenter","namespace":"temporal-state","name":"primary","uid":"71dc3bb0-708f-4993-8be5-fae8129e85e0","apiVersion":"cassandra.datastax.com/v1beta1","resourceVersion":"205502018"}, "reason": "ValidationFailed", "message": "Could not load superuser secret for CassandraCluster: temporal-state/dev-westeurope-01-superuser"}

We checked the CassandraDatacenter object and the reference is indeed not updated:

spec:
  superuserSecretName: dev-westeurope-01-superuser
  systemLoggerResources: {}
  users:
  - secretName: cassandra-reaper-cql
    superuser: true
  - secretName: cassandra-medusa
    superuser: true

Did you expect to see something different?

I would expect the superuser secret reference to be updated.

How to reproduce it (as minimally and precisely as possible):

1. Create a cluster with the default superuser secret.
2. Update the cluster to use a different superuser secret.

Environment

Issue is synchronized with this Jira Story by Unito. Issue Number: K8OP-170

jsanda commented 2 years ago

cass-operator does not allow the superuser secret reference to be updated to point to a different secret. We should allow the secret reference to change in order to better support rotating and changing credentials. I will create an issue in cass-operator for this.

k8ssandra-operator does allow the reference to be changed which means we unfortunately have some inconsistent behavior.

I have some questions for you to help with the investigation.

Did you deploy the operators with their validating webhooks? If so, you should see something like this:

kubectl get validatingwebhookconfigurations
NAME                                                  WEBHOOKS   AGE
cass-operator-validating-webhook-configuration        1          7d4h
cert-manager-webhook                                  1          7d4h
k8ssandra-operator-validating-webhook-configuration   1          7d4h

I ask because the validating webhook for cass-operator prevents the superuser secret reference from being updated. The error you hit happens during reconciliation in cass-operator, which means it occurs after the webhook runs.

Can you describe the topology of your K8ssandraCluster? Is it deployed across multiple k8s clusters or multiple namespaces?

Is the new secret in the same namespace as the K8ssandraCluster? It needs to be.

Can you check and see if the secret has these labels:

app.kubernetes.io/managed-by: k8ssandra-operator
k8ssandra.io/cluster-name: <your-cluster-name>
k8ssandra.io/cluster-namespace: <your-cluster-namespace>

The secret needs to have those labels in order to be replicated to the namespaces/k8s clusters where the CassandraDatacenters are deployed. The error you reported is triggered when cass-operator cannot find the secret, which makes me wonder whether the secret was replicated. Note that k8ssandra-operator should add the labels to your secret.
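For reference, a minimal sketch of applying those labels with kubectl. The cluster name dev-westeurope-01 is inferred from the default secret name in this thread and the namespace is temporal-state from the error message; adjust both to your environment:

# label the manually created superuser secret so k8ssandra-operator replicates it
kubectl label secret cassandra-superuser -n temporal-state \
  app.kubernetes.io/managed-by=k8ssandra-operator \
  k8ssandra.io/cluster-name=dev-westeurope-01 \
  k8ssandra.io/cluster-namespace=temporal-state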

andrey-dubnik commented 2 years ago

Here are the webhooks I have, which do include the k8ssandra hook:

kubectl get validatingwebhookconfigurations
NAME                                                            WEBHOOKS   AGE
actions-runner-controller-validating-webhook-configuration      3          132d
aks-node-validating-webhook                                     1          72d
cert-manager-webhook                                            1          132d
elastic-operator.temporal-visibility.k8s.elastic.co             10         26d
gatekeeper-validating-webhook-configuration                     2          198d
k8ssandra-k8ssandra-operator-validating-webhook-configuration   1          23d
kube-prometheus-stack-admission                                 1          149d

When we initially deployed k8ssandra v1 we used pre-loaded secrets matching the cluster name to drive the password. With v2 we figured we could now use a provided secret, but that was after the cluster had already been deployed.

When we deployed the provided secret we also deleted the original one matching the cluster name, thinking it was no longer necessary, which triggered the missing-secret error.

Surprisingly, the secret reference was updated in a few places but failed to update in one. E.g. below, the new secret is referenced in the init container env, while superuserSecretName still points to the old one:

      initContainers:
      - args:
        - /bin/sh
        - -c
        - echo "$SUPERUSER_JMX_USERNAME $SUPERUSER_JMX_PASSWORD" >> /config/jmxremote.password
          && echo "$REAPER_JMX_USERNAME $REAPER_JMX_PASSWORD" >> /config/jmxremote.password
        env:
        - name: SUPERUSER_JMX_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: cassandra-superuser
        - name: SUPERUSER_JMX_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: cassandra-superuser
...
  superuserSecretName: dev-westeurope-01-superuser
status:

As we were using the original secret from v1 and loaded it into v2, our cluster-name-matching secret didn't have those labels... we can add them if this is needed.

andrey-dubnik commented 2 years ago

And yes - the secret we have is in the same namespace as both the operator and the Cassandra cluster. We only have one Cassandra cluster per k8s cluster at the moment.

jsanda commented 2 years ago

Based on what you reported, the cass-operator webhook (cass-operator-validating-webhook-configuration) is not deployed. That makes sense given the original error you hit. How did you install k8ssandra-operator? I am curious how you wound up with the k8ssandra-operator webhook deployed but not the cass-operator one.

Go ahead and add the labels to the secret.

Can you share your K8ssandraCluster spec?

andrey-dubnik commented 2 years ago

We use Flux to deploy things.

This is a normal Helm release deployment which seems to use all default values:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: k8ssandra
  namespace: temporal-state
spec:
  releaseName: k8ssandra
  interval: 5m
  chart:
    spec:
      chart: k8ssandra-operator
      version: "=0.37.3"
      sourceRef:
        kind: HelmRepository
        name: k8ssandra
        namespace: temporal-state
  values:
    cass-operator:
      resources: 
        requests:
          cpu: 50m
          memory: 50Mi
        limits:
          cpu: 50m
          memory: 50Mi
    resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
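
As a side note on jsanda's earlier observation that the cass-operator validating webhook was not deployed: a hedged sketch of enabling it through the same HelmRelease values, assuming the cass-operator subchart exposes an admissionWebhooks.enabled key (the exact key may vary by chart version; check the chart's values.yaml):

  values:
    cass-operator:
      admissionWebhooks:
        enabled: true    # assumed key, not confirmed in this thread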

The cluster spec is as follows:

      apiVersion: k8ssandra.io/v1alpha1
      kind: K8ssandraCluster
      metadata:
        name: ${CASSANDRA_CLUSTER_NAME}
        namespace: temporal-state
      spec:
        reaper:
          cassandraUserSecretRef: 
            name: cassandra-reaper-cql
          jmxUserSecretRef: 
            name: cassandra-reaper-jmx
          uiUserSecretRef: 
            name: cassandra-reaper-ui
        medusa:
          cassandraUserSecretRef: 
            name: cassandra-medusa
          storageProperties:
            storageProvider: azure_blobs
            storageSecretRef:
              name: medusa-azure-credentials
            bucketName: cassandra-backups
        cassandra:          
          serverVersion: "4.0.3"
          superuserSecretRef: 
            name: cassandra-superuser
          datacenters:
            - metadata:
                name: ${CASSANDRA_DATACENTER}
                labels:
                  env: ${ENVIRONMENT_NAME}
                  app: temporal
                  product_id: service-composition
                  provider: azure
                  region: westeurope
                  k8s_cluster: ${CLUSTER_NAME}
                annotations:
                  prometheus.io/scrape: 'true'
                  prometheus.io/port: '9103'
              telemetry:
                prometheus:
                  enabled: true
              size: 3
              storageConfig:
                cassandraDataVolumeClaimSpec:
                  storageClassName: cassandra-csi
                  accessModes:
                    - ReadWriteOnce
                  resources:
                    requests:
                      storage: 1Ti
              resources:
                requests:
                  cpu: 2000m
                  memory: 10Gi
                limits:
                  cpu: 3500m
                  memory: 10Gi
              config:
                jvmOptions:
                  heapSize: 8G
                  gc: G1GC
                  gc_g1_rset_updating_pause_time_percent: 5
                  gc_g1_max_gc_pause_ms: 300
              racks:
              - name: az-1
                nodeAffinityLabels:
                  cassandra-rack: az1
              - name: az-2
                nodeAffinityLabels:
                  cassandra-rack: az2
              - name: az-3
                nodeAffinityLabels:
                  cassandra-rack: az3

andrey-dubnik commented 2 years ago

As all our secrets are in the same namespace there is no issue with the secrets themselves; the only issue was when I dropped the one matching the DC name, because I thought it was no longer needed after I updated the DC CRD with a new reference. Which secret do we need to add the labels to - the one matching the DC name or the new one?

jsanda commented 2 years ago

I thought it was no longer needed after I updated the DC CRD with a new reference

I think I understand better now. You need to make the change through the K8ssandraCluster; you should not modify the CassandraDatacenter object directly. The CassandraDatacenter is created/updated by k8ssandra-operator based on the K8ssandraCluster spec.

When you update the CassandraDatacenter directly, that triggers a reconciliation in k8ssandra-operator. k8ssandra-operator will see that the desired state of the CassandraDatacenter (as determined by the K8ssandraCluster spec) does not match the actual state. It will then update the CassandraDatacenter with the desired state, which means you will lose your changes.
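A concrete sketch of making the change at the right level, assuming a K8ssandraCluster named <cluster-name> in namespace temporal-state (placeholder name; the actual cluster name is templated in this thread):

# patch the K8ssandraCluster; k8ssandra-operator propagates it to the CassandraDatacenter
kubectl patch k8ssandracluster <cluster-name> -n temporal-state --type merge \
  -p '{"spec":{"cassandra":{"superuserSecretRef":{"name":"cassandra-superuser"}}}}'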

Lastly, I misspoke earlier when I said k8ssandra-operator allows you to change the superuser secret. It performs a check during reconciliation and will end the reconciliation with an error if you change the superuser. I will create a separate ticket for this.

andrey-dubnik commented 2 years ago

Some clarification from my end to avoid confusion. I did not actually update the Cassandra DC directly (it was tempting), because doing so while bypassing k8ssandra would likely have consequences.

What I actually did was update the k8ss cluster object and drop the DC-named secret, which resulted in the error message. Once I got the error I put the old secret back (just the secret, without any CRD update) and reconciliation completed without any issues. The k8ss cluster object still has the new superuser secret in the CRD.

After that sequence completed, the cass-operator DC object referenced both the old and the new superuser secret, in two different places.

jsanda commented 2 years ago

I'm still having a bit of trouble following :( Can you list out steps to reproduce? Then I will test.

andrey-dubnik commented 2 years ago

Here is the sequence; let me know if anything else needs clarifying.

Everything is done in a single namespace:

  1. Install k8ssandra with Helm.
  2. Create a secret matching the cluster name (see the sketch after this list).
  3. Create the k8ssandra cluster CRD without any secret references; it will use the secret from step 2 by default when deployed.
  4. Deploy the k8ssandra cluster.
  5. Create a new secret called cassandra-superuser.
  6. Update the k8ss CRD to reference the new superuser secret.
  7. Drop the current secret matching the cluster name.
  8. Deploy the k8ss CRD referencing the new superuser secret.
  9. Receive the error that there is no secret matching the one from step 2.
  10. Re-create the secret from step 2.
  11. The error should be gone and the cluster reconciled.
  12. Examine the CassandraDatacenter object: it references the new cassandra-superuser secret in some attributes and the secret from step 2 in others.
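For steps 2 and 5, a minimal sketch of creating the two secrets (namespace, user name, and password are placeholders; <cluster-name>-superuser follows the default naming seen in this thread):

# step 2: the default-named secret cass-operator looks for
kubectl create secret generic <cluster-name>-superuser -n <namespace> \
  --from-literal=username=<superuser-name> \
  --from-literal=password=<superuser-password>

# step 5: the manually managed replacement secret
kubectl create secret generic cassandra-superuser -n <namespace> \
  --from-literal=username=<superuser-name> \
  --from-literal=password=<superuser-password>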

lonniev commented 1 year ago

I am interested in how to do this secret reassignment properly. At this stage, I don't mind having to entirely redo the k8ssandra-operator specification and deployment. If there is a way to change the cassandra secret(s) after the fact, that would be useful.

I have an app-stack that wants to take over the Cassandra cluster and regard the cluster as its own with its own secret for administration of Cassandra services. Being able to change the name and password of that Cassandra admin would be convenient.

adejanovski commented 1 year ago

Hi @lonniev,

we have yet to implement proper secret rotation. Right now you have to rotate credentials manually, as the operator won't update them in Cassandra. This is definitely on our roadmap though.
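
Until that lands, a hedged sketch of the manual rotation (pod name, user name, and passwords are placeholders; the idea is to change the role's password in Cassandra first, then mirror it into the Kubernetes secret):

# 1. change the password inside Cassandra (run against one Cassandra pod)
kubectl exec -n <namespace> <cassandra-pod> -c cassandra -- \
  cqlsh -u <superuser-name> -p '<old-password>' \
  -e "ALTER ROLE <superuser-name> WITH PASSWORD = '<new-password>';"

# 2. update the Kubernetes secret to match
kubectl create secret generic cassandra-superuser -n <namespace> \
  --from-literal=username=<superuser-name> \
  --from-literal=password='<new-password>' \
  --dry-run=client -o yaml | kubectl apply -f -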