mariadb-operator / mariadb-operator

🦭 Run and operate MariaDB in a cloud native way

[Bug] Internal error occurred: failed calling webhook #285

Closed perfectra1n closed 9 months ago

perfectra1n commented 12 months ago

Hi there,

Sorry to be a bother again, I know that #267 exists - and I believe this is related but not exactly the same issue? I updated to the latest operator (released today), but it appears as though I'm still having the same issue:

one or more objects failed to apply, reason: Internal error occurred: failed calling webhook "mmariadb.kb.io": failed to call webhook: Post "https://mariadb-op-mariadb-operator-webhook.databases.svc:443/mutate-mariadb-mmontes-io-v1alpha1-mariadb?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "mariadb-op-mariadb-operator-webhook.databases.svc")

This occurs when I try to change spec.image for my active MariaDB resource. Is there a way to debug/workaround this error? I don't believe the error is specific to Galera. I installed the operator via Helm quite a while ago...

If there's any other information I can provide, please let me know :)


mmontes11 commented 12 months ago

Hey there, have you tried upgrading to 0.23.1? I've fixed an issue related to the certificates.
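
For anyone upgrading with the Helm CLI rather than a GitOps tool, a minimal sketch; the release name mariadb-op and the databases namespace are taken from the resource names in this thread and may differ in other installs:

# Refresh the chart repo and move the existing release to 0.23.1, keeping the original values.
helm repo add mariadb-operator https://mariadb-operator.github.io/mariadb-operator
helm repo update
helm upgrade mariadb-op mariadb-operator/mariadb-operator \
  --namespace databases \
  --version 0.23.1 \
  --reuse-values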

perfectra1n commented 12 months ago

Hi! Yeah, I just upgraded to 0.23.1 as well (screenshot attached).

Now I get the following output when viewing the application from ArgoCD:

Name:               argocd/mariadb-op
Project:            default
Server:             https://kubernetes.default.svc
Namespace:          databases
URL:                https://argocd.domain.network/applications/mariadb-op
Repo:               https://mariadb-operator.github.io/mariadb-operator
Target:             0.23.1
Path:
SyncWindow:         Sync Allowed
Sync Policy:        Automated (Prune)
Sync Status:        Synced to 0.23.1
Health Status:      Degraded

GROUP                         KIND                            NAMESPACE  NAME                                           STATUS     HEALTH    HOOK  MESSAGE
                              Namespace                                  databases                                      Succeeded  Synced          namespace/databases serverside-applied
                              Secret                          databases  mariadb-op-mariadb-operator-webhook-cert       Succeeded  Pruned          pruned
                              Service                         databases  mariadb-op-mariadb-operator-webhook            Synced     Healthy
                              ServiceAccount                  databases  mariadb-op-mariadb-operator                    Synced
                              ServiceAccount                  databases  mariadb-op-mariadb-operator-webhook            Synced
admissionregistration.k8s.io  MutatingWebhookConfiguration               mariadb-op-mariadb-operator-webhook            Synced
admissionregistration.k8s.io  ValidatingWebhookConfiguration             mariadb-op-mariadb-operator-webhook            Synced
apiextensions.k8s.io          CustomResourceDefinition                   backups.mariadb.mmontes.io                     Synced
apiextensions.k8s.io          CustomResourceDefinition                   connections.mariadb.mmontes.io                 Synced
apiextensions.k8s.io          CustomResourceDefinition                   databases.mariadb.mmontes.io                   Synced
apiextensions.k8s.io          CustomResourceDefinition                   grants.mariadb.mmontes.io                      Synced
apiextensions.k8s.io          CustomResourceDefinition                   mariadbs.mariadb.mmontes.io                    Synced
apiextensions.k8s.io          CustomResourceDefinition                   restores.mariadb.mmontes.io                    Synced
apiextensions.k8s.io          CustomResourceDefinition                   sqljobs.mariadb.mmontes.io                     Synced
apiextensions.k8s.io          CustomResourceDefinition                   users.mariadb.mmontes.io                       Synced
apps                          Deployment                      databases  mariadb-op-mariadb-operator                    Synced     Healthy
apps                          Deployment                      databases  mariadb-op-mariadb-operator-webhook            Synced     Healthy
cert-manager.io               Certificate                     databases  mariadb-op-mariadb-operator-webhook-cert       Synced     Degraded
cert-manager.io               Issuer                          databases  mariadb-op-mariadb-operator-selfsigned-issuer  Synced     Healthy
rbac.authorization.k8s.io     ClusterRole                                mariadb-op-mariadb-operator                    Synced
rbac.authorization.k8s.io     ClusterRoleBinding                         mariadb-op-mariadb-operator                    Synced
rbac.authorization.k8s.io     ClusterRoleBinding                         mariadb-op-mariadb-operator:auth-delegator     Synced
rbac.authorization.k8s.io     Role                            databases  mariadb-op-mariadb-operator                    Synced
rbac.authorization.k8s.io     RoleBinding                     databases  mariadb-op-mariadb-operator                    Synced

So it looks like the mariadb-op-mariadb-operator-webhook-cert is in some kind of a weird state now; I'll keep trying to debug...

Could it possibly be because the resource with the same name was pruned above? I'm assuming not, since it's a different resource type...

perfectra1n commented 12 months ago

I had to delete the cert mariadb-op-mariadb-operator-webhook-cert, which was in the Degraded state, and it looks like it's recreated now:

Name:               argocd/mariadb-op
Project:            default
Server:             https://kubernetes.default.svc
Namespace:          databases
URL:                https://argocd.domain.network/applications/mariadb-op
Repo:               https://mariadb-operator.github.io/mariadb-operator
Target:             0.23.1
Path:
SyncWindow:         Sync Allowed
Sync Policy:        Automated (Prune)
Sync Status:        OutOfSync from 0.23.1
Health Status:      Healthy

GROUP                         KIND                            NAMESPACE  NAME                                           STATUS     HEALTH   HOOK  MESSAGE
                              Namespace                                  databases                                      Succeeded  Synced         namespace/databases serverside-applied
                              Secret                          databases  mariadb-op-mariadb-operator-webhook-cert       OutOfSync                 pruned
                              Service                         databases  mariadb-op-mariadb-operator-webhook            Synced     Healthy
                              ServiceAccount                  databases  mariadb-op-mariadb-operator                    Synced
                              ServiceAccount                  databases  mariadb-op-mariadb-operator-webhook            Synced
admissionregistration.k8s.io  MutatingWebhookConfiguration               mariadb-op-mariadb-operator-webhook            Synced
admissionregistration.k8s.io  ValidatingWebhookConfiguration             mariadb-op-mariadb-operator-webhook            Synced
apiextensions.k8s.io          CustomResourceDefinition                   backups.mariadb.mmontes.io                     Synced
apiextensions.k8s.io          CustomResourceDefinition                   connections.mariadb.mmontes.io                 Synced
apiextensions.k8s.io          CustomResourceDefinition                   databases.mariadb.mmontes.io                   Synced
apiextensions.k8s.io          CustomResourceDefinition                   grants.mariadb.mmontes.io                      Synced
apiextensions.k8s.io          CustomResourceDefinition                   mariadbs.mariadb.mmontes.io                    Synced
apiextensions.k8s.io          CustomResourceDefinition                   restores.mariadb.mmontes.io                    Synced
apiextensions.k8s.io          CustomResourceDefinition                   sqljobs.mariadb.mmontes.io                     Synced
apiextensions.k8s.io          CustomResourceDefinition                   users.mariadb.mmontes.io                       Synced
apps                          Deployment                      databases  mariadb-op-mariadb-operator                    Synced     Healthy
apps                          Deployment                      databases  mariadb-op-mariadb-operator-webhook            Synced     Healthy
cert-manager.io               Certificate                     databases  mariadb-op-mariadb-operator-webhook-cert       Synced     Healthy
cert-manager.io               Issuer                          databases  mariadb-op-mariadb-operator-selfsigned-issuer  Synced     Healthy
rbac.authorization.k8s.io     ClusterRole                                mariadb-op-mariadb-operator                    Synced
rbac.authorization.k8s.io     ClusterRoleBinding                         mariadb-op-mariadb-operator                    Synced
rbac.authorization.k8s.io     ClusterRoleBinding                         mariadb-op-mariadb-operator:auth-delegator     Synced
rbac.authorization.k8s.io     Role                            databases  mariadb-op-mariadb-operator                    Synced
rbac.authorization.k8s.io     RoleBinding                     databases  mariadb-op-mariadb-operator                    Synced

perfectra1n commented 12 months ago

But then the same error reoccurs:


one or more objects failed to apply, reason: Internal error occurred: failed calling webhook "mmariadb.kb.io": failed to call webhook: Post "https://mariadb-op-mariadb-operator-webhook.databases.svc:443/mutate-mariadb-mmontes-io-v1alpha1-mariadb?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "x509: invalid signature: parent certificate cannot sign this kind of certificate" while trying to verify candidate authority certificate "mariadb-op-mariadb-operator-webhook.databases.svc").


Here are the values that I'm using for the operator's deployment:

  clusterName: "newcluster.local"
  ha:
    enabled: true
  webhook:
    cert:
      certManager:
        enabled: true

And here's the Database:

apiVersion: mariadb.mmontes.io/v1alpha1
kind: MariaDB
metadata:
  name: mariadb
  namespace: databases
  annotations:
    argocd.argoproj.io/compare-options: IgnoreExtraneous
    argocd.argoproj.io/sync-options: Prune=false
spec:
  # Error writing Galera config: open /etc/mysql/mariadb.conf.d/0-galera.cnf: permission denied
  podSecurityContext:
    runAsUser: 0

  rootPasswordSecretKeyRef:
    name: mariadb-creds
    key: root-password

  database: mariadb
  username: mainuser
  passwordSecretKeyRef:
    name: mariadb-creds
    key: password

  image: mariadb:11.0.3

  port: 3306

  replicas: 5

  galera:
    enabled: true
    primary:
      podIndex: 0
      automaticFailover: true
    sst: mariabackup
    replicaThreads: 1
    agent:
      image: ghcr.io/mariadb-operator/agent:v0.0.3
      port: 5555
      kubernetesAuth:
        enabled: true
      gracefulShutdownTimeout: 5s
    recovery:
      enabled: true
      clusterHealthyTimeout: 3m0s
      clusterBootstrapTimeout: 10m0s
      podRecoveryTimeout: 5m0s
      podSyncTimeout: 5m0s
    initContainer:
      image: ghcr.io/mariadb-operator/init:v0.0.6
    volumeClaimTemplate:
      resources:
        requests:
          storage: 300Mi
      accessModes:
        - ReadWriteOnce

  service:
    type: LoadBalancer
    annotations:
      metallb.universe.tf/ip-allocated-from-pool: first-pool
      metallb.universe.tf/loadBalancerIPs: 10.11.0.30
  connection:
    secretName: mariadb-galera-conn
    secretTemplate:
      key: dsn

That error occurs when I update spec.image within the above MariaDB resource to a newer version.
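
For context, the update that triggers the webhook call can be reproduced with a simple patch like the following; the target image tag is purely illustrative:

# Patching spec.image sends the modified MariaDB object through the mutating/validating
# webhooks, which is where the x509 error surfaces.
kubectl patch mariadbs.mariadb.mmontes.io mariadb -n databases \
  --type merge \
  -p '{"spec": {"image": "mariadb:11.2.2"}}'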

mmontes11 commented 12 months ago

Hey there! Thanks for reporting this with so many details.

I've managed to install the 0.23.1 chart with the same values as you and to successfully apply a MariaDB resource afterwards, which means that the webhook is responding correctly.

Judging by the x509 error you reported, it seems like the CA that signed the new certificates is unknown and not trusted by the webhook. This CA is injected by cert-manager into the ValidatingWebhookConfiguration and MutatingWebhookConfiguration objects, which might be in a weird intermediate state. Could you try deleting them and resyncing your ArgoCD so we get fresh new ones?
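
A possible way to do that from the CLI, assuming the resource names shown above (both webhook configurations are cluster-scoped, so no namespace is needed):

# Delete the stale webhook configurations; they should be recreated with a freshly
# injected CA bundle on the next sync.
kubectl delete validatingwebhookconfiguration mariadb-op-mariadb-operator-webhook
kubectl delete mutatingwebhookconfiguration mariadb-op-mariadb-operator-webhook
# Then trigger a resync of the ArgoCD application (or do it from the UI).
argocd app sync mariadb-op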

perfectra1n commented 12 months ago

Thanks for reviewing my deluge of information! I always try to report too much rather than too little.

I went ahead and deleted the ValidatingWebhookConfiguration and MutatingWebhookConfiguration resources managed by mariadb-operator.

Interestingly enough, it appears as though the Secret named mariadb-op-mariadb-operator-webhook-cert is just being created over and over again, even though I'm using the Certificate resource named mariadb-op-mariadb-operator-webhook-cert with the new chart.

perfectra1n commented 12 months ago

I see that you have:

{{- if not .Values.webhook.cert.certManager.enabled }}

within webhook-secret.yaml, so I'm surprised it's still being recreated...
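
One way to check whether that conditional behaves as expected is to render the chart locally with the same flag and look for the Secret; a rough sketch, assuming the chart repo is already added and using the release name and version from this thread:

# Render the chart with cert-manager enabled and check whether Helm still emits the webhook Secret.
helm template mariadb-op mariadb-operator/mariadb-operator \
  --namespace databases \
  --version 0.23.1 \
  --set webhook.cert.certManager.enabled=true \
  | grep -B 2 -A 6 'kind: Secret'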

perfectra1n commented 12 months ago

It's just spamming CertificateRequests too 😬 (screenshot attached)

I see the same logs as #267 too, from the webhook pod...

{"level":"info","ts":1701813745.8802757,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701813808.9075487,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701813808.9087875,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701813897.8882475,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701813897.8895173,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701813982.8825915,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701813982.8852727,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701814061.2421184,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701814061.2433603,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}

perfectra1n commented 12 months ago

I'm seriously trash at Golang, so please do take what I say with a grain of salt...

Is it possible that during reconcile, the controller is "erroring" as it's waiting for a cert from certmanager (couldn't find the certmanager code, my fault lol) and then creating the key here?

I'm really not sure why it's just creating CertificateRequest after CertificateRequest, instead of waiting for certmanager to pick them up?


perfectra1n commented 12 months ago

Webhook logs:

{"level":"info","ts":1701814552.278027,"logger":"setup","msg":"Starting manager"}
{"level":"info","ts":1701814552.2785738,"logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":1701814552.2787123,"logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":8080","secure":false}
{"level":"info","ts":1701814552.2788506,"msg":"starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1701814552.2788734,"logger":"controller-runtime.webhook","msg":"Starting webhook server"}
{"level":"info","ts":1701814552.2794907,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":1701814552.2795625,"logger":"controller-runtime.webhook","msg":"Serving webhook server","host":"","port":10250}
{"level":"info","ts":1701814552.2797909,"logger":"controller-runtime.certwatcher","msg":"Starting certificate watcher"}
{"level":"debug","ts":1701814554.1360002,"logger":"controller-runtime.certwatcher","msg":"certificate event","event":"REMOVE        \"/tmp/k8s-webhook-server/serving-certs/tls.key\""}
{"level":"info","ts":1701814554.1377637,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"debug","ts":1701814554.1379616,"logger":"controller-runtime.certwatcher","msg":"certificate event","event":"REMOVE        \"/tmp/k8s-webhook-server/serving-certs/tls.crt\""}
{"level":"info","ts":1701814554.1401744,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"debug","ts":1701814576.0212483,"logger":"controller-runtime.certwatcher","msg":"certificate event","event":"REMOVE        \"/tmp/k8s-webhook-server/serving-certs/tls.key\""}
{"level":"info","ts":1701814576.023399,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"debug","ts":1701814576.0234637,"logger":"controller-runtime.certwatcher","msg":"certificate event","event":"REMOVE        \"/tmp/k8s-webhook-server/serving-certs/tls.crt\""}
{"level":"info","ts":1701814576.0247002,"logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"debug","ts":1701814663.880187,"logger":"controller-runtime.certwatcher","msg":"certificate event","event":"REMOVE        \"/tmp/k8s-webhook-server/serving-certs/tls.key\""}

Operator logs:

{"level":"info","ts":1701814553.3222492,"logger":"setup","msg":"Starting manager"}
{"level":"info","ts":1701814553.3223999,"logger":"controller-runtime.metrics","msg":"Starting metrics server"}
{"level":"info","ts":1701814553.32256,"msg":"starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"info","ts":1701814553.3225813,"logger":"controller-runtime.metrics","msg":"Serving metrics server","bindAddress":":8080","secure":false}
I1205 22:15:53.424785       1 leaderelection.go:250] attempting to acquire leader lease databases/mariadb-operator.mmontes.io...
I1205 22:16:15.417191       1 leaderelection.go:260] successfully acquired lease databases/mariadb-operator.mmontes.io
{"level":"debug","ts":1701814575.417311,"logger":"events","msg":"mariadb-op-mariadb-operator-5867955d4b-gzqhm_82d870e6-e730-478c-80b5-23984d7c93fc became leader","type":"Normal","object":{"kind":"Lease","namespace":"databases","name":"mariadb-operator.mmontes.io","uid":"c1aeccc4-2f0c-45f9-85b2-9d223c665e00","apiVersion":"coordination.k8s.io/v1","resourceVersion":"255861041"},"reason":"LeaderElection"}
{"level":"info","ts":1701814575.4184875,"msg":"Starting EventSource","controller":"restore","controllerGroup":"mariadb.mmontes.io","controllerKind":"Restore","source":"kind source: *v1alpha1.Restore"}
{"level":"info","ts":1701814575.4196618,"msg":"Starting EventSource","controller":"restore","controllerGroup":"mariadb.mmontes.io","controllerKind":"Restore","source":"kind source: *v1.Job"}
{"level":"info","ts":1701814575.4197085,"msg":"Starting Controller","controller":"restore","controllerGroup":"mariadb.mmontes.io","controllerKind":"Restore"}
{"level":"info","ts":1701814575.4225392,"msg":"Starting EventSource","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","source":"kind source: *v1alpha1.Connection"}
{"level":"info","ts":1701814575.4229555,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1alpha1.MariaDB"}
{"level":"info","ts":1701814575.4231176,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1alpha1.Connection"}
{"level":"info","ts":1701814575.4232342,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1alpha1.Restore"}
{"level":"info","ts":1701814575.4232583,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.ConfigMap"}
{"level":"info","ts":1701814575.423286,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.Service"}
{"level":"info","ts":1701814575.4233344,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.Secret"}
{"level":"info","ts":1701814575.4233549,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.Event"}
{"level":"info","ts":1701814575.4233735,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.ServiceAccount"}
{"level":"info","ts":1701814575.4233916,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.StatefulSet"}
{"level":"info","ts":1701814575.423419,"msg":"Starting EventSource","controller":"backup","controllerGroup":"mariadb.mmontes.io","controllerKind":"Backup","source":"kind source: *v1alpha1.Backup"}
{"level":"info","ts":1701814575.4235032,"msg":"Starting EventSource","controller":"backup","controllerGroup":"mariadb.mmontes.io","controllerKind":"Backup","source":"kind source: *v1.CronJob"}
{"level":"info","ts":1701814575.4235187,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.PodDisruptionBudget"}
{"level":"info","ts":1701814575.4235673,"msg":"Starting EventSource","controller":"backup","controllerGroup":"mariadb.mmontes.io","controllerKind":"Backup","source":"kind source: *v1.Job"}
{"level":"info","ts":1701814575.4242637,"msg":"Starting Controller","controller":"backup","controllerGroup":"mariadb.mmontes.io","controllerKind":"Backup"}
{"level":"info","ts":1701814575.425194,"msg":"Starting EventSource","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","source":"kind source: *v1.StatefulSet"}
{"level":"info","ts":1701814575.425274,"msg":"Starting Controller","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet"}
{"level":"info","ts":1701814575.4252162,"msg":"Starting EventSource","controller":"sqljob","controllerGroup":"mariadb.mmontes.io","controllerKind":"SqlJob","source":"kind source: *v1alpha1.SqlJob"}
{"level":"info","ts":1701814575.4253242,"msg":"Starting EventSource","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","source":"kind source: *v1.Secret"}
{"level":"info","ts":1701814575.4235914,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.Role"}
{"level":"info","ts":1701814575.4254615,"msg":"Starting EventSource","controller":"sqljob","controllerGroup":"mariadb.mmontes.io","controllerKind":"SqlJob","source":"kind source: *v1.ConfigMap"}
{"level":"info","ts":1701814575.4263847,"msg":"Starting Controller","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection"}
{"level":"info","ts":1701814575.4264429,"msg":"Starting EventSource","controller":"pod","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"}
{"level":"info","ts":1701814575.4265463,"msg":"Starting EventSource","controller":"pod","controllerGroup":"","controllerKind":"Pod","source":"kind source: *v1.Pod"}
{"level":"info","ts":1701814575.4265692,"msg":"Starting EventSource","controller":"grant","controllerGroup":"mariadb.mmontes.io","controllerKind":"Grant","source":"kind source: *v1alpha1.Grant"}
{"level":"info","ts":1701814575.4278758,"msg":"Starting Controller","controller":"pod","controllerGroup":"","controllerKind":"Pod"}
{"level":"info","ts":1701814575.4279294,"msg":"Starting Controller","controller":"pod","controllerGroup":"","controllerKind":"Pod"}
{"level":"info","ts":1701814575.4235246,"msg":"Starting EventSource","controller":"user","controllerGroup":"mariadb.mmontes.io","controllerKind":"User","source":"kind source: *v1alpha1.User"}
{"level":"info","ts":1701814575.4254866,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.RoleBinding"}
{"level":"info","ts":1701814575.4281435,"msg":"Starting EventSource","controller":"grant","controllerGroup":"mariadb.mmontes.io","controllerKind":"Grant","source":"kind source: *v1alpha1.User"}
{"level":"info","ts":1701814575.428208,"msg":"Starting Controller","controller":"grant","controllerGroup":"mariadb.mmontes.io","controllerKind":"Grant"}
{"level":"info","ts":1701814575.4281247,"msg":"Starting Controller","controller":"user","controllerGroup":"mariadb.mmontes.io","controllerKind":"User"}
{"level":"info","ts":1701814575.4284348,"msg":"Starting EventSource","controller":"database","controllerGroup":"mariadb.mmontes.io","controllerKind":"Database","source":"kind source: *v1alpha1.Database"}
{"level":"info","ts":1701814575.4285102,"msg":"Starting EventSource","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","source":"kind source: *v1.ClusterRoleBinding"}
{"level":"info","ts":1701814575.4286468,"msg":"Starting Controller","controller":"database","controllerGroup":"mariadb.mmontes.io","controllerKind":"Database"}
{"level":"info","ts":1701814575.4286635,"msg":"Starting Controller","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB"}
{"level":"info","ts":1701814575.4288566,"msg":"Starting EventSource","controller":"sqljob","controllerGroup":"mariadb.mmontes.io","controllerKind":"SqlJob","source":"kind source: *v1.CronJob"}
{"level":"info","ts":1701814575.429329,"msg":"Starting EventSource","controller":"sqljob","controllerGroup":"mariadb.mmontes.io","controllerKind":"SqlJob","source":"kind source: *v1.Job"}
{"level":"info","ts":1701814575.4293923,"msg":"Starting Controller","controller":"sqljob","controllerGroup":"mariadb.mmontes.io","controllerKind":"SqlJob"}
{"level":"info","ts":1701814575.9881175,"msg":"Starting workers","controller":"grant","controllerGroup":"mariadb.mmontes.io","controllerKind":"Grant","worker count":1}
{"level":"info","ts":1701814575.9881206,"msg":"Starting workers","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","worker count":1}
{"level":"info","ts":1701814575.9891937,"logger":"galera.health","msg":"Checking Galera cluster health","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","StatefulSet":{"name":"mariadb","namespace":"databases"},"namespace":"databases","name":"mariadb","reconcileID":"22c41f16-69c6-4b5a-ae2c-70561ca18d79"}
{"level":"debug","ts":1701814575.989259,"logger":"galera.health","msg":"StatefulSet ready replicas","controller":"statefulset","controllerGroup":"apps","controllerKind":"StatefulSet","StatefulSet":{"name":"mariadb","namespace":"databases"},"namespace":"databases","name":"mariadb","reconcileID":"22c41f16-69c6-4b5a-ae2c-70561ca18d79","replicas":5}
{"level":"info","ts":1701814576.0064301,"msg":"Starting workers","controller":"user","controllerGroup":"mariadb.mmontes.io","controllerKind":"User","worker count":1}
{"level":"info","ts":1701814576.0066135,"msg":"Starting workers","controller":"database","controllerGroup":"mariadb.mmontes.io","controllerKind":"Database","worker count":1}
{"level":"info","ts":1701814576.0069342,"msg":"Starting workers","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","worker count":1}
{"level":"info","ts":1701814576.0067024,"msg":"Starting workers","controller":"pod","controllerGroup":"","controllerKind":"Pod","worker count":1}
{"level":"info","ts":1701814576.0389016,"msg":"Starting workers","controller":"restore","controllerGroup":"mariadb.mmontes.io","controllerKind":"Restore","worker count":1}
{"level":"info","ts":1701814576.0390863,"msg":"Starting workers","controller":"sqljob","controllerGroup":"mariadb.mmontes.io","controllerKind":"SqlJob","worker count":1}
{"level":"info","ts":1701814576.0392168,"msg":"Starting workers","controller":"pod","controllerGroup":"","controllerKind":"Pod","worker count":1}
{"level":"debug","ts":1701814576.0398002,"msg":"Reconciling Pod in Ready state","controller":"pod","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"mariadb-3","namespace":"databases"},"namespace":"databases","name":"mariadb-3","reconcileID":"f3ac37e3-4a1f-49f0-a255-93e1f8391ba9","pod":"mariadb-3"}
{"level":"debug","ts":1701814576.0401933,"msg":"Reconciling Pod in Ready state","controller":"pod","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"mariadb-2","namespace":"databases"},"namespace":"databases","name":"mariadb-2","reconcileID":"4b3819b9-568a-44e1-b1d1-a8841d807906","pod":"mariadb-2"}
{"level":"debug","ts":1701814576.0404835,"msg":"Reconciling Pod in Ready state","controller":"pod","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"mariadb-1","namespace":"databases"},"namespace":"databases","name":"mariadb-1","reconcileID":"9b560c72-fb59-4594-a7d8-0113f7d3e1af","pod":"mariadb-1"}
{"level":"debug","ts":1701814576.0407934,"msg":"Reconciling Pod in Ready state","controller":"pod","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"mariadb-0","namespace":"databases"},"namespace":"databases","name":"mariadb-0","reconcileID":"6270af06-cc36-4112-806a-0230a4d6ded7","pod":"mariadb-0"}
{"level":"debug","ts":1701814576.0411425,"msg":"Reconciling Pod in Ready state","controller":"pod","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"mariadb-4","namespace":"databases"},"namespace":"databases","name":"mariadb-4","reconcileID":"7d36688b-8e28-4497-8d1c-c845432336e3","pod":"mariadb-4"}
{"level":"info","ts":1701814576.05047,"msg":"Starting workers","controller":"backup","controllerGroup":"mariadb.mmontes.io","controllerKind":"Backup","worker count":1}
{"level":"info","ts":1701814576.0505714,"msg":"Starting workers","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","worker count":1}
{"level":"debug","ts":1701814576.1094224,"msg":"Checking connection health","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","Connection":{"name":"mariadb-primary","namespace":"databases"},"namespace":"databases","name":"mariadb-primary","reconcileID":"56d6a029-d83b-41e8-a5d1-0824b061f9d1"}
{"level":"debug","ts":1701814576.3680446,"msg":"Checking connection health","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","Connection":{"name":"mariadb-secondary","namespace":"databases"},"namespace":"databases","name":"mariadb-secondary","reconcileID":"31f41e37-6244-4822-a4d1-42c4e49ff995"}
{"level":"debug","ts":1701814576.453957,"msg":"Checking connection health","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","Connection":{"name":"mariadb","namespace":"databases"},"namespace":"databases","name":"mariadb","reconcileID":"e631959a-cc62-40fe-b73f-b4d7547b20c4"}

perfectra1n commented 12 months ago

Well I was able to stop it from erroring constantly by removing the certManager values, so it's just:

clusterName: "newcluster.local"
ha:
  enabled: true

However, after changing the spec.image from 11.0.3 to 11.2.2, I just get the following error over and over again as it tries to do a rolling update...

2023-12-05 23:53:10+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.2+maria~ubu2204 started.
2023-12-05 23:53:11+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2023-12-05 23:53:11+00:00 [Note] [Entrypoint]: Entrypoint script for MariaDB Server 1:11.2.2+maria~ubu2204 started.
2023-12-05 23:53:11+00:00 [Note] [Entrypoint]: MariaDB upgrade information missing, assuming required
2023-12-05 23:53:11+00:00 [Note] [Entrypoint]: MariaDB upgrade (mariadb-upgrade) required, but skipped due to $MARIADB_AUTO_UPGRADE setting
2023-12-05 23:53:11 0 [Note] Starting MariaDB 11.2.2-MariaDB-1:11.2.2+maria~ubu2204 source revision 929532a9426d085111c24c63de9c23cc54382259 as process 1
2023-12-05 23:53:11 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 00000000-0000-0000-0000-000000000000:-1
2023-12-05 23:53:11 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2023-12-05 23:53:11 0 [Note] WSREP: wsrep_load(): Galera 26.4.16(r7dce5149) by Codership Oy <info@codership.com> loaded successfully.
2023-12-05 23:53:11 0 [Note] WSREP: Initializing allowlist service v1
2023-12-05 23:53:11 0 [Note] WSREP: Initializing event service v1
2023-12-05 23:53:11 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
2023-12-05 23:53:11 0 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1, safe_to_bootstrap: 0
2023-12-05 23:53:11 0 [Note] WSREP: GCache DEBUG: opened preamble:
Version: 2
UUID: 82507f11-767c-11ee-b214-532d45cfffd9
Seqno: -1 - -1
Offset: -1
Synced: 0
2023-12-05 23:53:11 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 82507f11-767c-11ee-b214-532d45cfffd9, offset: -1
2023-12-05 23:53:11 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2023-12-05 23:53:11 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2023-12-05 23:53:11 0 [Note] WSREP: Recovering GCache ring buffer: Recovery failed, need to do full reset.
2023-12-05 23:53:11 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = mariadb-4.mariadb-internal.databases.svc.newcluster.local; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0
2023-12-05 23:53:11 0 [Note] WSREP: Start replication
2023-12-05 23:53:11 0 [Note] WSREP: Connecting with bootstrap option: 0
2023-12-05 23:53:11 0 [Note] WSREP: Setting GCS initial position to 00000000-0000-0000-0000-000000000000:-1
2023-12-05 23:53:11 0 [Note] WSREP: protonet asio version 0
2023-12-05 23:53:11 0 [Note] WSREP: Using CRC-32C for message checksums.
2023-12-05 23:53:11 0 [Note] WSREP: backend: asio
2023-12-05 23:53:11 0 [Note] WSREP: gcomm thread scheduling priority set to other:0
2023-12-05 23:53:11 0 [Note] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
2023-12-05 23:53:11 0 [Note] WSREP: restore pc from disk failed
2023-12-05 23:53:11 0 [Note] WSREP: GMCast version 0
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2023-12-05 23:53:11 0 [Note] WSREP: EVS version 1
2023-12-05 23:53:11 0 [Note] WSREP: gcomm: connecting to group 'mariadb-operator', peer 'mariadb-0.mariadb-internal.databases.svc.newcluster.local:,mariadb-1.mariadb-internal.databases.svc.newcluster.local:,mariadb-2.mariadb-internal.databases.svc.newcluster.local:,mariadb-3.mariadb-internal.databases.svc.newcluster.local:,mariadb-4.mariadb-internal.databases.svc.newcluster.local:'
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') Found matching local endpoint for a connection, blacklisting address tcp://10.233.101.135:4567
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') connection established to c017ceee-af4a tcp://10.233.69.15:4567
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') connection established to 37718ade-864d tcp://10.233.94.155:4567
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') connection established to ac774553-8db1 tcp://10.233.91.185:4567
2023-12-05 23:53:11 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') connection established to a0738f2d-bcd9 tcp://10.233.91.190:4567
2023-12-05 23:53:13 0 [Note] WSREP: EVS version upgrade 0 -> 1
2023-12-05 23:53:13 0 [Note] WSREP: declaring 37718ade-864d at tcp://10.233.94.155:4567 stable
2023-12-05 23:53:13 0 [Note] WSREP: declaring a0738f2d-bcd9 at tcp://10.233.91.190:4567 stable
2023-12-05 23:53:13 0 [Note] WSREP: declaring ac774553-8db1 at tcp://10.233.91.185:4567 stable
2023-12-05 23:53:13 0 [Note] WSREP: declaring c017ceee-af4a at tcp://10.233.69.15:4567 stable
2023-12-05 23:53:13 0 [Note] WSREP: PC protocol upgrade 0 -> 1
2023-12-05 23:53:14 0 [Note] WSREP: Node 37718ade-864d state prim
2023-12-05 23:53:14 0 [Note] WSREP: view(view_id(PRIM,37718ade-864d,383) memb {
    37718ade-864d,0
    72a61da0-93c4,0
    a0738f2d-bcd9,0
    ac774553-8db1,0
    c017ceee-af4a,0
} joined {
} left {
} partitioned {
})
2023-12-05 23:53:14 0 [Note] WSREP: save pc into disk
2023-12-05 23:53:14 0 [Note] WSREP: gcomm: connected
2023-12-05 23:53:14 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2023-12-05 23:53:14 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2023-12-05 23:53:14 0 [Note] WSREP: Opened channel 'mariadb-operator'
2023-12-05 23:53:14 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 5
2023-12-05 23:53:14 0 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2023-12-05 23:53:14 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 742e9eca-93c9-11ee-9732-16a8b18858ee
2023-12-05 23:53:14 0 [Note] WSREP: STATE EXCHANGE: got state msg: 742e9eca-93c9-11ee-9732-16a8b18858ee from 0 (mariadb-3)
2023-12-05 23:53:14 0 [Note] WSREP: Initializing config service v1
2023-12-05 23:53:14 0 [Note] WSREP: STATE EXCHANGE: got state msg: 742e9eca-93c9-11ee-9732-16a8b18858ee from 2 (mariadb-1)
2023-12-05 23:53:14 0 [Note] WSREP: STATE EXCHANGE: got state msg: 742e9eca-93c9-11ee-9732-16a8b18858ee from 3 (mariadb-0)
2023-12-05 23:53:14 0 [Note] WSREP: STATE EXCHANGE: got state msg: 742e9eca-93c9-11ee-9732-16a8b18858ee from 4 (mariadb-2)
2023-12-05 23:53:14 1 [Note] WSREP: Starting rollbacker thread 1
2023-12-05 23:53:14 2 [Note] WSREP: Starting applier thread 2
2023-12-05 23:53:14 0 [Note] WSREP: Deinitializing config service v1
2023-12-05 23:53:14 0 [Note] WSREP: STATE EXCHANGE: got state msg: 742e9eca-93c9-11ee-9732-16a8b18858ee from 1 (mariadb-4)
2023-12-05 23:53:14 0 [Note] WSREP: Quorum results:
    version    = 6,
    component  = PRIMARY,
    conf_id    = 358,
    members    = 4/5 (joined/total),
    act_id     = 13314,
    last_appl. = 13202,
    protocols  = 2/10/4 (gcs/repl/appl),
    vote policy= 0,
    group UUID = 82507f11-767c-11ee-b214-532d45cfffd9
2023-12-05 23:53:14 0 [Note] WSREP: Flow-control interval: [36, 36]
2023-12-05 23:53:14 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 13315)
2023-12-05 23:53:14 2 [Note] WSREP: ####### processing CC 13315, local, ordered
2023-12-05 23:53:14 2 [Note] WSREP: Process first view: 82507f11-767c-11ee-b214-532d45cfffd9 my uuid: 72a61da0-93c9-11ee-93c4-479fd91a36f4
2023-12-05 23:53:14 2 [Note] WSREP: Server mariadb-4 connected to cluster at position 82507f11-767c-11ee-b214-532d45cfffd9:13315 with ID 72a61da0-93c9-11ee-93c4-479fd91a36f4
2023-12-05 23:53:14 2 [Note] WSREP: Server status change disconnected -> connected
2023-12-05 23:53:14 2 [Note] WSREP: ####### My UUID: 72a61da0-93c9-11ee-93c4-479fd91a36f4
2023-12-05 23:53:14 2 [Note] WSREP: Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes
2023-12-05 23:53:14 0 [Note] WSREP: Service thread queue flushed.
2023-12-05 23:53:14 2 [Note] WSREP: ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1
2023-12-05 23:53:14 2 [Note] WSREP: State transfer required:
    Group state: 82507f11-767c-11ee-b214-532d45cfffd9:13315
    Local state: 00000000-0000-0000-0000-000000000000:-1
2023-12-05 23:53:14 2 [Note] WSREP: Server status change connected -> joiner
2023-12-05 23:53:14 0 [Note] WSREP: Joiner monitor thread started to monitor
2023-12-05 23:53:14 0 [Note] WSREP: Running: 'wsrep_sst_mariabackup --role 'joiner' --address 'mariadb-4.mariadb-internal.databases.svc.newcluster.local' --datadir '/var/lib/mysql/' --parent 1 --progress 0'
WSREP_SST: [INFO] mariabackup SST started on joiner (20231205 23:53:14.665)
WSREP_SST: [INFO] SSL configuration: CA='', CAPATH='', CERT='', KEY='', MODE='DISABLED', encrypt='0' (20231205 23:53:14.721)
WSREP_SST: [INFO] Progress reporting tool pv not found in path: /usr//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/sbin:/usr/bin:/sbin:/bin (20231205 23:53:14.856)
WSREP_SST: [INFO] Disabling all progress/rate-limiting (20231205 23:53:14.859)
WSREP_SST: [INFO] Streaming with mbstream (20231205 23:53:14.886)
WSREP_SST: [INFO] Using socat as streamer (20231205 23:53:14.890)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql/sst_in_progress (20231205 23:53:14.895)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:14.960)
2023-12-05 23:53:15 0 [Note] WSREP: (72a61da0-93c4, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:15.972)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:16.983)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:17.994)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:19.006)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:20.016)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:21.027)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:22.038)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:23.050)
WSREP_SST: [INFO] previous SST is not completed, waiting for it to exit (20231205 23:53:24.061)
WSREP_SST: [ERROR] previous SST script still running. (20231205 23:53:24.064)
2023-12-05 23:53:24 0 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_mariabackup --role 'joiner' --address 'mariadb-4.mariadb-internal.databases.svc.newcluster.local' --datadir '/var/lib/mysql/' --parent 1 --progress 0
    Read: '(null)'
2023-12-05 23:53:24 0 [ERROR] WSREP: Process completed with error: wsrep_sst_mariabackup --role 'joiner' --address 'mariadb-4.mariadb-internal.databases.svc.newcluster.local' --datadir '/var/lib/mysql/' --parent 1 --progress 0: 114 (Operation already in progress)
2023-12-05 23:53:24 2 [ERROR] WSREP: Failed to prepare for 'mariabackup' SST. Unrecoverable.
2023-12-05 23:53:24 2 [ERROR] WSREP: SST request callback failed. This is unrecoverable, restart required.
2023-12-05 23:53:24 2 [Note] WSREP: ReplicatorSMM::abort()
2023-12-05 23:53:24 2 [Note] WSREP: Closing send monitor...
2023-12-05 23:53:24 2 [Note] WSREP: Closed send monitor.
2023-12-05 23:53:24 2 [Note] WSREP: gcomm: terminating thread
2023-12-05 23:53:24 2 [Note] WSREP: gcomm: joining thread
2023-12-05 23:53:24 2 [Note] WSREP: gcomm: closing backend
2023-12-05 23:53:24 2 [Note] WSREP: view(view_id(NON_PRIM,37718ade-864d,383) memb {
    72a61da0-93c4,0
} joined {
} left {
} partitioned {
    37718ade-864d,0
    a0738f2d-bcd9,0
    ac774553-8db1,0
    c017ceee-af4a,0
})
2023-12-05 23:53:24 2 [Note] WSREP: PC protocol downgrade 1 -> 0
2023-12-05 23:53:24 2 [Note] WSREP: view((empty))
2023-12-05 23:53:24 2 [Note] WSREP: gcomm: closed
2023-12-05 23:53:24 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2023-12-05 23:53:24 0 [Note] WSREP: Flow-control interval: [16, 16]
2023-12-05 23:53:24 0 [Note] WSREP: Received NON-PRIMARY.
2023-12-05 23:53:24 0 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 13315)
2023-12-05 23:53:24 0 [Note] WSREP: New SELF-LEAVE.
2023-12-05 23:53:24 0 [Note] WSREP: Flow-control interval: [0, 0]
2023-12-05 23:53:24 0 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2023-12-05 23:53:24 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 13315)
2023-12-05 23:53:24 0 [Note] WSREP: RECV thread exiting 0: Success
2023-12-05 23:53:24 2 [Note] WSREP: recv_thread() joined.
2023-12-05 23:53:24 2 [Note] WSREP: Closing replication queue.
2023-12-05 23:53:24 2 [Note] WSREP: Closing slave action queue.
2023-12-05 23:53:24 2 [Note] WSREP: mariadbd: Terminated.
231205 23:53:24 [ERROR] mysqld got signal 11 ;
Sorry, we probably made a mistake, and this is a bug.
Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 11.2.2-MariaDB-1:11.2.2+maria~ubu2204 source revision: 929532a9426d085111c24c63de9c23cc54382259
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=3
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 337017 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f6394000c68
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f63b81f9c68 thread_stack 0x49000
Printing to addr2line failed
mariadbd(my_print_stacktrace+0x32)[0x55ffec5f2032]
mariadbd(handle_fatal_signal+0x478)[0x55ffec0c6158]
/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f63b9740520]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x178)[0x7f63b9726898]
/usr/lib/galera/libgalera_smm.so(+0x157602)[0x7f63b91c2602]
/usr/lib/galera/libgalera_smm.so(+0x700e1)[0x7f63b90db0e1]
/usr/lib/galera/libgalera_smm.so(+0x6cc94)[0x7f63b90d7c94]
/usr/lib/galera/libgalera_smm.so(+0x8b311)[0x7f63b90f6311]
/usr/lib/galera/libgalera_smm.so(+0x604a0)[0x7f63b90cb4a0]
/usr/lib/galera/libgalera_smm.so(+0x48261)[0x7f63b90b3261]
mariadbd(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0x12)[0x55ffec6b0b02]
mariadbd(+0xd7ff31)[0x55ffec383f31]
mariadbd(_Z15start_wsrep_THDPv+0x26b)[0x55ffec371cfb]
mariadbd(+0xcf24c6)[0x55ffec2f64c6]
/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f63b9792ac3]
/lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f63b9824a40]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0): (null)
Connection ID (thread ID): 2
Status: NOT_KILLED
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off,hash_join_cardinality=on,cset_narrowing=off
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
information that should help you find out what is causing the crash.
We think the query pointer is invalid, but we will try to print it anyway.
Query:
Writing a core file...
Working directory at /var/lib/mysql
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    0                    bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            65535                65535                files
Max locked memory         unlimited            unlimited            bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       128442               128442               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
Core pattern: core
Kernel version: Linux version 5.10.0-25-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.191-1 (2023-08-16)

mmontes11 commented 12 months ago

Hey

Is it possible that during reconcile, the controller is "erroring" as it's waiting for a cert from certmanager (couldn't find the certmanager code, my fault lol) and then creating the key here?

Our cert-controller is not deployed if cert-manager is enabled, since they serve the same purpose, so there is no way they can clash.

I'm really not sure why it's just creating CertificateRequest after CertificateRequest, instead of waiting for certmanager to pick them up?

As far as I know, cert-manager tracks attempts to renew a Certificate object in `CertificateRequests`, so what may be happening is that cert-manager considers your certificate outdated or invalid. This is most likely a cert-manager issue with your installation, I would say. Maybe something to report upstream?
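
A quick way to see why cert-manager keeps re-issuing is to look at the Certificate's conditions and events, and at the CertificateRequests it generates; a sketch using the names and namespace from this thread:

# Show the Certificate's status conditions and events explaining why (re)issuance is triggered.
kubectl describe certificate mariadb-op-mariadb-operator-webhook-cert -n databases
# List the CertificateRequests generated so far.
kubectl get certificaterequests -n databases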

However, after changing the spec.image from 11.0.3 to 11.2.2, I just get the following error over and over again as it tries to do a rolling update...

Can we handle the Galera issue separately in another issue? Also, did you have the chance to look at the troubleshooting guide?

perfectra1n commented 12 months ago

Understood, I use cert-manager throughout my environment, so not sure why it doesn’t play well here.

The solution I had regarding the webhook was just to avoid using cert-manager within the Helm release, since it was constantly looping :) so I think we're good to close this issue now! Unless you want to debug the cert-manager issue?

mmontes11 commented 12 months ago

Unless you want to debug the cert-manager issue?

Leave it open, I will try to reproduce it.

One possibility could be that you are in an intermediate state where the Secrets used by the cert-controller and managed by Helm are still in the cluster. By default they are empty, and they are named the same as the ones generated by cert-manager, which might be creating conflicts. Could you confirm whether you see them?
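
A quick check for that, using the resource names shown earlier in the thread:

# List the webhook-related Secrets and inspect whether the suspect one is empty
# or actually contains tls.crt / tls.key / ca.crt data.
kubectl get secrets -n databases | grep webhook
kubectl get secret mariadb-op-mariadb-operator-webhook-cert -n databases -o yaml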

perfectra1n commented 12 months ago

Sure! So with the following values:

clusterName: "newcluster.local"
ha:
  enabled: true
logLevel: DEBUG
webhook:
  cert:
    certManager:
      enabled: true

I see that the deployment is in that "loop" of Secrets vs. Certificates (screenshot attached).

I see the resources shown in the attached screenshot.

The mariadb-op-mariadb-operator-webhook-cert Secret is being recreated over and over again; it has the values shown in the attached screenshot.

mariadb-operator-webhook-ca has the contents shown in the attached screenshot.

While this is going on, the Certificate resource named mariadb-op-mariadb-operator-webhook-cert also now exists (screenshots attached).

Let me know if there's anything else you would like to see!

perfectra1n commented 12 months ago

Interestingly enough, I'm also getting the "errors" shown in the attached screenshot on my cluster.

github-actions[bot] commented 11 months ago

This issue is stale because it has been open 30 days with no activity.

perfectra1n commented 10 months ago

Up to you if you want to keep this open or not, @mmontes11. I had to just nuke this DB (backup, destroy, restore) and start from scratch. After doing so, this issue more or less went away...

DrZoidberg09 commented 10 months ago

Same issue here. After rebooting some nodes, only one of the three Galera nodes ends up running. This is highly concerning to me. It seems that if the webhook is not available during the reboot, things stop working, even if the webhook is available again shortly after.

ShakataGaNai commented 9 months ago

I ran across what I think is the same issue: when enabling webhook.cert.certManager.enabled, it repeatedly attempted to request certs, creating/deleting secrets. Brand new cluster, so I deleted both cert-manager and mariadb-operator, and it kept happening.

Configs (using the app-of-apps model):

cert-manager.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cert-manager
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io

spec:
  project: default
  source:
    repoURL: {{ .Values.spec.source.repoURL }}
    path: src/cert-manager
    targetRevision: {{ .Values.spec.source.targetRevision }}
  destination:
    server: {{ .Values.spec.destination.server }}
    namespace: cert-manager

  syncPolicy:
    syncOptions:
    - CreateNamespace=true

    automated:
      selfHeal: true
      prune: true

mariadb-operator.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: mariadb-operator
  namespace: argocd
  finalizers:
  - resources-finalizer.argocd.argoproj.io

spec:
  project: default
  source:
    chart: mariadb-operator
    repoURL: https://mariadb-operator.github.io/mariadb-operator
    targetRevision: ">=0"
    helm:
      releaseName: mariadb-operator
      parameters:
      - name: "webhook.cert.certManager.enabled"
        value: "true"
  destination:
    server: {{ .Values.spec.destination.server }}
    namespace: mariadb-operator

  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    - ServerSideApply=true
    # https://github.com/argoproj/argo-cd/issues/820#issuecomment-1246960210
    # May also just need to force replace the failed-to-launch.

    automated:
      selfHeal: true
      prune: true

Since there doesn't seem to be a conclusive answer as to what was going on, I turned off the use of cert-manager.

Versions:

mmontes11 commented 9 months ago

Hey @ShakataGaNai! Thanks for reporting. I still haven't had much time to reproduce this, sorry. So far I haven't managed to reproduce it with Flux.

I have 3 possible investigation paths:

  1. A regression in the Helm release renders some resource that clashes with cert-manager and makes it think that the cert needs to be renewed. For example, we are creating this Secret to be used with mariadb-operator's cert-controller, but it shouldn't be rendered with cert-manager enabled.
  2. Something related to ArgoCD, which unfortunately I don't have a lot of knowledge about. We have a Helm chart; how and when does ArgoCD render it? Your ArgoCD experience may be helpful here.
  3. Something related to cert-manager or to a cert-manager issuer which for some reason does not inject the ca.crt in the secret. Therefore the webhook can't trust the connection (a quick check for this is sketched below).
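
For point 3, a rough way to compare the CA bundle the API server uses when calling the webhook (injected into the webhook configurations) with the CA stored next to the serving certificate; the resource names are the ones used earlier in this thread and may differ per install:

# CA bundle injected into the webhook configuration.
kubectl get validatingwebhookconfiguration mariadb-op-mariadb-operator-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d \
  | openssl x509 -noout -subject -enddate
# CA published by cert-manager alongside the serving certificate.
kubectl get secret mariadb-op-mariadb-operator-webhook-cert -n databases \
  -o jsonpath='{.data.ca\.crt}' | base64 -d \
  | openssl x509 -noout -subject -enddate
# If the two subjects/dates differ, the injected bundle is stale.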

Happy to hear your thoughts!

mmontes11 commented 9 months ago

Regarding point 2 from my previous comment, this might be related:

Would be great to hear from an ArgoCD expert.

jescarri commented 9 months ago

This is related to the Helm chart bug https://github.com/mariadb-operator/mariadb-operator/issues/375. @ShakataGaNai, you can remove the prune: true syncPolicy as a temporary workaround; the caveat is that ArgoCD will show the app as out of sync, but the webhook will work.
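
As a rough sketch, pruning can also be disabled in place by patching the Application; the application name and namespace here match the manifests posted above and may need adjusting:

# Turn off automated pruning so the sync no longer deletes the webhook cert Secret.
kubectl patch applications.argoproj.io mariadb-operator -n argocd \
  --type merge \
  -p '{"spec": {"syncPolicy": {"automated": {"prune": false}}}}'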

mmontes11 commented 9 months ago

We have just merged @jescarri's PR with a fix for this:

Closing! This will be released in v0.0.26 this week; feel free to reopen. Please consider @jescarri's advice about ArgoCD: https://github.com/mariadb-operator/mariadb-operator/issues/285#issuecomment-1939536676

perfectra1n commented 7 months ago

I also had to add the following to the values of my Helm deployment when using certManager:

webhook:
  cert:
    secretLabels:
      key1: value1

This was to stop it from complaining about a null value not being allowed for spec.SecretTemplate. This commit led me to that resolution.

mmontes11 commented 7 months ago

To have it stop complaining about a null value not being allowed for spec.SecretTemplate. This commit led me to that resolution.

Would you mind opening a separate issue for that with the logs attached?