Hey there @tflabs-nl ! Thanks for reporting this.
The operator should cover this situation by performing the cluster recovery; let's dig a bit deeper into this.
For context, some users have experienced similar problems. Sometimes Galera requires extra configuration depending on the cluster, please take a look
2023-08-02 7:09:35 0 [ERROR] WSREP: ./gcs/src/gcs.cpp:gcs_open():1669: Failed to open channel 'mariadb-operator' at 'gcomm://mariadb-galera-dev-0.mariadb-galera-dev-db-operator.svc.cluster.local,mariadb-galera-dev-1.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local,mariadb-galera-dev-2.mariadb-galera-dev-internal.marsvc.cluster.local': -110 (Connection timed out)
To begin with, this connection timeout suggests that the peers of the cluster are unable to reach each other. Can you confirm that you have Endpoints by running the following command?
kubectl get endpoints mariadb-galera-internal -n default -o yaml
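A direct reachability probe of the Galera replication port (4567) between peers could also help; a rough sketch using a throwaway busybox pod (the pod and DNS names are taken from the error above, the probe command itself is an assumption):
kubectl -n mariadb-operator run galera-probe --rm -it --restart=Never --image=busybox -- \
  sh -c 'nc -w 3 mariadb-galera-dev-0.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local 4567 </dev/null \
    && echo reachable || echo unreachable'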
The next thing to do is to check why the operator is unable to bootstrap a new cluster. Could you enable the debug logs and send the output? See this comment for reference.
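For instance, a sketch of enabling debug logs by patching the operator Deployment; the Deployment name and the --zap-log-level flag are assumptions based on common controller-runtime conventions, so check the container args for the exact flag:
kubectl -n mariadb-operator patch deployment mariadb-operator --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--zap-log-level=debug"}]'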
Thank you for your reply @mmontes11 !
~$ kubectl get endpoints -n mariadb-operator mariadb-galera-dev-internal -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  creationTimestamp: "2023-08-02T14:36:53Z"
  labels:
    app.kubernetes.io/instance: mariadb-galera-dev
    app.kubernetes.io/name: mariadb
    service.kubernetes.io/headless: ""
  name: mariadb-galera-dev-internal
  namespace: mariadb-operator
  resourceVersion: "16906138"
  uid: c49cfbfa-18ba-4d5b-8ede-e15a5c86e5f1
subsets:
- addresses:
  - hostname: mariadb-galera-dev-1
    ip: 10.244.2.77
    nodeName: c01-b01-am5-redacted-nl
    targetRef:
      kind: Pod
      name: mariadb-galera-dev-1
      namespace: mariadb-operator
      uid: d1d97e7b-7c4b-44bc-a6f0-82b0c591bb78
  - hostname: mariadb-galera-dev-0
    ip: 10.244.4.86
    nodeName: c01-b03-am5-redacted-nl
    targetRef:
      kind: Pod
      name: mariadb-galera-dev-0
      namespace: mariadb-operator
      uid: 978f67fa-0446-479c-9ba2-58af8d22cfcc
  - hostname: mariadb-galera-dev-2
    ip: 10.244.5.171
    nodeName: c01-b04-am5-redacted-nl
    targetRef:
      kind: Pod
      name: mariadb-galera-dev-2
      namespace: mariadb-operator
      uid: 9ef243b2-4eab-468a-8014-f1c7aaf6bbb8
  ports:
  - name: sst
    port: 4568
    protocol: TCP
  - name: ist
    port: 4567
    protocol: TCP
  - name: mariadb
    port: 3306
    protocol: TCP
  - name: agent
    port: 5555
    protocol: TCP
  - name: cluster
    port: 4444
    protocol: TCP
I decided to do a clean run to verify that it had nothing to do with an old version I upgraded from, or with me changing specs before the first init could complete on all machines. Unfortunately, the pod I was actively connected to during the import (mariadb-galera-dev-2) crashed during the new import.
The MySQL client reported: ERROR 1180 (HY000) at line 1973: Got error 6 "No such device or address" during COMMIT
Fortunately, this time I had set up the ELK stack before any crashes on a brand new deployment. This log contains all the crashes ever since. Most interesting is what happened before line 260
( {"level":"info","ts":1690994798.0681577,"logger":"server","msg":"shutting down server"} )
Hope this helps! Should I provide additional logs from the operator itself during the same time period?
By now all 3 of the replicas have crashed. The logs of the operator are full of:
{"level":"error","ts":1691061971.720084,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"22eb3850-91b4-4075-a039-efb80ee1ed69","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061971.7470686,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"cda8b669-530a-4a5f-b8e2-587d6640163c","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061971.7836084,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"506204ad-feee-4b41-8680-1625d873551b","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061971.835254,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"2019235e-01b0-4e28-abe2-52cf1c505824","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061971.9257667,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"9b13d159-8a5a-4df2-a4e7-83ca666767a0","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061972.1077244,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"50f0d2f9-dfc4-4f33-8ac6-5cd866804b47","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061972.4478083,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"5784aedd-ed2e-464f-b1fd-10336845ebc3","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061973.107079,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"7a9de322-2cc0-4cd2-aa80-221527e210f9","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061974.4047112,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"3f122380-a9ea-4e3f-88bf-4486292f8660","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061976.9765716,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"9d9f3de4-e88d-4693-89a2-1f391cde6478","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061982.1107109,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"7e31b09a-120d-48c9-ad47-0a5ba2707e37","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691061992.3668597,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"06c653c4-75f6-400e-b858-8f4c7d46b45e","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691062012.8634171,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"63aa7cf8-6135-4929-8012-7b2166d72942","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691062053.837543,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"2f8ce39e-54c3-4b1a-82b5-4a3a7c86b3f8","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691062135.772354,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"d3609911-749d-45dd-8ba3-36e3100c8fc0","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691062299.6243398,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"655eb929-5eb1-474a-92a3-7ba4e5889859","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691062627.3180552,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"d13d92bb-7936-4437-a3d2-9a43adb8f457","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
{"level":"error","ts":1691063282.6937213,"msg":"Reconciler error","controller":"connection","controllerGroup":"mariadb.mmontes.io","controllerKind":"Connection","connection":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"31330894-0554-47fe-bb7c-e7312000ec1e","error":"MariaDB not ready","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.12.3/pkg/internal/controller/controller.go:234"}
Hey there @tflabs-nl ! Thanks a lot for the details.
I was carefully taking a look at the logs you sent, but unfortunately they don't shed light on the question I want to answer: why is the operator not recovering the cluster? We know for sure that MariaDB is under pressure, eventually leading to crashes, and that it is getting out of sync because it cannot cope with the load. But this is a different topic; it requires tuning up spec.resources. I would suggest temporarily increasing them to be able to restore the 13GB backup.
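For illustration, a minimal sketch of bumping spec.resources on the MariaDB CR via a merge patch; the figures are placeholders (they assume the 12G InnoDB buffer pool from the config below needs headroom), not a recommendation:
kubectl -n mariadb-operator patch mariadb mariadb-galera-dev --type=merge -p '
spec:
  resources:
    requests:
      cpu: "2"
      memory: 16Gi
    limits:
      memory: 16Gi
'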
{"level":"info","ts":1690994798.0681577,"logger":"server","msg":"shutting down server"} )
These are the sidecar agent logs when the Pod gets killed, either by Kubernetes for reaching memory limits, or because the spec.galera.recovery.podRecoveryTimeout is reached and therefore the operator attempts to restart the Pod.
Now, regarding why the operator is not recovering the cluster, we need more light to reach a conclusion, so I would suggest the following:
- Events of the MariaDB object:
kubectl get events --field-selector involvedObject.name=mariadb-galera -n mariadb --sort-by='.lastTimestamp'
- Logs of the CrashLoopBackoff MariaDB Pod container, not only of the main mariadb container but also the agent and init ones.
We experienced similar problems in the following issues, you may check them for context:
As a takeaway from these issues, I will be adding a troubleshooting section to GALERA.md
Hi @mmontes11 thank you for checking this out!
I've set the operator log level to debug in the deployment and verified that it got properly applied.
Also, I changed the recovery times to the following:
{"clusterBootstrapTimeout":"50m0s","clusterHealthyTimeout":"15m0s","enabled":true,"podRecoveryTimeout":"30m0s","podSyncTimeout":"30m0s"}
Meanwhile I noticed this line in the logs:
{"level":"error","ts":1691069098.21376,"logger":"handler.recovery","msg":"error recovering galera from recovery log","error":"error unmarshaling bootstrap: unable to find uuid and seqno: uuid=<nil> seqno=<nil>","stacktrace":"k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.WaitForWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:660\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:594\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.24.2/pkg/util/wait/wait.go:545\ngithub.com/mariadb-operator/agent/pkg/handler.(*Recovery).pollUntilRecovered\n\t/app/pkg/handler/recovery.go:118\ngithub.com/mariadb-operator/agent/pkg/handler.(*Recovery).Post\n\t/app/pkg/handler/recovery.go:90\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).routeHTTP\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:444\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:73\ngithub.com/go-chi/chi/v5.(*Mux).Mount.func1\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:316\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).routeHTTP\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:444\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/mariadb-operator/agent/pkg/kubernetesauth.(*KubernetesAuth).Handler.func1\n\t/app/pkg/kubernetesauth/kubernetesauth.go:76\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5/middleware.RequestLogger.func1.1\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/middleware/logger.go:54\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:73\ngithub.com/go-chi/chi/v5.(*Mux).Mount.func1\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:316\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).routeHTTP\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:444\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5/middleware.Recoverer.func1\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/middleware/recoverer.go:43\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5/middleware.(*Compressor).Handler.func1\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/middleware/compress.go:213\nnet/http.HandlerFunc.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2122\ngithub.com/go-chi/chi/v5.(*Mux).ServeHTTP\n\t/go/pkg/mod/github.com/go-chi/chi/v5@v5.0.8/mux.go:90\nnet/http.serverHandler.ServeHTTP\n\t/usr/local/go/src/net/http/server.go:2936\nnet/http.(*conn).serve\n\t/usr/local/go/src/net/http/server.go:1995"}
Probably, because of the unclean shutdown on (eventually) all nodes, no last commit/record UUID is stored. What is the best way to bootstrap a cluster in which all nodes failed using the operator?
Meanwhile this piece of debugging info is recorded by the operator:
{"level":"debug","ts":1691069359.204221,"logger":"galera.recovery.cluster","msg":"Error polling","controller":"mariadb","controllerGroup":"mariadb.mmontes.io","controllerKind":"MariaDB","mariaDB":{"name":"mariadb-galera-dev","namespace":"mariadb-operator"},"namespace":"mariadb-operator","name":"mariadb-galera-dev","reconcileID":"4cedcf81-e3e4-45e3-86e5-c9e36c196d0a","err":"error doing request: Post \"http://mariadb-galera-dev-2.mariadb-galera-dev-internal.mariadb-operator.svc.cluster.local:5555/api/recovery\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
Hey there @tflabs-nl ! Thanks for investigating, we are making progress here, as this confirms that mariadb-operator is actually attempting to perform the recovery.
{"clusterBootstrapTimeout":"50m0s","clusterHealthyTimeout":"15m0s","enabled":true,"podRecoveryTimeout":"30m0s","podSyncTimeout":"30m0s"}
This sounds sensible, I actually increased the defaults yesterday as well.
error unmarshaling bootstrap: unable to find uuid and seqno: uuid=<nil> seqno=<nil>
This is logged by the agent container when it attempts to parse the uuid and seqno from /var/lib/mysql/mariadb.err, here:
https://github.com/mariadb-operator/agent/blob/098241e9f786b87be539d70a5ecaa5b186dd64d7/pkg/galera/galera.go#L181
So basically the operator puts a Pod in recovery mode, meaning that it will be restarting and potentially outputting a line in /var/lib/mysql/mariadb.err starting with WSREP: Recovered position:, which is what the agent is trying to parse. More info can be found at:
https://galeracluster.com/library/documentation/crash-recovery.html
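For reference, a manual sketch of what that recovery mode boils down to (per the Galera doc above); the operator drives the equivalent through the agent, so this is only for illustration:
# run mariadbd in recovery mode; it computes the position from InnoDB and exits
mariadbd --wsrep-recover --log-error=/var/lib/mysql/mariadb.err
# on success, the recovered position shows up in the error log
grep 'WSREP: Recovered position:' /var/lib/mysql/mariadb.err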
The problem is that, for some reason, your MariaDB instance is unable to recover the uuid and seqno after entering recovery mode. A successful recovery would output something like this:
WSREP: Recovered position: 220dcdcb-1629-11e4-add3-aec059ad3734:1122
Instead, your mariadb.err must be empty or corrupt, leading to the following error:
error unmarshaling bootstrap: unable to find uuid and seqno: uuid=<nil> seqno=<nil>
It would be great if you could grab /var/lib/mysql/mariadb.err to see what we are dealing with.
Maybe it has to do with the configuration somehow?
[mysqld]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2
max_allowed_packet=512M
relay-log = relay-bin
log-slave-updates = 1
expire_logs_days = 7
max_binlog_size = 1000M
## Safety
##
skip_host_cache
skip_name_resolve
max_allowed_packet=16M
max_connect_errors=1000000
sql_mode=STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION,NO_ZERO_DATE,NO_ZERO_IN_DATE,ONLY_FULL_GROUP_BY
sysdate_is_now=1
log_bin=mysql-bin
expire_logs_days=14
# Disabling for performance per http://severalnines.com/blog/9-tips-going-production-galera-cluster-mysql
sync_binlog=0
## Caches and Limits
tmp_table_size=32M
max_heap_table_size=32M
# Re-enabling as now works with Maria 10.1.2
query_cache_type=1
query_cache_limit=4M
query_cache_size=256M
max_connections=3000
thread_handling = pool-of-threads
thread_stack = 192K
thread_cache_size = 4
thread_pool_size = 8
thread_cache_size=50
thread_pool_oversubscribe = 3
open_files_limit=65535
table_definition_cache=4096
table_open_cache=4096
## InnoDB
#innodb=FORCE
innodb_strict_mode=1
innodb_autoinc_lock_mode=2
innodb_doublewrite=1
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_log_files_in_group=2
innodb_log_file_size=1G
innodb_flush_log_at_trx_commit=2
innodb_file_per_table=1
# 80% Memory is default reco.
# Need to re-evaluate when DB size grows
innodb_buffer_pool_size=12G
innodb_buffer_pool_instances=8
innodb_file_format=Barracuda
innodb_read_io_threads=8
innodb_write_io_threads=4
innodb_log_buffer_size = 512M
cc @grooverdan in case you have any ideas. sync_binlog=0 might not be helping with the consistency.
What is the best way to bootstrap a cluster in which all nodes failed using the operator?
To answer your question: recovering a cluster with all failed nodes is exactly what the operator is attempting to do here, but it needs the seqnos and uuids to know which node should be used to bootstrap a new cluster.
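For context, each node also keeps this state on disk in grastate.dat; the values below are illustrative (after a crash, seqno is typically -1, which is why the recovery mode has to recompute it from InnoDB):
cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    220dcdcb-1629-11e4-add3-aec059ad3734
seqno:   -1
safe_to_bootstrap: 0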
Hi @mmontes11 ! I have to agree with you that disabling binlog syncing is not the best in terms of consistency.
I saw a blog post regarding MariaDB Galera optimizations and followed that:
Said blog post
I first had some questions regarding the problems that may arise from not enabling sync_binlog, but found the following blog post:
trx commit 2 /w sync binlog evil
Thereby I figured that 'a couple of transactions' would be fine in our case. (As I understood it, this means losing a couple of transactions on pod failure, which should be rare.)
I seem to be unable to overwrite the arguments of the agent (spec: Forbidden: pod updates may not change fields other than...). Also, the agent doesn't have bash or sh, so I cannot exec into the pod. The mariadb container does not seem to be started, so I cannot execute any commands in there either. Maybe it would be an idea to create a second image, based on Ubuntu or any other image, that has some basic support for debugging?
Is there any other way I can cat the /var/lib/mysql/mariadb.err file without having to destroy (or in this case possibly rescale) the MariaDB CR and mount the PVC in another pod?
Hey @tflabs-nl !
Maybe it would be an idea to create a second image that is based on ubuntu or any other image that has some basic support for debugging?
This will work for sure
Is there any other way I can cat the /var/lib/mysql/mariadb.err file without having to destroy (or in this case possibly rescale) the mariadb CR and mounting the PVC in another pod?
I was giving ephemeral containers a try, but unfortunately, AFAIK, they don't support volumeMounts:
kubectl debug -it mariadb-galera-0 --image=busybox --copy-to=mariadb-debug
As an alternative, you can open a debugging shell on the Node, with its filesystem mounted at /host:
kubectl debug node/mdb-worker3 -it --image=busybox
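Once on the Node, the PVC contents live somewhere under the kubelet pods directory on the host filesystem; the exact layout depends on the CSI driver, so a find is the safest bet (a sketch):
find /host/var/lib/kubelet/pods -name mariadb.err 2>/dev/null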
Hi @mmontes11 !
I tried to access the container's filesystem via the host, by looking up the container ID via kubectl describe pod <podname> and then searching for that container ID within the /proc/mounts file on the host.
This yielded a brief result for the MariaDB container, but because the container keeps failing, the FS is quickly unmounted. Accessing the agent container, for example, is no problem, but AFAIK there's nothing of interest in there, right?
I'll try to race Kubernetes by looping through all the containers with cat <containerID>/var/lib/mysql/mariadb.err from the host. Hopefully one of them is a hit; would be cool to see if that works.
Hey @tflabs-nl !
Accessing the agent container for example is no problem, but AFAIK there's nothing of interest in there, right?
The agent shares the /var/lib/mysql volume with the mariadb container, so it might be valid as well.
Hi @mmontes11 , it seems like /var/lib/mysql does not exist on that mountpoint. Which also kind of makes sense, as the PVCs storage-mariadb-galera-dev-2 and galera-mariadb-galera-dev-2 are both RWO. The path I was looking at also only contains the rootfs of the container, so no mounts included.
Eventually, I did find the correct mount point location. Getting the logs was as easy as SSH'ing into the node that currently runs (or is supposed to run) the pod and issuing the following command:
cat $(find / -name "pvc-562ce812-e4bb-49a3-8d00-6a8963c017cf")/mount/mariadb.err
in which pvc-562ce812-e4bb-49a3-8d00-6a8963c017cf is the full PVC name, as obtained by kubectl get pvc -n mariadb-operator.
Here is a small portion of the log:
2023-08-05 6:04:40 0 [Note] Starting MariaDB 11.0.2-MariaDB-1:11.0.2+maria~ubu2204-log source revision 0005f2f06c8e1aea4915887decad67885108a929 as process 88
2023-08-05 6:04:40 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-08-05 6:04:40 0 [Note] InnoDB: Number of transaction pools: 1
2023-08-05 6:04:40 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-08-05 6:04:40 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-08-05 6:04:40 0 [Note] InnoDB: Initializing buffer pool, total size = 12.000GiB, chunk size = 192.000MiB
2023-08-05 6:04:40 0 [Note] InnoDB: Completed initialization of buffer pool
2023-08-05 6:04:40 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2023-08-05 6:04:41 0 [Note] InnoDB: Opened 3 undo tablespaces
2023-08-05 6:04:41 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2023-08-05 6:04:41 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-08-05 6:04:41 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-08-05 6:04:41 0 [Note] InnoDB: log sequence number 37745; transaction id 4
2023-08-05 6:04:41 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2023-08-05 6:04:41 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-08-05 6:04:41 0 [Note] Plugin 'wsrep-provider' is disabled.
2023-08-05 6:04:41 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
2023-08-05 6:04:41 0 [Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:04:41 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:04:41 0 [Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:04:41 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
2023-08-05 6:04:41 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2023-08-05 6:04:41 0 [Note] WSREP: disabled, skipping position recovery
Warning: Memory not freed: 280
2023-08-05 6:06:47 0 [Note] Starting MariaDB 11.0.2-MariaDB-1:11.0.2+maria~ubu2204-log source revision 0005f2f06c8e1aea4915887decad67885108a929 as process 88
2023-08-05 6:06:47 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-08-05 6:06:47 0 [Note] InnoDB: Number of transaction pools: 1
2023-08-05 6:06:47 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-08-05 6:06:47 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-08-05 6:06:47 0 [Note] InnoDB: Initializing buffer pool, total size = 12.000GiB, chunk size = 192.000MiB
2023-08-05 6:06:47 0 [Note] InnoDB: Completed initialization of buffer pool
2023-08-05 6:06:47 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2023-08-05 6:06:48 0 [Note] InnoDB: Opened 3 undo tablespaces
2023-08-05 6:06:48 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2023-08-05 6:06:48 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-08-05 6:06:48 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-08-05 6:06:48 0 [Note] InnoDB: log sequence number 37745; transaction id 4
2023-08-05 6:06:48 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2023-08-05 6:06:48 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-08-05 6:06:48 0 [Note] Plugin 'wsrep-provider' is disabled.
2023-08-05 6:06:48 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
2023-08-05 6:06:48 0 [Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:06:48 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:06:48 0 [Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:06:48 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
2023-08-05 6:06:48 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2023-08-05 6:06:48 0 [Note] WSREP: disabled, skipping position recovery
Warning: Memory not freed: 280
2023-08-05 6:10:15 0 [Note] Starting MariaDB 11.0.2-MariaDB-1:11.0.2+maria~ubu2204-log source revision 0005f2f06c8e1aea4915887decad67885108a929 as process 88
2023-08-05 6:10:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-08-05 6:10:15 0 [Note] InnoDB: Number of transaction pools: 1
2023-08-05 6:10:15 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-08-05 6:10:15 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-08-05 6:10:15 0 [Note] InnoDB: Initializing buffer pool, total size = 12.000GiB, chunk size = 192.000MiB
2023-08-05 6:10:15 0 [Note] InnoDB: Completed initialization of buffer pool
2023-08-05 6:10:15 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2023-08-05 6:10:16 0 [Note] InnoDB: Opened 3 undo tablespaces
2023-08-05 6:10:16 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2023-08-05 6:10:16 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-08-05 6:10:16 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-08-05 6:10:16 0 [Note] InnoDB: log sequence number 37745; transaction id 4
2023-08-05 6:10:16 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2023-08-05 6:10:16 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-08-05 6:10:16 0 [Note] Plugin 'wsrep-provider' is disabled.
2023-08-05 6:10:16 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
2023-08-05 6:10:16 0 [Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:10:16 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:10:16 0 [Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:10:16 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
2023-08-05 6:10:16 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2023-08-05 6:10:16 0 [Note] WSREP: disabled, skipping position recovery
Warning: Memory not freed: 280
2023-08-05 6:15:50 0 [Note] Starting MariaDB 11.0.2-MariaDB-1:11.0.2+maria~ubu2204-log source revision 0005f2f06c8e1aea4915887decad67885108a929 as process 88
2023-08-05 6:15:50 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-08-05 6:15:50 0 [Note] InnoDB: Number of transaction pools: 1
2023-08-05 6:15:50 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-08-05 6:15:50 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-08-05 6:15:50 0 [Note] InnoDB: Initializing buffer pool, total size = 12.000GiB, chunk size = 192.000MiB
2023-08-05 6:15:50 0 [Note] InnoDB: Completed initialization of buffer pool
2023-08-05 6:15:50 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2023-08-05 6:15:51 0 [Note] InnoDB: Opened 3 undo tablespaces
2023-08-05 6:15:51 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2023-08-05 6:15:51 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-08-05 6:15:51 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-08-05 6:15:51 0 [Note] InnoDB: log sequence number 37745; transaction id 4
2023-08-05 6:15:51 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2023-08-05 6:15:51 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-08-05 6:15:51 0 [Note] Plugin 'wsrep-provider' is disabled.
2023-08-05 6:15:51 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
2023-08-05 6:15:51 0 [Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:15:51 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:15:51 0 [Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:15:51 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
2023-08-05 6:15:51 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2023-08-05 6:15:51 0 [Note] WSREP: disabled, skipping position recovery
Warning: Memory not freed: 280
2023-08-05 6:21:32 0 [Note] Starting MariaDB 11.0.2-MariaDB-1:11.0.2+maria~ubu2204-log source revision 0005f2f06c8e1aea4915887decad67885108a929 as process 88
2023-08-05 6:21:32 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-08-05 6:21:32 0 [Note] InnoDB: Number of transaction pools: 1
2023-08-05 6:21:32 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-08-05 6:21:32 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-08-05 6:21:32 0 [Note] InnoDB: Initializing buffer pool, total size = 12.000GiB, chunk size = 192.000MiB
2023-08-05 6:21:32 0 [Note] InnoDB: Completed initialization of buffer pool
2023-08-05 6:21:32 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2023-08-05 6:21:33 0 [Note] InnoDB: Opened 3 undo tablespaces
2023-08-05 6:21:33 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2023-08-05 6:21:33 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-08-05 6:21:33 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-08-05 6:21:33 0 [Note] InnoDB: log sequence number 37745; transaction id 4
2023-08-05 6:21:33 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2023-08-05 6:21:33 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-08-05 6:21:33 0 [Note] Plugin 'wsrep-provider' is disabled.
2023-08-05 6:21:33 0 [ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
2023-08-05 6:21:33 0 [Warning] 'innodb-log-files-in-group' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:21:33 0 [Warning] 'innodb-buffer-pool-instances' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:21:33 0 [Warning] 'innodb-file-format' was removed. It does nothing now and exists only for compatibility with old my.cnf files.
2023-08-05 6:21:33 1 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1017: Can't find file: './mysql/' (errno: 2 "No such file or directory")
2023-08-05 6:21:33 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2023-08-05 6:21:33 0 [Note] WSREP: disabled, skipping position recovery
Warning: Memory not freed: 280
I don't see any UUIDs being mentioned, unfortunately...
Hey @tflabs-nl ! Thanks a lot for your effort diagnosing this. This is very valuable, not only for your use case but also for the troubleshooting section I'm currently writing.
I don't see any UUID's being mentioned unfortunately...
Right, but then this explains why the agent fails trying to recover them, returning the error:
error unmarshaling bootstrap: unable to find uuid and seqno: uuid=<nil> seqno=<nil>
These log lines might be pointing at the root cause:
2023-08-05 6:15:51 0 [ERROR] Can't open and lock privilege tables: Table 'mysql.servers' doesn't exist
2023-08-05 6:15:51 0 [Note] WSREP: disabled, skipping position recovery
After quickly googling them, I've seen other people reporting similar problems:
Will dig deeper and get back to you.
cc @grooverdan
Thanks a lot! 🙏🏻
This is what a healthy /var/lib/mysql/mariadb.err looks like:
root@mariadb-galera-0:/# cat /var/lib/mysql/mariadb.err
2023-08-05 14:59:14 0 [Note] Starting MariaDB 11.0.2-MariaDB-1:11.0.2+maria~ubu2204 source revision 0005f2f06c8e1aea4915887decad67885108a929 as process 1
2023-08-05 14:59:14 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-08-05 14:59:14 0 [Note] InnoDB: Number of transaction pools: 1
2023-08-05 14:59:14 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-08-05 14:59:14 0 [Note] mariadbd: O_TMPFILE is not supported on /tmp (disabling future attempts)
2023-08-05 14:59:14 0 [Note] InnoDB: Using liburing
2023-08-05 14:59:14 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2023-08-05 14:59:14 0 [Note] InnoDB: Completed initialization of buffer pool
2023-08-05 14:59:14 0 [Note] InnoDB: File system buffers for log disabled (block size=512 bytes)
2023-08-05 14:59:14 0 [Note] InnoDB: Opened 3 undo tablespaces
2023-08-05 14:59:14 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2023-08-05 14:59:14 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-08-05 14:59:14 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-08-05 14:59:14 0 [Note] InnoDB: log sequence number 81542; transaction id 95
2023-08-05 14:59:14 0 [Warning] InnoDB: Skipping buffer pool dump/restore during wsrep recovery.
2023-08-05 14:59:14 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-08-05 14:59:15 0 [Note] Server socket created on IP: '0.0.0.0'.
2023-08-05 14:59:15 0 [Note] WSREP: Recovered position: 6ea235ec-3232-11ee-8152-4af03d2c43a9:17
Warning: Memory not freed: 280
I've spotted another difference:
2023-08-05 6:21:33 0 [Note] Plugin 'wsrep-provider' is disabled.
Which is strange, because we do enable wsrep explicitly in /etc/mysql/mariadb.conf.d/0-galera.cnf:
https://github.com/mariadb-operator/init/blob/main/pkg/config/config.go#L46
Maybe this is clashing with your spec.myCnf?
Maybe this is clashing with your spec.myCnf?
I've checked this, and most likely it shouldn't be the case, see:
[Note] Plugin 'wsrep-provider' is disabled.
[ERROR] Could not open mysql.plugin table: "Table 'mysql.plugin' doesn't exist". Some plugins may be not loaded
My understanding is that somehow the mysql.plugin table got corrupted and the wsrep-provider plugin wasn't loaded. Unfortunately, Galera relies on this plugin, and without it recovering the sequence is not possible.
Maybe it is worth starting from a clean database at this point, if it's not production? I would definitely set sync_binlog=1 to restore the 13GB backup, and then put it back to 0 if you want.
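For the restore window, a hedged sketch of delivering that via spec.myCnf; mind that a merge patch replaces the whole field, so fold these lines into your full config, and note that pairing it with innodb_flush_log_at_trx_commit=1 for full durability is an addition here, not something required:
kubectl -n mariadb-operator patch mariadb mariadb-galera-dev --type=merge -p '
spec:
  myCnf: |
    [mysqld]
    sync_binlog=1
    innodb_flush_log_at_trx_commit=1
'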
Seen some hacks to solve this though:
In production, the approach would be taking Backups regularly enough to minimize RTO and RPO, and Restoring them if this situation happens, to bring mysql.plugin back to an operational state.
Hey @mmontes11 ! Had a holiday and tried to do a bit of PC detoxing haha. Would it be beneficial for the troubleshooting documentation if I tried that 'hack'? If not, I will recreate the environment from scratch and re-import without disabling sync_binlog. Maybe a bit of hijacking my own issue, but does backup / restore do incremental backups?
Thank you so much for creating these troubleshooting steps!
I'm not quite sure about this hack; it's probably not worth it, but I leave it up to you. Even if it helped to fix the mysql.plugin table, it wouldn't be something to be implemented in the operator, as automating it would probably bring more problems than benefits.
If not I will recreate the environment from scratch and re-import without disabling the sync_binlog.
This would be my take if you ask me
Maybe a bit of hijacking my own issue, but does backup / restore do incremental backups?
We don't support incremental backups but it is in our roadmap:
Before we do that, we are planning to support S3 storage for backups:
Thank you so much for creating these troubleshooting steps!
Thanks, it hopefully shed some light on your investigation.
This issue is stale because it has been open 30 days with no activity.
This issue was closed because it has been stalled for 5 days with no activity.
Describe the bug
Yesterday evening I deployed mariadb-operator for a development Galera cluster. I then started importing all table structures, which went fine. After that, I tried to import some big SQL data files in parallel over the course of the night, the biggest one being a 13G file. All files were originally created using mysqldump from a production database. Unfortunately, when I woke up, I saw that all 3 of Galera's instances had crashed and that the pods were in a CrashLoop.
Unfortunately, this cluster does not have additional logging yet, so figuring out why the first crash happened is not an option anymore. What is an option is to bootstrap the cluster and resume operations from the last known good state. Unfortunately, mariadb-operator does not seem to provide a way to do this. All tries to start the servers end in the following logs:
Expected behaviour
mariadb-operator could either
Steps to reproduce the bug
Freshly deploy mariadb-operator (v0.19.0) using the following yaml:
Additional context
We're using Rook for storage provisioning. All disks report to be healthy, and no crashes have been reported by Ceph (or Kubernetes) over the last 24 hours.
Environment details: