rabbitmq / cluster-operator

RabbitMQ Cluster Kubernetes Operator
https://www.rabbitmq.com/kubernetes/operator/operator-overview.html
Mozilla Public License 2.0

Nodes that form a new cluster may not cluster correctly #662

Closed gerhard closed 3 years ago

gerhard commented 3 years ago

Describe the bug

Since https://github.com/rabbitmq/cluster-operator/pull/621 was introduced, nodes that form a new cluster may not cluster correctly.

In our (+@ansd) case, pod quick-rabbit-server-0 formed its own cluster, while quick-rabbit-server-1 & quick-rabbit-server-2 formed a second cluster with the same name as the first cluster. Everything looks healthy from the K8S perspective (3 ready pods) & the Erlang perspective (6 healthy distribution links), but we have 2 RabbitMQ clusters, one with 1 node & one with 2 nodes, and this is clearly wrong.

To Reproduce

This is a difficult one to reproduce as it's timing-specific. We are including all the logs and this can be reproduced with https://github.com/rabbitmq/observability-2021/tree/bf77efebc6760e16d8176bc0c7b750204d8b2a7e/talks/emea-tech-talk (private repo available to all maintainers of this repo) using the following steps:

make 1.k8s 2.k8s-rabbitmq 4.resolve-first-problem 5.second-problem

While 5.second-problem is not technically needed, it makes the problem very obvious:

(screenshots omitted)

Version and environment information

Quick fix

Our quick fix was to reset quick-rabbit-server-0:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app
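
For completeness, the same reset can be run from outside the pod; a minimal sketch, assuming the pod and container names used above:

```shell
# Reset only the node that formed its own cluster (quick-rabbit-server-0 here).
kubectl exec quick-rabbit-server-0 -c rabbitmq -- rabbitmqctl stop_app
kubectl exec quick-rabbit-server-0 -c rabbitmq -- rabbitmqctl reset
kubectl exec quick-rabbit-server-0 -c rabbitmq -- rabbitmqctl start_app
# If the node does not rejoin on its own, `rabbitmqctl join_cluster <existing-node>`
# (run between reset and start_app) can be used to join it explicitly.
```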

Additional context

kubectl logs --selector app.kubernetes.io/name=quick-rabbit --prefix=true --tail=-1
...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:40.634 [info] <0.273.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default is empty. Assuming we need to join an existing cluster or initialise from scratch...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:40.634 [info] <0.273.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:40.634 [info] <0.273.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:40.634 [info] <0.273.0> Peer discovery backend does not support locking, falling back to randomized delay
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:40.634 [info] <0.273.0> Peer discovery backend rabbit_peer_discovery_k8s supports registration.
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:40.634 [info] <0.273.0> Will wait for 17965 milliseconds before proceeding with registration...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:58.615 [info] <0.273.0> All discovered existing cluster peers: rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:58.615 [info] <0.273.0> Peer nodes we can cluster with: rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:58.616 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:58.616 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:58.616 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 9 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:59.117 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:59.118 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:59.118 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 8 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:59.619 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:59.620 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:38:59.620 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 7 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:00.122 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:00.123 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:00.123 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 6 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:00.624 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:00.625 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:00.625 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 5 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:01.126 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:01.127 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:01.127 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 4 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:01.628 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:01.629 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:01.629 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 3 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:02.130 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:02.131 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:02.131 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 2 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:02.632 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:02.633 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:02.633 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 1 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:03.134 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:03.135 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:03.135 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 0 retries left...
[pod/quick-rabbit-server-0/rabbitmq] 2021-04-14 14:39:03.637 [warning] <0.273.0> Could not successfully contact any node of: rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default,rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default (as in Erlang distribution). Starting as a blank standalone node...
...
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:38:40.764 [info] <0.273.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default is empty. Assuming we need to join an existing cluster or initialise from scratch...
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:38:40.764 [info] <0.273.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:38:40.764 [info] <0.273.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:38:40.764 [info] <0.273.0> Peer discovery backend does not support locking, falling back to randomized delay
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:38:40.764 [info] <0.273.0> Peer discovery backend rabbit_peer_discovery_k8s supports registration.
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:38:40.764 [info] <0.273.0> Will wait for 21437 milliseconds before proceeding with registration...
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.211 [info] <0.273.0> All discovered existing cluster peers: rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.211 [info] <0.273.0> Peer nodes we can cluster with: rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.212 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.212 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.212 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 9 retries left...
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.714 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.715 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:02.715 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 8 retries left...
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:03.216 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:03.217 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:03.217 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 7 retries left...
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:03.717 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-2/rabbitmq] 2021-04-14 14:39:03.733 [info] <0.273.0> Node 'rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default' selected for auto-clustering
...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:40.763 [info] <0.273.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default is empty. Assuming we need to join an existing cluster or initialise from scratch...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:40.763 [info] <0.273.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:40.763 [info] <0.273.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:40.763 [info] <0.273.0> Peer discovery backend does not support locking, falling back to randomized delay
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:40.763 [info] <0.273.0> Peer discovery backend rabbit_peer_discovery_k8s supports registration.
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:40.764 [info] <0.273.0> Will wait for 17512 milliseconds before proceeding with registration...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.292 [info] <0.273.0> All discovered existing cluster peers: rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.292 [info] <0.273.0> Peer nodes we can cluster with: rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default, rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.297 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.302 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.302 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 9 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.803 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.804 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:58.804 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 8 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:59.305 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:59.306 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:59.306 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 7 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:59.807 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:59.808 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:38:59.808 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 6 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:00.309 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:00.310 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:00.310 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 5 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:00.811 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:00.812 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:00.812 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 4 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:01.313 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:01.314 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:01.314 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 3 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:01.815 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:01.816 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:01.816 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 2 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:02.317 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:02.318 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:02.318 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 1 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:02.819 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:02.820 [warning] <0.273.0> Could not auto-cluster with node rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default: {error,tables_not_present}
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:02.820 [error] <0.273.0> Trying to join discovered peers failed. Will retry after a delay of 500 ms, 0 retries left...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.320 [warning] <0.273.0> Could not successfully contact any node of: rabbit@quick-rabbit-server-0.quick-rabbit-nodes.default,rabbit@quick-rabbit-server-2.quick-rabbit-nodes.default (as in Erlang distribution). Starting as a blank standalone node...
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.324 [info] <0.44.0> Application mnesia exited with reason: stopped
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.324 [info] <0.44.0> Application mnesia exited with reason: stopped
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.351 [info] <0.44.0> Application mnesia started on node 'rabbit@quick-rabbit-server-1.quick-rabbit-nodes.default'
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.461 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.461 [info] <0.273.0> Successfully synced tables from a peer
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.494 [info] <0.273.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
[pod/quick-rabbit-server-1/rabbitmq] 2021-04-14 14:39:03.494 [info] <0.273.0> Successfully synced tables from a peer
...

Attaching all logs as a file (1000+ lines): parallel-startup-problem.txt

Related improvement

In the context of alerts, we are missing metrics that would enable us to alert when fewer than the expected number of RabbitMQ nodes are present in the cluster. In this case, we were expecting RabbitMQ to form a 3-node cluster. If that doesn't happen, we should have an alert that catches it. Something for myself & @ansd to follow up on. I'm adding it here so that we have it all in a single place.
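
Until such an alert exists, a rough spot check can be scripted; a sketch, assuming `rabbitmqctl cluster_status --formatter json` and `jq` are available, and that the JSON output contains a `running_nodes` list:

```shell
expected=3
for i in 0 1 2; do
  running=$(kubectl exec quick-rabbit-server-$i -c rabbitmq -- \
    rabbitmqctl cluster_status --formatter json | jq '.running_nodes | length')
  echo "quick-rabbit-server-$i sees $running of $expected running nodes"
done
```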

ansd commented 3 years ago

This happens sporadically on kind v0.10.0 go1.15.7 darwin/amd64 as well.

Steps to reproduce:

kind delete cluster; kind create cluster; kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/download/v1.6.0/cluster-operator.yml ; kubectl rabbitmq create myrabbit --replicas 3

Once all pods are ready:

> kubectl exec myrabbit-server-0 -- rabbitmqctl cluster_status | grep -A 5 "Running Nodes"
Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
Running Nodes

rabbit@myrabbit-server-0.myrabbit-nodes.default
rabbit@myrabbit-server-1.myrabbit-nodes.default

Versions
> kubectl exec myrabbit-server-2 -- rabbitmqctl cluster_status | grep -A 5 "Running Nodes"
Defaulted container "rabbitmq" out of: rabbitmq, setup-container (init)
Running Nodes

rabbit@myrabbit-server-2.myrabbit-nodes.default

Versions
slacksach commented 3 years ago

Observed this behaviour yesterday. We were performing an upgrade from 0.48.0 to 1.6.0; the -0 node started, then -1 and -2 started and formed their own cluster.

toredash commented 3 years ago

Confirmed, same behaviour here. I find it strange that the PR got included, as the documentation clearly says OrderedReady should be used: https://www.rabbitmq.com/cluster-formation.html

Use OrderedReady Pod Management Policy
Peer discovery mechanism will filter out nodes whose pods are not yet ready (initialised) according to their readiness probe as reported by the Kubernetes API. For example, if pod management policy of a stateful set is set to Parallel, some nodes may be discovered but will not be joined. To work around this, the Kubernetes peer discovery plugin uses randomized startup delays.

Deployments that use the OrderedReady pod management policy start pods one by one and therefore all discovered nodes will be ready to join. This policy is used by default by Kubernetes.
ansd commented 3 years ago

@toredash what K8s cluster (IaaS) do you use?

We observe this bug only sporadically when creating a new RabbitMQ cluster. Rolling updates are not affected because Parallel Pod Management doesn't apply to updates.

toredash commented 3 years ago

@toredash what K8s cluster (IaaS) do you use?

GKE

It's easy to fix as you can set this value manually here: https://github.com/rabbitmq/cluster-operator/blob/main/docs/api/rabbitmq.com.ref.asciidoc#k8s-api-github-com-rabbitmq-cluster-operator-api-v1beta1-statefulsetspec

Modifying this value is not allowed, so the StatefulSet must be destroyed and recreated afterwards if the value changes.

We observe this bug only sporadically when creating a new RabbitMQ cluster. Rolling updates are not affected because Parallel Pod Management doesn't apply to updates.

What updates are you referring to? Rolling updates only apply to containers, labels, resource requests/limits, and annotations for the Pods in a StatefulSet: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#update-strategies

If you want to see how badly this can break, do a kubectl --namespace rabbitmq-system rollout restart statefulset rabbitmq-operator-server; I'm pretty sure the cluster will break, as it is not shut down cleanly.

ansd commented 3 years ago

@toredash I'm not sure what issue you observe, but I'm sure that the issue you observe is completely different from this GitHub issue.

This GitHub issue is about nodes failing to cluster for newly created RabbitMQ Clusters. So far we observe this sporadically only on kind and on k3s.

What updates are you referring to?

I'm referring to the rolling update of the Pods in the StatefulSet. Rolling updates work perfectly from what we can tell. We didn't observe any issues. If you do observe an issue with rolling updates, please open a separate GitHub issue with exact steps to reproduce.

toredash commented 3 years ago

@toredash I'm not sure what issue you observe, but I'm sure that the issue you observe is completely different from this GitHub issue.

This GitHub issue is about nodes failing to cluster for newly created RabbitMQ Clusters. So far we observe this sporadically only on kind and on k3s.

This is what we are seeing. If we create a cluster without podManagementPolicy: OrderedReady, sometimes it will fail. I think it is a timing issue: if node-0 and node-1 each believe they are the first to start the cluster, they will form independent clusters.

With podManagementPolicy:OrderedReady set, we are not able to reproduce this.

michaelklishin commented 3 years ago

@toredash the approach and thinking of what's recommended have evolved over the 3-4 years since K8S peer discovery was introduced, in large part because of all the work that has gone into this Operator.

That section of the docs was not updated after our team decided that this Operator can, in fact, quite safely use parallel pod startup. The only tricky part is that the randomized startup delay for K8S uses a very narrow default of [0, 2] seconds, compared to [5, 30] or something like that in other places. That narrow default was chosen to avoid delaying deployments where pods are started one by one. In the brave new world of parallel deployment, those settings are counterproductive.

The default is hardcoded and can be easily changed. Tomorrow we should have an image you can try that will use whatever values you provide.

michaelklishin commented 3 years ago

RabbitMQ nodes will log the effective randomized startup delay value and the range it was randomly picked from:

Randomized startup delay: configured range is from … to … milliseconds, PRNG pick: …

That is logged at debug level, so if you have a way to reproduce, please set the log level to debug and look for that message on all nodes (each node uses its own delay: that's the point).
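
With Operator-managed clusters, debug logging can be enabled via `log.console.level = debug` in `spec.rabbitmq.additionalConfig` (the same key appears in a config shared later in this thread); the message can then be pulled out of the pod logs, for example:

```shell
kubectl logs --selector app.kubernetes.io/name=quick-rabbit --prefix=true --tail=-1 \
  | grep "Randomized startup delay"
```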

mkuratczyk commented 3 years ago

The Operator already sets the following range in the configuration:

    cluster_formation.randomized_startup_delay_range.min = 5
    cluster_formation.randomized_startup_delay_range.max = 30

It seems like this range is too narrow to prevent the race condition during cluster formation. We've discussed that and decided that:

  1. As a short-term solution, we'll look into making this default range wider. However, you can do that right now through additionalConfig (see the sketch after this list). A range like 5-300 would almost certainly make this problem disappear.
  2. This will take longer, but we'll look into changing the k8s peer discovery plugin to use locking instead of a random delay. This should ultimately solve the issue.
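
A hedged sketch of the short-term workaround from item 1, using `kubectl patch` (note that a merge patch replaces the whole `additionalConfig` string, so include any settings you already have there):

```shell
kubectl patch rabbitmqcluster quick-rabbit --type merge -p '
spec:
  rabbitmq:
    additionalConfig: |
      cluster_formation.randomized_startup_delay_range.min = 5
      cluster_formation.randomized_startup_delay_range.max = 300
'
```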

It's worth remembering that we introduced the parallel policy for a reason - it was to solve a problem where a cluster wouldn't start at all if multiple pods were deleted (if all of them were deleted due to an outage, it was almost guaranteed it wouldn't start). Therefore, going back to OrderedReady is not a solution - it's just choosing one problem over another. We decided that an issue with a newly deployed cluster is less of a problem than a cluster that won't start after an outage, because new clusters by definition don't have any data or clients yet.

michaelklishin commented 3 years ago

5-300 sounds pretty broad but OK. Note that the peer discovery plugin will filter out pods that are not considered ready. This significantly increases the likelihood of a node discovering no peers.

Locking is an option as some other peer discovery plugins use it (etcd, Consul). We just need to decide on what would be an optimal locking mechanism in the Kubernetes API. Locking is a non-trivial problem in practice, so it can take a while.

mkuratczyk commented 3 years ago

I've run some tests to see how often the issue occurs. With default Operator settings - min=5, max=30, the issue occurred in 7% of cases (I created 100 3-node clusters for this test). The main reason is that the startup delay is not uniformly distributed between min and max - the delay value is generated between 0 and max and only then corrected - if it happens to be below min, it is set to min: https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_peer_discovery.erl#L191-L194. This leads to a relatively large number of nodes waiting exactly the minimum required amount of time. For example, here are some of my results (first column is the number of nodes in the cluster as reported by each node so 212 means nodes 0 and 2 were clustered but node 1 was solo; the next 3 columns are the delays reported by each node):

122 5000 6761 9769
212 5000 7239 5000
221 12020 5000 5000
221 25781 5000 5000
221 7832 11910 8080
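
To make the effect of that clamping concrete, here is a rough shell simulation of the behaviour described above (not RabbitMQ's actual code, which is linked above): it picks uniformly in [0, max] and raises anything below min to exactly min, so with min=5000 ms and max=30000 ms roughly one in six nodes ends up waiting exactly 5000 ms.

```shell
min=5000; max=30000; trials=1000; at_min=0
for _ in $(seq $trials); do
  d=$(( (RANDOM * 32768 + RANDOM) % (max + 1) ))  # roughly uniform in [0, max]
  [ "$d" -lt "$min" ] && d=$min                   # clamp to min, as in the linked code
  [ "$d" -eq "$min" ] && at_min=$((at_min + 1))
done
echo "$at_min of $trials simulated delays were exactly ${min} ms"
```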

There are a few observations:

I then deployed 100 clusters with min=0 and max=60 and the issue did not occur even once.

Conclusions:

01045972746 commented 3 years ago

I have the same situation when creating a new cluster. Before reading this, I thought the operator had some issue with permissions.

[root@mycluster ~]$ kubectl -n my-rabbit logs my-q-server-2 -c setup-container
chown: changing ownership of '/var/lib/rabbitmq/mnesia/': Operation not permitted

[root@mycluster ~]$ kubectl -n my-rabbit get po
NAME                               READY   STATUS        RESTARTS   AGE
my-q-server-0             1/1     Running       0          45m
my-q-server-1             1/1     Running       0          45m
my-q-server-2             0/1     Running       0          45m

Is this issue related to the peer discovery timing discussed in this thread?

mkuratczyk commented 3 years ago

No, this is something different. Please open a separate issue and provide all the necessary details. Also, if that's on Openshift, please have a look at https://www.rabbitmq.com/kubernetes/operator/install-operator.html#openshift if you haven't.

ansd commented 3 years ago

Closing this issue since it got fixed in https://github.com/rabbitmq/rabbitmq-server/pull/3075 and will be available in RabbitMQ >= 3.8.18 as well as RabbitMQ 3.9.x.

For RabbitMQ < 3.8.18, https://github.com/rabbitmq/cluster-operator/pull/675 decreases the likelihood of this issue well enough for the time being. (If you still observe this issue, feel free to set cluster_formation.randomized_startup_delay_range.max to a value larger than 60.)

discostur commented 2 years ago

Just wanted to let you know that I experienced this issue today with

RabbitMQ v3.8.27 and RabbitMQ Operator v1.11.1

Was able to fix it with

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app

I think the operator should be aware of the cluster state and not just of the running pods ...

Regards Kilian

mkuratczyk commented 2 years ago

We are now using a global lock which should prevent such issues. If it happened, that sounds like a bug in this locking logic. Can you share the logs, ideally from all nodes as they formed the cluster?

discostur commented 2 years ago

@mkuratczyk I think I found the error and it might not be related to this issue ... seems like a problem with the feature flags. This is the log from the one container which does not connect correctly to the cluster:

2022-03-07 16:38:07.331 [info] <0.11977.0> Successfully synced tables from a peer
2022-03-07 16:38:07.331 [warning] <0.11977.0> Feature flags: the previous instance of this node must have failed to write the `feature_flags` file at `/bitnami/rabbitmq/mnesia/rabbit@rabbitmq-test-backend-dev-server-2.rabbitmq-test-backend-dev-nodes.test-backend-infrastructure-dev-feature_flags`:
2022-03-07 16:38:07.331 [warning] <0.11977.0> Feature flags:   - list of previously disabled feature flags now marked as such: [empty_basic_get_metric]
2022-03-07 16:38:07.351 [error] <0.11977.0> Feature flags: error while running rabbit_feature_flags:do_sync_feature_flags_with_node[[drop_unroutable_metric,empty_basic_get_metric,implicit_default_bindings,maintenance_mode_status,quorum_queue,user_limits,virtual_host_metadata]] on node `rabbit@rabbitmq-test-backend-dev-server-1.rabbitmq-test-backend-dev-nodes.test-backend-infrastructure-dev`: {'EXIT',{{badmap,undefined},[{maps,get,[depends_on,undefined,[]],[{file,"maps.erl"},{line,517},{error_info,#{module => erl_stdlib_errors}}]},{rabbit_feature_flags,enable_dependencies,2,[{file,"src/rabbit_feature_flags.erl"},{line,1564}]},{rabbit_feature_flags,do_enable_locally,1,[{file,"src/rabbit_feature_flags.erl"},{line,1544}]},{rabbit_feature_flags,do_sync_feature_flags_with_node,1,[{file,"src/rabbit_feature_flags.erl"},{line,2174}]}]}}
2022-03-07 16:38:07.359 [info] <0.44.0> Application mnesia exited with reason: stopped
2022-03-07 16:38:07.359 [error] <0.11977.0> 
2022-03-07 16:38:07.359 [error] <0.11977.0> BOOT FAILED
Logger - error: 
BOOT FAILED
2022-03-07 16:38:07.359 [error] <0.11977.0> ===========
===========
Error during startup: {error,
2022-03-07 16:38:07.359 [error] <0.11977.0> Error during startup: {error,
                       {incompatible_feature_flags,
                        {badrpc,
2022-03-07 16:38:07.359 [error] <0.11977.0>                        {incompatible_feature_flags,
2022-03-07 16:38:07.359 [error] <0.11977.0>                         {badrpc,
2022-03-07 16:38:07.359 [error] <0.11977.0>                          {'EXIT',
2022-03-07 16:38:07.359 [error] <0.11977.0>                           {{badmap,undefined},
                         {'EXIT',
                          {{badmap,undefined},
                           [{maps,get,
2022-03-07 16:38:07.359 [error] <0.11977.0>                            [{maps,get,
                             [depends_on,undefined,[]],
2022-03-07 16:38:07.360 [error] <0.11977.0>                              [depends_on,undefined,[]],
2022-03-07 16:38:07.360 [error] <0.11977.0>                              [{file,"maps.erl"},
                             [{file,"maps.erl"},
2022-03-07 16:38:07.360 [error] <0.11977.0>                               {line,517},
                              {line,517},
2022-03-07 16:38:07.360 [error] <0.11977.0>                               {error_info,#{module => erl_stdlib_errors}}]},
                              {error_info,#{module => erl_stdlib_errors}}]},
2022-03-07 16:38:07.360 [error] <0.11977.0>                             {rabbit_feature_flags,enable_dependencies,2,
                            {rabbit_feature_flags,enable_dependencies,2,
                             [{file,"src/rabbit_feature_flags.erl"},
2022-03-07 16:38:07.360 [error] <0.11977.0>                              [{file,"src/rabbit_feature_flags.erl"},
2022-03-07 16:38:07.360 [error] <0.11977.0>                               {line,1564}]},
                              {line,1564}]},
2022-03-07 16:38:07.360 [error] <0.11977.0>                             {rabbit_feature_flags,do_enable_locally,1,
                            {rabbit_feature_flags,do_enable_locally,1,
2022-03-07 16:38:07.360 [error] <0.11977.0>                              [{file,"src/rabbit_feature_flags.erl"},
                             [{file,"src/rabbit_feature_flags.erl"},
2022-03-07 16:38:07.361 [error] <0.11977.0>                               {line,1544}]},
                              {line,1544}]},
2022-03-07 16:38:07.361 [error] <0.11977.0>                             {rabbit_feature_flags,
                            {rabbit_feature_flags,
2022-03-07 16:38:07.361 [error] <0.11977.0>                              do_sync_feature_flags_with_node,1,
                             do_sync_feature_flags_with_node,1,
2022-03-07 16:38:07.361 [error] <0.11977.0>                              [{file,"src/rabbit_feature_flags.erl"},
                             [{file,"src/rabbit_feature_flags.erl"},
                              {line,2174}]}]}}}}}
2022-03-07 16:38:07.361 [error] <0.11977.0>                               {line,2174}]}]}}}}}

2022-03-07 16:38:07.361 [error] <0.11977.0> 
2022-03-07 16:38:08.362 [error] <0.11976.0> CRASH REPORT Process <0.11976.0> with 0 neighbours exited with reason: {{incompatible_feature_flags,{badrpc,{'EXIT',{{badmap,undefined},[{maps,get,[depends_on,undefined,[]],[{file,"maps.erl"},{line,517},{error_info,#{module => erl_stdlib_errors}}]},{rabbit_feature_flags,enable_dependencies,2,[{file,"src/rabbit_feature_flags.erl"},{line,1564}]},{rabbit_feature_flags,do_enable_locally,1,[{file,"src/rabbit_feature_flags.erl"},{line,1544}]},{rabbit_feature_flags,do_sync_feature_flags_with_node,1,[{file,"src/rabbit_feature_flags.erl"},{line,2174}]}]}}}},{rabbit,start,...}} in application_master:init/4 line 142
2022-03-07 16:38:08.362 [info] <0.44.0> Application rabbit exited with reason: {{incompatible_feature_flags,{badrpc,{'EXIT',{{badmap,undefined},[{maps,get,[depends_on,undefined,[]],[{file,"maps.erl"},{line,517},{error_info,#{module => erl_stdlib_errors}}]},{rabbit_feature_flags,enable_dependencies,2,[{file,"src/rabbit_feature_flags.erl"},{line,1564}]},{rabbit_feature_flags,do_enable_locally,1,[{file,"src/rabbit_feature_flags.erl"},{line,1544}]},{rabbit_feature_flags,do_sync_feature_flags_with_node,1,[{file,"src/rabbit_feature_flags.erl"},{line,2174}]}]}}}},{rabbit,start,...}}
2022-03-07 16:38:08.365 [info] <0.44.0> Application sysmon_handler exited with reason: stopped
2022-03-07 16:38:08.369 [info] <0.44.0> Application ra exited with reason: stopped
2022-03-07 16:38:08.371 [info] <0.44.0> Application os_mon exited with reason: stopped
2022-03-07 16:38:08.372 [error] <0.9646.0> rabbit_outside_app_process:
{error,{rabbit,{{incompatible_feature_flags,{badrpc,{'EXIT',{{badmap,undefined},[{maps,get,[depends_on,undefined,[]],[{file,"maps.erl"},{line,517},{error_info,#{module => erl_stdlib_errors}}]},{rabbit_feature_flags,enable_dependencies,2,[{file,"src/rabbit_feature_flags.erl"},{line,1564}]},{rabbit_feature_flags,do_enable_locally,1,[{file,"src/rabbit_feature_flags.erl"},{line,1544}]},{rabbit_feature_flags,do_sync_feature_flags_with_node,1,[{file,"src/rabbit_feature_flags.erl"},{line,2174}]}]}}}},{rabbit,start,[normal,[]]}}}}
[{rabbit,start_it,1,[{file,"src/rabbit.erl"},{line,382}]},{rabbit_node_monitor,do_run_outside_app_fun,1,[{file,"src/rabbit_node_monitor.erl"},{line,754}]}]

The Docker container is still running because the health checks only check the TCP connection.

Should I open a new issue?

PS: I just saw that most of my queues got lost after resetting the one container and connecting it to the other two nodes / cluster ... persistence was enabled for all nodes ...

dumbbell commented 2 years ago

Hi!

@discostur: Could you please share the entire log files from all cluster members? It would make it easier to understand what went wrong.

If you can reproduce, it would be interesting to get those logs with debug logging enabled as well.

discostur commented 2 years ago

@dumbbell I just bootstrapped a new 3-node cluster and I'm able to reproduce:

Seeing this via the web UI. Debug logs from all three pods are attached.

Regarding the other feature flags error / missing queues after a rolling upgrade, I'm still trying to reproduce ...

rabbit_no_cluster_node2.txt rabbit_no_cluster_node1.txt rabbit_no_cluster_node0.txt

discostur commented 2 years ago

@dumbbell I was just able to reproduce the loss of one out of three queues after a rolling restart (I just changed one config value, which makes the operator trigger a rolling restart). Queues before restart:

A policy with ha-mode: all and ha-sync-mode: automatic was applied beforehand, and the queues were running on all three nodes before the restart. However, in this run all three nodes were able to form a cluster (I didn't have to stop / reset one node).

rabbit_missing_queue_node0.txt rabbit_missing_queue_node2.txt rabbit_missing_queue_node1.txt

mkuratczyk commented 2 years ago

I think the problems are related to the image you are using. I see "bitnami" in your paths, so I guess you are trying to use the Operator with the Bitnami image. Can you provide your cluster definition YAML? We ran some compatibility tests in the past and our Bitnami colleagues added some variables for compatibility, but it's not a well-tested / commonly used combination.

discostur commented 2 years ago

@mkuratczyk correct, I use the operator deployed via the Helm chart (Bitnami is the only option at the moment, as far as I know ...).

My cluster definition:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-test
  namespace: default
spec:
  replicas: 3
  resources:
    requests:
      cpu: 2
      memory: 1Gi
    limits:
      cpu: 4
      memory: 4Gi
  rabbitmq:
    additionalPlugins:
      - rabbitmq_federation
      - rabbitmq_federation_management
      - rabbitmq_auth_backend_ldap
    additionalConfig: |
      cluster_partition_handling = pause_minority
      vm_memory_high_watermark_paging_ratio = 0.99
      disk_free_limit.relative = 1.0
      collect_statistics_interval = 10000
      log.console = true
      log.console.level = debug
      ## LDAP
      # first - use internal database
      auth_backends.1 = internal
      # second - fall back to LDAP for authentication
      auth_backends.2.authn = ldap
      # second - use internal database for authorisation
      auth_backends.2.authz = internal
      # ldap connection
      auth_ldap.servers.2 = XXX
      auth_ldap.dn_lookup_bind.user_dn = XXX
      auth_ldap.dn_lookup_bind.password = XXX
      auth_ldap.dn_lookup_attribute = sAMAccountName
      auth_ldap.dn_lookup_base = XXX
  persistence:
    storageClassName: csi-rbd-ssd
    storage: "5Gi"
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - rabbitmq-test
        topologyKey: kubernetes.io/hostname
mkuratczyk commented 2 years ago

Deploying the Operator itself with a chart is one thing, but you seem to be using the Bitnami RabbitMQ container image (I can see Welcome to the Bitnami rabbitmq container in the logs), even though your YAML doesn't define the image, so it should default to https://hub.docker.com/_/rabbitmq. Can you explicitly set image to the non-Bitnami image and see if you still have the problems?

Side note: using a different memory request and limit is risky (the Operator prints a warning about it)

Zerpet commented 2 years ago

Similar report in #989. @discostur could you try with a statefulset override to set the pod management policy to OrderedReady?

Adding the override would look like this:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq-test
  namespace: default
spec:
  override:
    statefulSet:
      spec:
        podManagementPolicy: "OrderedReady"
[...]
discostur commented 2 years ago

Sorry for the late response ... I just had a little time to debug it further.

@Zerpet your override does fix the problem during cluster creation. After I applied it, I tried to reproduce the bootstrap / cluster formation hiccup, but it always worked. So this definitely fixes the cluster creation bug. Thanks ;)

@mkuratczyk I tried several times without specifically defining the rabbitmq image in the cluster YAML, but it seems the Bitnami operator uses its own images by default (it also seems to still deploy 3.8.x images instead of 3.9.x images). However, with the override from @Zerpet I wasn't able to reproduce any issues at all.

thanks ;)

mkuratczyk commented 2 years ago

Well, I can tell you exactly how to reproduce an issue when OrderedReady is set - just delete all of the RabbitMQ pods. Unless you are lucky and the last pod to go down happens to be server-0, your cluster won't start. This issue had been reported multiple times before we switched to Parallel. In my opinion, setting this value to OrderedReady is a mistake - you are risking a much bigger issue (a previously working/used cluster doesn't work after an outage) to solve a minor one (a cluster that was never used by anyone is not working properly).
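
For reference, the full-outage scenario described above can be simulated like this (a sketch; the label value is the RabbitmqCluster name):

```shell
# Delete every pod of the cluster at once to simulate a complete outage.
kubectl delete pod --selector app.kubernetes.io/name=myrabbit
# With OrderedReady, pods 1 and 2 are not recreated until pod 0 is Ready,
# while node 0 may be waiting for the peers that were up when it stopped.
```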

Parallel should work. We've tested it extensively. If it doesn't work in some strange case - this should be diagnosed and fixed.

toredash commented 2 years ago

Parallel should work. We've tested it extensively. If it doesn't work in some strange case - this should be diagnosed and fixed.

Parallel is the only option that is correct here, IMO. I gave up trying to explain this in https://github.com/rabbitmq/cluster-operator/issues/662#issuecomment-829069757. If people want to spend hours debugging this to save a few seconds when rolling new clusters or doing maintenance, that's fine.

michaelklishin commented 2 years ago

@toredash there are many tricky aspects about forming a cluster or restarting it when it comes to stateful distributed data services. Blanket statements are not appreciated in this community because they throw all nuance out the window.

toredash commented 2 years ago

@toredash there are many tricky aspects about forming a cluster or restarting it when it comes to stateful distributed data services. Blanket statements are not appreciated in this community because they throw all nuance out the window.

You know what, you're right. I realize I didn't read the last response fully and did not understand it. I re-read it and would like to comment again:

Parallel is not the correct option; OrderedReady is the correct option. I fail to understand why Parallel is used as of now, as it introduces the possibility of forming multiple independent clusters when you create a new StatefulSet. This is because the Service endpoint is not a reliable source of peers during discovery, as the different RabbitMQ instances will report Ready at different times.

This can all be avoided with OrderedReady. Let the first node complete its boot process, allow Kubernetes to mark it as Healthy, then continue with creating the second, third and fourth pod, in an Ordered, and Ready, fashion.

michaelklishin commented 2 years ago

Because RabbitMQ nodes' expectations of when their peers come up do not match those of OrderedReady with an ill-picked health check:

This can be addressed in 4.0 with a new schema database, but honestly, RabbitMQ nodes do not expect anything unreasonable; they are just not web app instances, which Kubernetes deployment options are heavily geared towards (let's call a spade a spade).

So a workaround is needed and parallel is one of the options on the table.

ansd commented 2 years ago

This is because the Service endpoint is not a reliable source of peers during discovery as the different rabbitmq instances will report Ready at different times.

@toredash this is not correct. That's what the publishNotReadyAddresses field is for. This field is set by the rabbitmq/cluster-operator here and by the Bitnami chart since November via https://github.com/bitnami/charts/pull/8135.
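
A quick way to verify this on a running cluster; a sketch, assuming the headless Service follows the `<cluster-name>-nodes` naming seen in the node names above:

```shell
kubectl get service rabbitmq-test-nodes -o jsonpath='{.spec.publishNotReadyAddresses}'
# expected output: true
```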

See also

discostur commented 2 years ago

Would it not be possible to set OrderedReady on cluster creation for correct bootstrapping, and once the cluster is set up, have the operator patch the StatefulSet to Parallel? That way, if you face a cluster crash / reboot of all pods, everything should come up correctly.

As far as my tests showed, the situation where one node is standalone and two form a cluster only happens during cluster bootstrapping.

mkuratczyk commented 2 years ago

No, Kubernetes does not allow changing this property.

To recap my position on this: Parallel matches RabbitMQ's expectations (restarting nodes wait for the nodes that were still up when they stopped, which leads to a deadlock if other nodes don't attempt to start until the first node is ready) and solves the problem of the cluster not starting after a complete outage, which was reported multiple times. To make it work well during cluster formation, we implemented a locking mechanism, which shipped in 3.8.18 and should prevent issues in cluster formation. We deployed hundreds of clusters to validate these changes and we keep using the Operator and deploying clusters since then. I personally deploy 10+ per day, often dozens a day, and I haven't seen a single problem with cluster formation.

This doesn't mean that everything is bug-free, but this issue only affects freshly deployed clusters (clusters that no one has used before) under some specific conditions (based on the very low number of reports we have received about this). Having such clusters not start correctly is not great, but it's much better than risking every Operator-managed cluster not restarting correctly after an outage, with no obvious way to make it start.

What else can we do?

  1. In 3.9.9 we added cluster_formation.target_cluster_size_hint, which is currently used to skip definitions import until all expected nodes join the cluster. I think we should set it by default in Operator deployments (unfortunately this will need to be done through command-line flags, since we don't have a mechanism to only set it for deployments of 3.9.9+, but that's a detail - it can be done). A sketch of setting it yourself via additionalConfig follows this list.
  2. we can extend the usage of the target_cluster_size_hint property further, to guard against an incorrectly formed cluster being used; eg we can reject any queue definitions if the number of nodes is smaller than the expected target (see https://github.com/rabbitmq/rabbitmq-server/issues/3850 for more around this)
  3. we can consider adding a check in the Operator that will validate that all nodes discovered each other and would re-create nodes (or whole clusters) in case of a failure
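
A hedged sketch of setting the hint from item 1 yourself via additionalConfig (requires RabbitMQ 3.9.9+; resource names here are examples):

```shell
kubectl apply -f - <<'EOF'
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: myrabbit
spec:
  replicas: 3
  rabbitmq:
    additionalConfig: |
      cluster_formation.target_cluster_size_hint = 3
EOF
```
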
mkuratczyk commented 2 years ago

I realized that technically it's possible to change it - we actually suggested this process (in the opposite direction) as the restart-post-outage solution: a StatefulSet can be created with OrderedReady, then deleted with --cascade=orphan and re-declared with Parallel. But my intuition would be against this - it would need to happen for every deployed cluster (since we want to end up with Parallel), and deleting and redeclaring every StatefulSet the Operator creates feels wrong and dangerous - I don't think it's a widely used pattern, and if it failed for any reason, the user could end up with the Pods but without a StatefulSet, or something like that. I'd rather have the Operator detect the problem and only take action when it occurred.
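
A rough sketch of the orphan-delete sequence described above, for illustration only (the author advises against doing this routinely; names are examples):

```shell
kubectl get statefulset myrabbit-server -o yaml > sts.yaml   # save the current definition
kubectl delete statefulset myrabbit-server --cascade=orphan  # remove the StatefulSet, keep the Pods
# Edit sts.yaml: set spec.podManagementPolicy to Parallel and strip server-generated
# fields (status, resourceVersion, uid, creationTimestamp) before re-applying.
kubectl apply -f sts.yaml
```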

brunocascio commented 1 year ago

```shell
rabbitmqctl start_app
```

I had to do this in order to get my cluster back to work. Running the 3.1.4 cluster-operator Helm chart (RabbitMQ 3.10.11 image).

Sjd-Risca commented 4 months ago

I can confirm that the issue happened to me today as well.

I was creating a new cluster starting with the following resource:

This happened only once, due to a delay in server availability: because of a resource shortage in the Kubernetes cluster, at first only one node of the StatefulSet was created; the other two servers joined later, only when new virtual machines were added to the Kubernetes node pool. Due to this timing issue in the StatefulSet setup, one server promoted itself to leader and did not join the other two servers, which instead clustered with each other.

The proposed solution of resetting the individual node via rabbitmqctl reset solved the issue as expected.

Just one warning: from the Kubernetes point of view the cluster was healthy, with all three nodes deployed, even though the cluster itself was split in half. Only from the management interface was it detectable that the StatefulSet was in fact split into two sub-clusters (2 nodes vs 1 node).

mkuratczyk commented 4 months ago

There were significant changes to peer discovery in 3.13, which is also the only version with community support. We'll not investigate a 3.12 issue at this point.

For 4.0, we plan a complete overhaul of the k8s peer discovery plugin. I'm positive it'll be rock solid, while also being much simpler.