TwitchChen closed this issue 5 years ago
Thank you for your time.
Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc. (as defined by this team).
We get at least a dozen questions through various venues every single day, often light on details. At that rate GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal/noise ratio of issues is something we care about a lot.
Please post this to rabbitmq-users.
Thank you.
See How to Troubleshoot Peer Discovery (hint: most decisions are logged at debug level), How Does Peer Discovery Work (and when it is not performed), and finally, some Kubernetes-specific prerequisites.
While not really applicable to Kubernetes, as the log message says, the range of values in cluster_formation.randomized_startup_delay_range used in your config is very narrow and unlikely to address the problem it was designed to address. The default range is [5, 60]. With [0, 2] for the range, both nodes will effectively boot in parallel.
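For illustration only, here is a sketch of how the default range could be written in the ConfigMap's rabbitmq.conf entry. The two keys are the ones already present in the posted configuration; only the values differ. As the log message notes, the Kubernetes backend skips this delay, so this shows the setting rather than a fix for this particular setup.

```yaml
data:
  rabbitmq.conf: |
      # Default range: each node waits a random 5 to 60 seconds before
      # attempting peer discovery (with backends that use this delay)
      cluster_formation.randomized_startup_delay_range.min = 5
      cluster_formation.randomized_startup_delay_range.max = 60
```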
On an unrelated note, two-node clusters are highly discouraged: if the nodes lose connectivity, each side sees only 1 of 2 nodes, so neither can compute a majority. Some features in 3.8 will require a 3+ node cluster.
One of the log files contains the following clue:
2019-09-29 09:47:22.846 [info] <0.234.0> k8s endpoint listing returned nodes not yet ready: rabbitmq-0
2019-09-29 09:47:22.846 [info] <0.234.0> All discovered existing cluster peers:
According to the Kubernetes API, the pod of the discovered peer is not yet initialised. This is the case when the pods are booting in parallel. See this rabbitmq-users thread, for example. The docs now explicitly warn about this:
Stateless sets are also prone to the natural race condition during initial cluster formation, unlike stateful sets that initialise pods one by one.
Peer discovery mechanism will filter out nodes whose pods are not yet ready (initialised) according to the Kubernetes API. For example, if pod management policy of a stateful set is set to Parallel, some nodes can be discovered but will not be joined.
It is therefore necessary to use OrderedReady pod management policy for the sets used by RabbitMQ nodes. This policy is used by default by Kubernetes.
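As a minimal sketch, the policy can be stated explicitly in the StatefulSet spec. The resource and service names below are taken from the manifest posted in this thread; that manifest does not set the field at all, so it should already be getting the OrderedReady default.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: default
spec:
  # OrderedReady is the Kubernetes default: pods are created one at a time,
  # and each must be Ready before the next one starts.
  # "Parallel" would start all pods at once and trigger the race described above.
  podManagementPolicy: OrderedReady
  serviceName: rabbitmq-headless-srv
  replicas: 2
```

The rest of the spec (selector, template, volumes) is omitted here and would stay as in the original manifest.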
@michaelklishin Thank you for your reply. I tried again without the "cluster_formation.k8s.address_type = hostname" setting; the other settings, such as "cluster_formation.randomized_startup_delay_range", are unchanged.
rabbitmq-0's log
2019-09-29 02:29:33.786 [info] <0.219.0>
Starting RabbitMQ 3.7.16 on Erlang 22.0.7
Copyright (C) 2007-2019 Pivotal Software, Inc.
Licensed under the MPL. See https://www.rabbitmq.com/
## ##
## ## RabbitMQ 3.7.16. Copyright (C) 2007-2019 Pivotal Software, Inc.
########## Licensed under the MPL. See https://www.rabbitmq.com/
###### ##
########## Logs: <stdout>
Starting broker...
2019-09-29 02:29:33.794 [info] <0.219.0>
node : rabbit@172.31.92.92
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : XhdCf8zpVJeJ0EHyaxszPg==
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit@172.31.92.92
2019-09-29 02:29:36.066 [info] <0.219.0> Running boot step pre_boot defined by app rabbit
2019-09-29 02:29:36.066 [info] <0.219.0> Running boot step rabbit_core_metrics defined by app rabbit
2019-09-29 02:29:36.069 [info] <0.219.0> Running boot step rabbit_alarm defined by app rabbit
2019-09-29 02:29:36.078 [info] <0.227.0> Memory high watermark set to 1907 MiB (2000000000 bytes) of 515124 MiB (540147003392 bytes) total
2019-09-29 02:29:36.086 [info] <0.229.0> Enabling free disk space monitoring
2019-09-29 02:29:36.086 [info] <0.229.0> Disk free limit set to 4000MB
2019-09-29 02:29:36.094 [info] <0.219.0> Running boot step code_server_cache defined by app rabbit
2019-09-29 02:29:36.094 [info] <0.219.0> Running boot step file_handle_cache defined by app rabbit
2019-09-29 02:29:36.095 [info] <0.232.0> Limiting to approx 65436 file handles (58890 sockets)
2019-09-29 02:29:36.095 [info] <0.233.0> FHC read buffering: OFF
2019-09-29 02:29:36.095 [info] <0.233.0> FHC write buffering: ON
2019-09-29 02:29:36.095 [info] <0.219.0> Running boot step worker_pool defined by app rabbit
2019-09-29 02:29:36.095 [info] <0.220.0> Will use 48 processes for default worker pool
2019-09-29 02:29:36.095 [info] <0.220.0> Starting worker pool 'worker_pool' with 48 processes in it
2019-09-29 02:29:36.098 [info] <0.219.0> Running boot step database defined by app rabbit
2019-09-29 02:29:36.098 [info] <0.219.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@172.31.92.92 is empty. Assuming we need to join an existing cluster or initialise from scratch...
2019-09-29 02:29:36.098 [info] <0.219.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2019-09-29 02:29:36.098 [info] <0.219.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2019-09-29 02:29:36.098 [info] <0.219.0> Peer discovery backend does not support locking, falling back to randomized delay
2019-09-29 02:29:36.099 [info] <0.219.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2019-09-29 02:29:36.129 [info] <0.219.0> All discovered existing cluster peers: rabbit@172.31.92.92
2019-09-29 02:29:36.129 [info] <0.219.0> Discovered no peer nodes to cluster with
2019-09-29 02:29:36.134 [info] <0.43.0> Application mnesia exited with reason: stopped
2019-09-29 02:29:36.179 [info] <0.219.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 02:29:36.221 [info] <0.219.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 02:29:36.261 [info] <0.219.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2019-09-29 02:29:36.262 [info] <0.219.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step database_sync defined by app rabbit
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step codec_correctness_check defined by app rabbit
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step external_infrastructure defined by app rabbit
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step rabbit_registry defined by app rabbit
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step rabbit_auth_mechanism_cr_demo defined by app rabbit
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step rabbit_queue_location_random defined by app rabbit
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step rabbit_event defined by app rabbit
2019-09-29 02:29:36.262 [info] <0.219.0> Running boot step rabbit_auth_mechanism_amqplain defined by app rabbit
2019-09-29 02:29:36.263 [info] <0.219.0> Running boot step rabbit_auth_mechanism_plain defined by app rabbit
2019-09-29 02:29:36.263 [info] <0.219.0> Running boot step rabbit_exchange_type_direct defined by app rabbit
2019-09-29 02:29:36.263 [info] <0.219.0> Running boot step rabbit_exchange_type_fanout defined by app rabbit
2019-09-29 02:29:36.263 [info] <0.219.0> Running boot step rabbit_exchange_type_headers defined by app rabbit
2019-09-29 02:29:36.263 [info] <0.219.0> Running boot step rabbit_exchange_type_topic defined by app rabbit
2019-09-29 02:29:36.263 [info] <0.219.0> Running boot step rabbit_mirror_queue_mode_all defined by app rabbit
2019-09-29 02:29:36.263 [info] <0.219.0> Running boot step rabbit_mirror_queue_mode_exactly defined by app rabbit
2019-09-29 02:29:36.264 [info] <0.219.0> Running boot step rabbit_mirror_queue_mode_nodes defined by app rabbit
2019-09-29 02:29:36.264 [info] <0.219.0> Running boot step rabbit_priority_queue defined by app rabbit
2019-09-29 02:29:36.264 [info] <0.219.0> Priority queues enabled, real BQ is rabbit_variable_queue
2019-09-29 02:29:36.264 [info] <0.219.0> Running boot step rabbit_queue_location_client_local defined by app rabbit
2019-09-29 02:29:36.264 [info] <0.219.0> Running boot step rabbit_queue_location_min_masters defined by app rabbit
2019-09-29 02:29:36.264 [info] <0.219.0> Running boot step kernel_ready defined by app rabbit
2019-09-29 02:29:36.264 [info] <0.219.0> Running boot step rabbit_sysmon_minder defined by app rabbit
2019-09-29 02:29:36.264 [info] <0.219.0> Running boot step rabbit_epmd_monitor defined by app rabbit
2019-09-29 02:29:36.266 [info] <0.219.0> Running boot step guid_generator defined by app rabbit
2019-09-29 02:29:36.266 [info] <0.219.0> Running boot step rabbit_node_monitor defined by app rabbit
2019-09-29 02:29:36.267 [info] <0.452.0> Starting rabbit_node_monitor
2019-09-29 02:29:36.267 [info] <0.219.0> Running boot step delegate_sup defined by app rabbit
2019-09-29 02:29:36.267 [info] <0.219.0> Running boot step rabbit_memory_monitor defined by app rabbit
2019-09-29 02:29:36.268 [info] <0.219.0> Running boot step core_initialized defined by app rabbit
2019-09-29 02:29:36.268 [info] <0.219.0> Running boot step upgrade_queues defined by app rabbit
2019-09-29 02:29:36.305 [info] <0.219.0> message_store upgrades: 1 to apply
2019-09-29 02:29:36.305 [info] <0.219.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2019-09-29 02:29:36.306 [info] <0.219.0> message_store upgrades: No durable queues found. Skipping message store migration
2019-09-29 02:29:36.306 [info] <0.219.0> message_store upgrades: Removing the old message store data
2019-09-29 02:29:36.306 [info] <0.219.0> message_store upgrades: All upgrades applied successfully
2019-09-29 02:29:36.345 [info] <0.219.0> Running boot step rabbit_connection_tracking defined by app rabbit
2019-09-29 02:29:36.345 [info] <0.219.0> Running boot step rabbit_connection_tracking_handler defined by app rabbit
2019-09-29 02:29:36.345 [info] <0.219.0> Running boot step rabbit_exchange_parameters defined by app rabbit
2019-09-29 02:29:36.345 [info] <0.219.0> Running boot step rabbit_mirror_queue_misc defined by app rabbit
2019-09-29 02:29:36.346 [info] <0.219.0> Running boot step rabbit_policies defined by app rabbit
2019-09-29 02:29:36.347 [info] <0.219.0> Running boot step rabbit_policy defined by app rabbit
2019-09-29 02:29:36.347 [info] <0.219.0> Running boot step rabbit_queue_location_validator defined by app rabbit
2019-09-29 02:29:36.347 [info] <0.219.0> Running boot step rabbit_vhost_limit defined by app rabbit
2019-09-29 02:29:36.347 [info] <0.219.0> Running boot step rabbit_mgmt_reset_handler defined by app rabbitmq_management
2019-09-29 02:29:36.347 [info] <0.219.0> Running boot step rabbit_mgmt_db_handler defined by app rabbitmq_management_agent
2019-09-29 02:29:36.347 [info] <0.219.0> Management plugin: using rates mode 'basic'
2019-09-29 02:29:36.347 [info] <0.219.0> Running boot step recovery defined by app rabbit
2019-09-29 02:29:36.348 [info] <0.219.0> Running boot step load_definitions defined by app rabbitmq_management
2019-09-29 02:29:36.348 [info] <0.219.0> Running boot step empty_db_check defined by app rabbit
2019-09-29 02:29:36.348 [info] <0.219.0> Adding vhost '/'
2019-09-29 02:29:36.354 [info] <0.493.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@172.31.92.92/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2019-09-29 02:29:36.361 [info] <0.493.0> Starting message stores for vhost '/'
2019-09-29 02:29:36.361 [info] <0.497.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2019-09-29 02:29:36.362 [info] <0.493.0> Started message store of type transient for vhost '/'
2019-09-29 02:29:36.362 [info] <0.500.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2019-09-29 02:29:36.363 [warning] <0.500.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2019-09-29 02:29:36.370 [info] <0.493.0> Started message store of type persistent for vhost '/'
2019-09-29 02:29:36.371 [info] <0.219.0> Creating user 'guest'
2019-09-29 02:29:36.372 [info] <0.219.0> Setting user tags for user 'guest' to [administrator]
2019-09-29 02:29:36.373 [info] <0.219.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2019-09-29 02:29:36.373 [info] <0.219.0> Running boot step rabbit_looking_glass defined by app rabbit
2019-09-29 02:29:36.373 [info] <0.219.0> Running boot step rabbit_core_metrics_gc defined by app rabbit
2019-09-29 02:29:36.373 [info] <0.219.0> Running boot step background_gc defined by app rabbit
2019-09-29 02:29:36.374 [info] <0.219.0> Running boot step connection_tracking defined by app rabbit
2019-09-29 02:29:36.375 [info] <0.219.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@172.31.92.92'
2019-09-29 02:29:36.377 [info] <0.219.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@172.31.92.92'
2019-09-29 02:29:36.377 [info] <0.219.0> Running boot step routing_ready defined by app rabbit
2019-09-29 02:29:36.377 [info] <0.219.0> Running boot step pre_flight defined by app rabbit
2019-09-29 02:29:36.377 [info] <0.219.0> Running boot step notify_cluster defined by app rabbit
2019-09-29 02:29:36.377 [info] <0.219.0> Running boot step networking defined by app rabbit
2019-09-29 02:29:36.381 [warning] <0.532.0> Setting Ranch options together with socket options is deprecated. Please use the new map syntax that allows specifying socket options separately from other options.
2019-09-29 02:29:36.383 [info] <0.546.0> started TCP listener on [::]:5672
2019-09-29 02:29:36.383 [info] <0.219.0> Running boot step direct_client defined by app rabbit
2019-09-29 02:29:36.384 [info] <0.552.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 30 seconds.
2019-09-29 02:29:36.462 [info] <0.610.0> Management plugin: HTTP (non-TLS) listener started on port 15672
2019-09-29 02:29:36.462 [info] <0.716.0> Statistics database started.
2019-09-29 02:29:36.462 [info] <0.715.0> Starting worker pool 'management_worker_pool' with 3 processes in it
completed with 5 plugins.
2019-09-29 02:29:36.687 [info] <0.8.0> Server startup complete; 5 plugins started.
* rabbitmq_peer_discovery_k8s
* rabbitmq_management
* rabbitmq_web_dispatch
* rabbitmq_management_agent
* rabbitmq_peer_discovery_common
2019-09-29 02:30:14.075 [info] <0.452.0> node 'rabbit@172.31.92.123' up
2019-09-29 02:30:14.424 [info] <0.452.0> rabbit on node 'rabbit@172.31.92.123' up
rabbitmq_configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rabbitmq-config
  namespace: default
data:
  enabled_plugins: |
      [rabbitmq_management,rabbitmq_peer_discovery_k8s].
  rabbitmq.conf: |
      ## Cluster formation. See https://www.rabbitmq.com/cluster-formation.html to learn more.
      cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
      cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
      #cluster_formation.k8s.host = 10.254.0.1
      cluster_formation.k8s.port = 443
      cluster_formation.k8s.scheme = https
      cluster_formation.k8s.cert_path = /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      cluster_formation.k8s.token_path = /var/run/secrets/kubernetes.io/serviceaccount/token
      cluster_formation.k8s.namespace_path = /var/run/secrets/kubernetes.io/serviceaccount/namespace
      cluster_formation.randomized_startup_delay_range.min = 0
      cluster_formation.randomized_startup_delay_range.max = 2
      # service_name must be set, otherwise the Pod cannot start properly; once it is set here, the K8S_SERVICE_NAME variable in the StatefulSet's env does not need to be set
      cluster_formation.k8s.service_name = rabbitmq-headless-srv
      # hostname_suffix must be set, otherwise the nodes cannot form a cluster
      #cluster_formation.k8s.hostname_suffix = .rabbitmq-headless-srv.default.svc.cluster.local
      ## Should RabbitMQ node name be computed from the pod's hostname or IP address?
      ## IP addresses are not stable, so using [stable] hostnames is recommended when possible.
      ## Set to "hostname" to use pod hostnames.
      ## When this value is changed, so should the variable used to set the RABBITMQ_NODENAME
      ## environment variable.
      #cluster_formation.k8s.address_type = hostname
      ## How often should node cleanup checks run?
      cluster_formation.node_cleanup.interval = 30
      ## Set to false if automatic removal of unknown/absent nodes
      ## is desired. This can be dangerous, see
      ## * https://www.rabbitmq.com/cluster-formation.html#node-health-checks-and-cleanup
      ## * https://groups.google.com/forum/#!msg/rabbitmq-users/wuOfzEywHXo/k8z_HWIkBgAJ
      cluster_formation.node_cleanup.only_log_warning = true
      cluster_partition_handling = autoheal
      ## See https://www.rabbitmq.com/ha.html#master-migration-data-locality
      queue_master_locator=min-masters
      ## See https://www.rabbitmq.com/access-control.html#loopback-users
      loopback_users.guest = false
      #the memory limit
      vm_memory_high_watermark.absolute = 2GB
      #the disk limit
      disk_free_limit.absolute = 4GB
rabbitmq_statefulsets.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: default
spec:
  selector:
    matchLabels:
      app: rabbitmq
  serviceName: rabbitmq-headless-srv
  replicas: 2
  template:
    metadata:
      labels:
        app: rabbitmq
    spec:
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      imagePullSecrets:
      - name: default
      containers:
      - name: rabbitmq
        image: rabbitmq:k8s
        resources:
          limits:
            cpu: 2
            memory: 3Gi
          requests:
            cpu: 0.5
            memory: 1Gi
        volumeMounts:
        - name: config-volume
          mountPath: /etc/rabbitmq
        - name: rabbitmq-pvc
          mountPath: /var/lib/rabbitmq/mnesia
        ports:
        - name: http
          protocol: TCP
          containerPort: 15672
        - name: amqp
          protocol: TCP
          containerPort: 5672
        livenessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 60
          # See https://www.rabbitmq.com/monitoring.html for monitoring frequency recommendations.
          periodSeconds: 60
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command: ["rabbitmqctl", "status"]
          initialDelaySeconds: 20
          periodSeconds: 60
          timeoutSeconds: 10
        imagePullPolicy: IfNotPresent
        env:
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: K8S_SERVICE_NAME
          value: "rabbitmq-headless-srv"
        - name: RABBITMQ_NODENAME
          value: rabbit@$(MY_POD_NAME)
        #- name: K8S_HOSTNAME_SUFFIX
        #  value: ".$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
        - name: RABBITMQ_ERLANG_COOKIE
          value: "mycookie"
      volumes:
      - name: config-volume
        configMap:
          name: rabbitmq-config
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          - key: enabled_plugins
            path: enabled_plugins
      - name: rabbitmq-pvc
        hostPath:
          path: /pacloud/k8s/rabbitmq
I'm not sure what the question is.
At the end of the log a peer joins the cluster:
2019-09-29 02:30:14.075 [info] <0.452.0> node 'rabbit@172.31.92.123' up
2019-09-29 02:30:14.424 [info] <0.452.0> rabbit on node 'rabbit@172.31.92.123' up
If you want to see what Kubernetes API endpoint responses return, set the log level to debug.
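One way to do that, sketched as a ConfigMap fragment: add a console log level line to the rabbitmq.conf entry. This assumes the node logs to the console, which matches the "Logs: <stdout>" line in the posted output; a node logging to a file would use log.file.level instead.

```yaml
data:
  rabbitmq.conf: |
      # Log at debug level on the console so peer discovery decisions
      # and Kubernetes API responses show up in the pod logs
      log.console.level = debug
```

Debug logging is verbose, so it is usually switched back to the default level once troubleshooting is done.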
Previously initialised nodes (that is, nodes with an existing data directory) must be reset between clustering attempts, or they will behave as "rejoining nodes", which the docs cover.
For our team GitHub is not a support forum, so I am locking this issue. Please use the mailing list in the future.
I'm trying to form a RabbitMQ cluster with 2 nodes using rabbitmq-peer-discovery-k8s, but both RabbitMQ nodes are running standalone.
rabbitmq-0's log
rabbitmq-1's log
rabbitmq cluster_status
rabbitmq_configmap.yaml
rabbitmq_statefulsets.yaml