helm / charts

⚠️(OBSOLETE) Curated applications for Kubernetes
Apache License 2.0

[stable/rabbitmq-ha] After master node reschedule, slave nodes could not reconnect #8627

Closed quorak closed 5 years ago

quorak commented 6 years ago

Is this a request for help?: It would be great to understand why this happened.

BUG REPORT:

Master logs after the node was rescheduled:

2018-10-19 05:05:14.444 [info] <0.201.0> 
 Starting RabbitMQ 3.7.7 on Erlang 20.3.4
 Copyright (C) 2007-2018 Pivotal Software, Inc.
 Licensed under the MPL.  See http://www.rabbitmq.com/

  ##  ##
  ##  ##      RabbitMQ 3.7.7. Copyright (C) 2007-2018 Pivotal Software, Inc.
  ##########  Licensed under the MPL.  See http://www.rabbitmq.com/
  ######  ##
  ##########  Logs: <stdout>

              Starting broker...
2018-10-19 05:05:14.479 [info] <0.201.0> 
 node           : rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.conf
 cookie hash    : haE63uCbnShjyV1gSacthw==
 log(s)         : <stdout>
 database dir   : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local
2018-10-19 05:05:19.442 [info] <0.209.0> Memory high watermark set to 244 MiB (256000000 bytes) of 3855 MiB (4042719232 bytes) total
2018-10-19 05:05:19.448 [info] <0.211.0> Enabling free disk space monitoring
2018-10-19 05:05:19.448 [info] <0.211.0> Disk free limit set to 50MB
2018-10-19 05:05:19.451 [info] <0.213.0> Limiting to approx 1048476 file handles (943626 sockets)
2018-10-19 05:05:19.451 [info] <0.214.0> FHC read buffering:  OFF
2018-10-19 05:05:19.451 [info] <0.214.0> FHC write buffering: ON
2018-10-19 05:05:19.455 [info] <0.201.0> Node database directory at /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local is empty. Assuming we need to join an existing cluster or initialise from scratch...
2018-10-19 05:05:19.455 [info] <0.201.0> Configured peer discovery backend: rabbit_peer_discovery_k8s
2018-10-19 05:05:19.455 [info] <0.201.0> Will try to lock with peer discovery backend rabbit_peer_discovery_k8s
2018-10-19 05:05:19.455 [info] <0.201.0> Peer discovery backend does not support locking, falling back to randomized delay
2018-10-19 05:05:19.455 [info] <0.201.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping randomized startup delay.
2018-10-19 05:05:19.509 [info] <0.201.0> k8s endpoint listing returned nodes not yet ready: rabbitmq-rabbitmq-ha-2, rabbitmq-rabbitmq-ha-0, rabbitmq-rabbitmq-ha-1
2018-10-19 05:05:19.509 [info] <0.201.0> All discovered existing cluster peers: 
2018-10-19 05:05:19.509 [info] <0.201.0> Discovered no peer nodes to cluster with
2018-10-19 05:05:19.513 [info] <0.33.0> Application mnesia exited with reason: stopped
2018-10-19 05:05:19.537 [info] <0.33.0> Application mnesia started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.612 [info] <0.201.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2018-10-19 05:05:19.668 [info] <0.201.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2018-10-19 05:05:19.732 [info] <0.201.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2018-10-19 05:05:19.732 [info] <0.201.0> Peer discovery backend rabbit_peer_discovery_k8s does not support registration, skipping registration.
2018-10-19 05:05:19.746 [info] <0.201.0> Priority queues enabled, real BQ is rabbit_variable_queue
2018-10-19 05:05:19.758 [info] <0.388.0> Starting rabbit_node_monitor
2018-10-19 05:05:19.806 [info] <0.201.0> message_store upgrades: 1 to apply
2018-10-19 05:05:19.806 [info] <0.201.0> message_store upgrades: Applying rabbit_variable_queue:move_messages_to_vhost_store
2018-10-19 05:05:19.806 [info] <0.201.0> message_store upgrades: No durable queues found. Skipping message store migration
2018-10-19 05:05:19.806 [info] <0.201.0> message_store upgrades: Removing the old message store data
2018-10-19 05:05:19.808 [info] <0.201.0> message_store upgrades: All upgrades applied successfully
2018-10-19 05:05:19.855 [info] <0.201.0> Management plugin: using rates mode 'basic'
2018-10-19 05:05:19.858 [info] <0.201.0> Applying definitions from: /etc/definitions/definitions.json
2018-10-19 05:05:19.858 [info] <0.201.0> Asked to import definitions. Acting user: <<"rmq-internal">>
2018-10-19 05:05:19.858 [info] <0.201.0> Importing users...
2018-10-19 05:05:19.858 [info] <0.201.0> Creating user 'management'
2018-10-19 05:05:19.860 [info] <0.201.0> Setting user tags for user 'management' to [management]
2018-10-19 05:05:19.863 [info] <0.201.0> Creating user 'guest'
2018-10-19 05:05:19.865 [info] <0.201.0> Setting user tags for user 'guest' to [administrator]
2018-10-19 05:05:19.866 [info] <0.201.0> Importing vhosts...
2018-10-19 05:05:19.866 [info] <0.201.0> Adding vhost '/'
2018-10-19 05:05:19.882 [info] <0.440.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2018-10-19 05:05:19.887 [info] <0.440.0> Starting message stores for vhost '/'
2018-10-19 05:05:19.888 [info] <0.444.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2018-10-19 05:05:19.891 [info] <0.440.0> Started message store of type transient for vhost '/'
2018-10-19 05:05:19.891 [info] <0.447.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2018-10-19 05:05:19.893 [warning] <0.447.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2018-10-19 05:05:19.894 [info] <0.440.0> Started message store of type persistent for vhost '/'
2018-10-19 05:05:19.896 [info] <0.201.0> Importing user permissions...
2018-10-19 05:05:19.896 [info] <0.201.0> Setting permissions for 'guest' in '/' to '.*', '.*', '.*'
2018-10-19 05:05:19.898 [info] <0.201.0> Importing topic permissions...
2018-10-19 05:05:19.898 [info] <0.201.0> Importing parameters...
2018-10-19 05:05:19.898 [info] <0.201.0> Importing global parameters...
2018-10-19 05:05:19.898 [info] <0.201.0> Importing policies...
2018-10-19 05:05:19.898 [info] <0.201.0> Importing queues...
2018-10-19 05:05:19.898 [info] <0.201.0> Importing exchanges...
2018-10-19 05:05:19.898 [info] <0.201.0> Importing bindings...
2018-10-19 05:05:19.901 [info] <0.483.0> started TCP Listener on [::]:5672
2018-10-19 05:05:19.905 [info] <0.201.0> Setting up a table for connection tracking on this node: 'tracked_connection_on_node_rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.910 [info] <0.201.0> Setting up a table for per-vhost connection counting on this node: 'tracked_connection_per_vhost_on_node_rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.911 [info] <0.33.0> Application rabbit started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.912 [info] <0.33.0> Application rabbitmq_federation started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.913 [info] <0.33.0> Application rabbitmq_consistent_hash_exchange started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.919 [info] <0.33.0> Application rabbitmq_management_agent started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.919 [info] <0.540.0> Peer discovery: enabling node cleanup (will only log warnings). Check interval: 10 seconds.
2018-10-19 05:05:19.919 [info] <0.33.0> Application rabbitmq_peer_discovery_common started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.920 [info] <0.33.0> Application rabbitmq_peer_discovery_k8s started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.920 [info] <0.33.0> Application rabbitmq_amqp1_0 started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.922 [info] <0.33.0> Application rabbitmq_shovel started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.923 [info] <0.33.0> Application cowboy started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.923 [info] <0.33.0> Application rabbitmq_web_dispatch started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.996 [info] <0.565.0> Management plugin started. Port: 15672
2018-10-19 05:05:19.997 [info] <0.671.0> Statistics database started.
2018-10-19 05:05:19.999 [info] <0.33.0> Application rabbitmq_management started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.999 [info] <0.33.0> Application rabbitmq_shovel_management started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-19 05:05:19.999 [info] <0.33.0> Application rabbitmq_federation_management started on node 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
 completed with 11 plugins.
2018-10-19 05:05:20.432 [info] <0.5.0> Server startup complete; 11 plugins started.
 * rabbitmq_federation_management
 * rabbitmq_shovel_management
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * rabbitmq_shovel
 * rabbitmq_amqp1_0
 * rabbitmq_peer_discovery_k8s
 * rabbitmq_peer_discovery_common
 * rabbitmq_management_agent
 * rabbitmq_consistent_hash_exchange
 * rabbitmq_federation
2018-10-19 05:05:23.869 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:05:36.488 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:05:42.190 [info] <0.730.0> accepting AMQP connection <0.730.0> (10.244.1.31:37794 -> 10.244.4.189:5672)
2018-10-19 05:05:42.251 [info] <0.730.0> connection <0.730.0> (10.244.1.31:37794 -> 10.244.4.189:5672): user 'guest' authenticated and granted access to vhost '/'
2018-10-19 05:05:44.575 [info] <0.750.0> accepting AMQP connection <0.750.0> (10.244.1.30:53676 -> 10.244.4.189:5672)
2018-10-19 05:05:44.635 [info] <0.750.0> connection <0.750.0> (10.244.1.30:53676 -> 10.244.4.189:5672): user 'guest' authenticated and granted access to vhost '/'
2018-10-19 05:05:44.655 [error] <0.759.0> Channel error on connection <0.750.0> (10.244.1.30:53676 -> 10.244.4.189:5672, vhost: '/', user: 'guest'), channel 1:
operation basic.consume caused a channel exception not_found: no queue 'google' in vhost '/'
2018-10-19 05:05:44.671 [warning] <0.750.0> closing AMQP connection <0.750.0> (10.244.1.30:53676 -> 10.244.4.189:5672, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection
2018-10-19 05:06:37.299 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:06:49.781 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:07:10.598 [info] <0.946.0> accepting AMQP connection <0.946.0> (10.244.5.68:53998 -> 10.244.4.189:5672)
2018-10-19 05:07:10.733 [info] <0.946.0> connection <0.946.0> (10.244.5.68:53998 -> 10.244.4.189:5672): user 'guest' authenticated and granted access to vhost '/'
2018-10-19 05:07:46.001 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:07:57.866 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:08:31.024 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:08:43.073 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:09:28.456 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:09:40.445 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:10:55.543 [info] <0.1604.0> accepting AMQP connection <0.1604.0> (10.244.1.30:55792 -> 10.244.4.189:5672)
2018-10-19 05:10:55.599 [info] <0.1604.0> connection <0.1604.0> (10.244.1.30:55792 -> 10.244.4.189:5672): user 'guest' authenticated and granted access to vhost '/'
2018-10-19 05:11:38.318 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:11:50.225 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:12:28.160 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:12:40.073 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:17:07.428 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:17:19.290 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:17:54.595 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:18:06.620 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:22:33.978 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:22:45.708 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:23:16.131 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:23:28.363 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:27:55.238 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:28:06.978 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:28:38.792 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:28:50.657 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:33:20.215 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:33:32.239 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:34:01.426 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:34:13.351 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:38:44.013 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:38:55.771 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:39:24.108 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:39:36.051 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:44:12.207 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:44:24.092 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:44:53.673 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:45:05.711 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:49:33.180 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:49:44.942 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:50:17.647 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:50:29.600 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:54:56.924 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:55:08.766 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 05:55:46.877 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 05:55:59.106 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:00:20.455 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:00:32.520 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:01:16.937 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:01:29.083 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:05:49.786 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:06:01.762 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:06:48.750 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:07:00.798 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:11:18.387 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:11:30.108 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:12:11.716 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:12:23.942 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:16:46.670 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:16:58.591 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:17:32.201 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:17:44.266 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:22:21.429 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up
2018-10-19 06:22:33.241 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' down: connection_closed
2018-10-19 06:23:02.249 [info] <0.388.0> node 'rabbit@rabbitmq-rabbitmq-ha-1.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' up

Slave logs (this pod was not rescheduled):

2018-10-21 08:06:04.837 [info] <0.33.0> Application lager started on node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-21 08:06:08.378 [info] <0.33.0> Application mnesia started on node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-21 08:06:08.380 [info] <0.33.0> Application mnesia exited with reason: stopped
2018-10-21 08:06:08.979 [info] <0.33.0> Application recon started on node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-21 08:06:08.979 [info] <0.33.0> Application xmerl started on node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-21 08:06:08.979 [info] <0.33.0> Application amqp10_common started on node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-21 08:06:08.980 [info] <0.33.0> Application crypto started on node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-21 08:06:08.980 [info] <0.33.0> Application cowlib started on node 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'
2018-10-21 08:06:09.080 [error] <0.165.0> Mnesia('rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'): ** ERROR ** (could not write core file: eacces)
 ** FATAL ** Failed to merge schema: Bad cookie in table definition 'tracked_connection_per_vhost_on_node_rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local': 'rabbit@rabbitmq-rabbitmq-ha-2.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' = {cstruct,'tracked_connection_per_vhost_on_node_rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local',set,['rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'],[],[],[],0,read_write,false,[],[],false,tracked_connection_per_vhost,[vhost,connection_count],[],[],[],{{1537521852877698000,-576460752303423438,1},'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'},{{2,0},[]}}, 'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local' = {cstruct,'tracked_connection_per_vhost_on_node_rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local',set,['rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'],[],[],[],0,read_write,false,[],[],false,tracked_connection_per_vhost,[vhost,connection_count],[],[],[],{{1539925519905953001,-576460752303423438,1},'rabbit@rabbitmq-rabbitmq-ha-0.rabbitmq-rabbitmq-ha-discovery.workflow.svc.cluster.local'},{{2,0},[]}}

2018-10-21 08:06:19.079 [error] <0.164.0> Supervisor mnesia_sup had child mnesia_kernel_sup started with mnesia_kernel_sup:start() at undefined exit with reason killed in context start_error
2018-10-21 08:06:19.080 [error] <0.168.0> ** Generic server mnesia_monitor terminating 
** Last message in was {'EXIT',<0.167.0>,killed}
** When Server state == {state,<0.167.0>,[],[],true,[],undefined,[],[]}
** Reason for termination == 
** killed
2018-10-21 08:06:19.080 [error] <0.171.0> ** Generic server mnesia_recover terminating 
** Last message in was {'EXIT',<0.167.0>,killed}
** When Server state == {state,<0.167.0>,undefined,undefined,undefined,0,false,true,[]}
** Reason for termination == 
** killed
2018-10-21 08:06:19.081 [error] <0.171.0> CRASH REPORT Process mnesia_recover with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 410
2018-10-21 08:06:19.082 [error] <0.169.0> ** Generic server mnesia_subscr terminating 
** Last message in was {'EXIT',<0.167.0>,killed}
** When Server state == {state,<0.167.0>,#Ref<0.3264840277.3378642945.157025>}
** Reason for termination == 
** killed
2018-10-21 08:06:19.082 [error] <0.169.0> CRASH REPORT Process mnesia_subscr with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 410
2018-10-21 08:06:19.082 [error] <0.168.0> CRASH REPORT Process mnesia_monitor with 0 neighbours exited with reason: killed in gen_server:decode_msg/9 line 410
2018-10-21 08:06:19.082 [error] <0.162.0> CRASH REPORT Process <0.162.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}} in application_master:init/4 line 134
2018-10-21 08:06:19.082 [info] <0.33.0> Application mnesia exited with reason: {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}

BOOT FAILED
===========

Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 450
    rabbit:broker_start/0 line 326
    rabbit:start_apps/2 line 546
    app_utils:manage_applications/6 line 126
    lists:foldl/3 line 1263
    rabbit:'-handle_app_error/1-fun-0-'/3 line 642
throw:{could_not_start,mnesia,
          {mnesia,
              {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
               {mnesia_app,start,[normal,[]]}}}}
Log file(s) (may contain more information):
   <stdout>

2018-10-21 08:06:19.082 [error] <0.5.0> 
Error description:
    init:do_boot/3
    init:start_em/1
    rabbit:start_it/1 line 450
    rabbit:broker_start/0 line 326
    rabbit:start_apps/2 line 546
    app_utils:manage_applications/6 line 126
    lists:foldl/3 line 1263
    rabbit:'-handle_app_error/1-fun-0-'/3 line 642
throw:{could_not_start,mnesia,
          {mnesia,
              {{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},
               {mnesia_app,start,[normal,[]]}}}}
Log file(s) (may contain more information):
   <stdout>
2018-10-21 08:06:19.082 [info] <0.33.0> Application cowlib exited with reason: stopped
2018-10-21 08:06:19.082 [info] <0.33.0> Application crypto exited with reason: stopped
2018-10-21 08:06:19.082 [info] <0.33.0> Application amqp10_common exited with reason: stopped
2018-10-21 08:06:19.082 [info] <0.33.0> Application xmerl exited with reason: stopped
2018-10-21 08:06:19.082 [info] <0.33.0> Application recon exited with reason: stopped
{"init terminating in do_boot",{could_not_start,mnesia,{mnesia,{{shutdown,{failed_to_start_child,mnesia_kernel_sup,killed}},{mnesia_app,start,[normal,[]]}}}}}
init terminating in do_boot ({could_not_start,mnesia,{mnesia,{{shutdown,{_}},{mnesia_app,start,[_]}}}})

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

Version of Helm and Kubernetes: Helm v2.10.0, Kubernetes 1.9

Which chart: rabbitmq-ha-1.9.1

What happened: It looks like one node was down and Kubernetes scheduled the master pod to a new node. After successful startup of the master node, the slaves could not reconnect. When I deleted the slave pods, reconnection worked.

What you expected to happen: Rescheduling would work out of the box.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

steven-sheehy commented 6 years ago

> Bad cookie in table definition

The Erlang cookie has to be the same for all replicas, not randomly generated. Verify it is identical both in the logs and on the container filesystem.
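One way to check, sketched here assuming the pod names and `workflow` namespace shown in the logs above, and RabbitMQ's default cookie location:

```shell
# Compare the Erlang cookie across all three replicas; the output must match.
# Pod names and namespace are taken from the logs above; adjust for your release.
for i in 0 1 2; do
  kubectl -n workflow exec "rabbitmq-rabbitmq-ha-$i" -- \
    cat /var/lib/rabbitmq/.erlang.cookie
  echo
done
```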

Are you using a persistent volume claim? Persistent volumes can also be topology aware and may restrict rescheduling. Local volumes, for example, don't allow pods to be rescheduled.

quorak commented 6 years ago

Thanks for the reply. I use the chart templates with their default values. The node was down and did not come up, so rescheduling to a different node was necessary. Is this supported by the chart?

steven-sheehy commented 6 years ago

You should set rabbitmqErlangCookie yourself and set persistentVolume.enabled=true if you don't want trouble reconnecting. You can also set podAntiAffinity=hard to ensure the replicas get scheduled on different nodes.
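As a sketch, those three settings would look like this in a values file (the cookie value is a placeholder; generate your own fixed secret):

```yaml
# values.yaml for stable/rabbitmq-ha, with the settings suggested above.
rabbitmqErlangCookie: "REPLACE_WITH_A_FIXED_SECRET"
persistentVolume:
  enabled: true
podAntiAffinity: hard
```

applied with e.g. `helm install --name rabbitmq -f values.yaml stable/rabbitmq-ha` (Helm v2 syntax).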

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue is being automatically closed due to inactivity.