rabbitmq / rabbitmq-autocluster

RabbitMQ peer discovery and cluster formation plugin, supports RabbitMQ 3.6.x
BSD 3-Clause "New" or "Revised" License

RabbitMQ Server takes about 3-4 minutes to start with autocluster enabled #27

Closed: hvishwanath closed this issue 7 years ago

hvishwanath commented 7 years ago

I am trying to set up a RabbitMQ cluster with the autocluster plugin. With v0.6.1 the broker starts promptly; with v0.7.0 it takes about 3-4 minutes to start.

The only difference I see between 0.6.1 and 0.7.0 is that 0.7.0 pulls in the rabbitmq_aws plugin as a dependency. I also see that with 0.7.0 (the version with the 3-4 minute delay) the following error is logged:

=ERROR REPORT==== 19-May-2017::02:22:26 ===
Failed to retrieve AWS credentials: undefined

This is with the backend configured as etcd, so I am not sure why AWS credentials are being checked at all.

Below is my config (along with RABBITMQ_USE_LONGNAME=true):

[
  {rabbit, [
    {tcp_listen_options, [
                          {backlog,       128},
                          {nodelay,       true},
                          {linger,        {true,0}},
                          {exit_on_close, false},
                          {sndbuf,        12000},
                          {recbuf,        12000}
                         ]},
    {loopback_users, []}
  ]},

  {autocluster, [
    {dummy_param_without_comma, true},
    {backend, etcd},
    {autocluster_failure, stop},
    {cleanup_interval, 30},
    {cluster_cleanup, true},
    {cleanup_warn_only, false},
    {etcd_ttl, 30},
    {etcd_scheme, http},
    {etcd_host, "etcd.kube-system.svc.cluster.local"},
    {etcd_port, 2379}
   ]}
].
michaelklishin commented 7 years ago

@hvishwanath please enable debug logging and post what's produced in both cases.
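
For reference, debug logging for this plugin can be enabled in the autocluster section of rabbitmq.config (a minimal sketch extending the config you posted; autocluster_log_level is the plugin's log level setting, which defaults to info):

  {autocluster, [
    {autocluster_log_level, debug},
    {backend, etcd}
    %% ...rest of the settings unchanged
  ]}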

hvishwanath commented 7 years ago

@michaelklishin: There are no logs that really point to why the broker takes so long to start up.

This is what I see with v0.7.0:

root@havishwa-multi-worker-all-services-1:/opt/maglev/srv/rabbitmq# tail -f /var/log/rabbitmq/rabbit\@10.195.127.233.log

=ERROR REPORT==== 17-May-2017::13:38:59 ===
Failed to retrieve AWS credentials: undefined

=INFO REPORT==== 17-May-2017::13:38:59 ===
autocluster: (cleanup) Timer started {30,false}

=INFO REPORT==== 17-May-2017::13:38:59 ===
Starting RabbitMQ 3.6.9 on Erlang 18.3
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

=INFO REPORT==== 17-May-2017::13:38:59 ===
node           : rabbit@10.195.127.233
home dir       : /root
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash    : STdhXwYYAi0vzoYORYQmzA==
log            : /var/log/rabbitmq/rabbit@10.195.127.233.log
sasl log       : /var/log/rabbitmq/rabbit@10.195.127.233-sasl.log
database dir   : /var/lib/rabbitmq/mnesia/rabbit@10.195.127.233

=INFO REPORT==== 17-May-2017::13:39:00 ===
autocluster: Delaying startup for 3453ms.

=INFO REPORT==== 17-May-2017::13:39:04 ===
autocluster: Starting etcd registration.

=INFO REPORT==== 17-May-2017::13:39:04 ===
autocluster: Registering node with etcd.

=INFO REPORT==== 17-May-2017::13:39:04 ===
Memory limit set to 12869MB of 32174MB total.

=INFO REPORT==== 17-May-2017::13:39:04 ===
Disk free limit set to 50MB

=INFO REPORT==== 17-May-2017::13:39:04 ===
Limiting to approx 924 file handles (829 sockets)

=INFO REPORT==== 17-May-2017::13:39:04 ===
FHC read buffering:  OFF
FHC write buffering: ON

=INFO REPORT==== 17-May-2017::13:39:04 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 17-May-2017::13:39:04 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left

=INFO REPORT==== 17-May-2017::13:39:04 ===
Priority queues enabled, real BQ is rabbit_variable_queue

=INFO REPORT==== 17-May-2017::13:39:04 ===
Starting rabbit_node_monitor

=INFO REPORT==== 17-May-2017::13:39:04 ===
Management plugin: using rates mode 'basic'

=INFO REPORT==== 17-May-2017::13:39:04 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 17-May-2017::13:39:04 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 17-May-2017::13:39:04 ===
msg_store_persistent: rebuilding indices from scratch

=INFO REPORT==== 17-May-2017::13:39:04 ===
started TCP Listener on [::]:5672

=INFO REPORT==== 17-May-2017::13:39:04 ===
autocluster: Starting node key update TTL Timer

=INFO REPORT==== 17-May-2017::13:39:04 ===
Management plugin started. Port: 15672

=INFO REPORT==== 17-May-2017::13:39:04 ===
Statistics database started.

=INFO REPORT==== 17-May-2017::13:39:04 ===
Server startup complete; 12 plugins started.
 * rabbitmq_shovel_management
 * rabbitmq_federation_management
 * rabbitmq_management
 * rabbitmq_web_dispatch
 * cowboy
 * cowlib
 * rabbitmq_federation
 * rabbitmq_shovel
 * rabbitmq_management_agent
 * amqp_client
 * autocluster
 * rabbitmq_aws

This is what I see with v0.6.1:

root@rabbitmq-0:/usr/lib/rabbitmq/bin# ./rabbitmq-server start

=INFO REPORT==== 19-May-2017::02:42:39 ===
Starting RabbitMQ 3.6.9 on Erlang 18.3
Copyright (C) 2007-2016 Pivotal Software, Inc.
Licensed under the MPL.  See http://www.rabbitmq.com/

              RabbitMQ 3.6.9. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: tty
  ######  ##        /var/log/rabbitmq/rabbit@10.168.172.223-sasl.log
  ##########
              Starting broker...

=INFO REPORT==== 19-May-2017::02:42:39 ===
node           : rabbit@10.168.172.223
home dir       : /root
config file(s) : /etc/rabbitmq/rabbitmq.config
cookie hash    : pR5H9kY3Wra/XdLELT5hgQ==
log            : tty
sasl log       : /var/log/rabbitmq/rabbit@10.168.172.223-sasl.log
database dir   : /opt/maglev/srv/rabbitmq/rabbit@10.168.172.223

=INFO REPORT==== 19-May-2017::02:42:41 ===
autocluster: log level set to debug

=INFO REPORT==== 19-May-2017::02:42:41 ===
autocluster: Using etcd backend

=INFO REPORT==== 19-May-2017::02:42:41 ===
autocluster: Delaying startup for 3779ms.

=INFO REPORT==== 19-May-2017::02:42:45 ===
autocluster: Starting etcd registration.

=INFO REPORT==== 19-May-2017::02:42:45 ===
autocluster: GET http://etcd.kube-system.svc.cluster.local:2379/v2/keys/rabbitmq/default?recursive=true

=INFO REPORT==== 19-May-2017::02:42:45 ===
autocluster: Response: [{ok,{{"HTTP/1.1",200,"OK"},
                             [{"date","Fri, 19 May 2017 02:42:45 GMT"},
                              {"content-length","270"},
                              {"content-type","application/json"},
                              {"x-etcd-cluster-id","cdf818194e3a8c32"},
                              {"x-etcd-index","13683"},
                              {"x-raft-index","2500904"},
                              {"x-raft-term","2"}],
                             "{\"action\":\"get\",\"node\":{\"key\":\"/rabbitmq/default\",\"dir\":true,\"nodes\":[{\"key\":\"/rabbitmq/default/10.168.172.223\",\"value\":\"enabled\",\"expiration\":\"2017-05-19T02:42:47.045916053Z\",\"ttl\":2,\"modifiedIndex\":13683,\"createdIndex\":13683}],\"modifiedIndex\":204,\"createdIndex\":204}}\n"}}]

=INFO REPORT==== 19-May-2017::02:42:45 ===
autocluster: Discovered ['rabbit@10.168.172.223']

=INFO REPORT==== 19-May-2017::02:42:45 ===
autocluster: No other nodes are registed with the backend.

=INFO REPORT==== 19-May-2017::02:42:45 ===
Memory limit set to 12869MB of 32174MB total.

=INFO REPORT==== 19-May-2017::02:42:45 ===
Disk free limit set to 50MB

=INFO REPORT==== 19-May-2017::02:42:45 ===
Limiting to approx 1048476 file handles (943626 sockets)
..
..

The only difference is that with v0.7.0, the first log line is the error about undefined AWS credentials.
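
For what it's worth, the etcd call itself returns within the same second (see the timestamps above). That can be double-checked from the node with curl against the same endpoint the plugin queries (URL taken from the 0.6.1 debug log above):

curl -s "http://etcd.kube-system.svc.cluster.local:2379/v2/keys/rabbitmq/default?recursive=true"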

michaelklishin commented 7 years ago

@hvishwanath if you want to get to the bottom of this, you need to collect as much information as possible, and the debug log from this plugin is essential.

There are three explanations that are easy to imagine:

  1. The randomized startup delay is unusually long
  2. Communication with etcd is slow
  3. DNS resolution of the etcd hostname is slow or failing

The first is proved wrong by the following log entry:

=INFO REPORT==== 17-May-2017::13:39:00 ===
autocluster: Delaying startup for 3453ms.

It's just about 3.5 seconds, comparable to the ~3.7 seconds seen with 0.6.1.

The third can be triaged by adding an entry for etcd.kube-system.svc.cluster.local to /etc/hosts (and by using dig to check how that name resolves).
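
A minimal sketch of that triage (the IP below is hypothetical; substitute whatever address dig returns, or the known etcd service address):

dig +short etcd.kube-system.svc.cluster.local
# if resolution is slow or flaky, pin the name while testing:
echo "10.0.0.5 etcd.kube-system.svc.cluster.local" >> /etc/hosts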

By far the most efficient way to investigate the second hypothesis is to take a traffic capture and inspect it in Wireshark.
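
For example (a sketch, assuming the etcd host and port from the config above; open the resulting file in Wireshark):

tcpdump -i eth0 -w etcd.pcap host etcd.kube-system.svc.cluster.local and port 2379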

However, this team does not use GitHub issues for investigations, root cause analysis, and so on. Please take this to rabbitmq-users; if and when we learn more, we'd be happy to file a new issue should there be something to improve in this plugin.

michaelklishin commented 7 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:

  1. GitHub issues are not used for questions, investigations, root cause analysis, or discussions of potential issues (as defined by this team)
  2. We have a certain amount of information to work with

We get at least a dozen questions through various venues every single day, often quite light on details. At that rate, GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered mailing list material by our team. Please post this to rabbitmq-users.

Getting all the details necessary to make a conclusion, or even form a hypothesis about what's happening, can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing as much relevant information as possible on the list.

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have enough details and evidence we'd be happy to file a new issue.

Thank you.

michaelklishin commented 7 years ago

@hvishwanath please leave a link to the rabbitmq-users thread when you post.

hvishwanath commented 7 years ago

@michaelklishin: Thanks, I will post to the mailing list as well. But I just wanted to add: it seems the rabbitmq_aws plugin always makes a call to the EC2 instance metadata server, irrespective of the configured backend. This is what I see in tcpdump:

root@rabbitmq-0:/# tcpdump -i eth0 dst 169.254.169.254
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes

03:01:19.720263 IP rabbitmq-0.rabbitmq.maglev-system.svc.cluster.local.38890 > 169.254.169.254.80: Flags [S], seq 4159351748, win 29200, options [mss 1460,sackOK,TS val 222882878 ecr 0,nop,wscale 7], length 0
03:01:19.720947 IP rabbitmq-0.rabbitmq.maglev-system.svc.cluster.local.38890 > 169.254.169.254.80: Flags [.], ack 285812580, win 229, options [nop,nop,TS val 222882878 ecr 873474382], length 0
03:01:19.728853 IP rabbitmq-0.rabbitmq.maglev-system.svc.cluster.local.38890 > 169.254.169.254.80: Flags [P.], seq 0:134, ack 1, win 229, options [nop,nop,TS val 222882880 ecr 873474382], length 134: HTTP: GET /latest/meta-data/placement/availability-zone HTTP/1.1

and it waits there until the retries and timeouts are exhausted.
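
That stall is easy to reproduce by hand. In most non-AWS environments, requests to 169.254.169.254 (the EC2 instance metadata address) hang or fail, so each attempt has to run out its timeout (a sketch; the path matches the request in the capture above):

curl -v --max-time 5 http://169.254.169.254/latest/meta-data/placement/availability-zone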

michaelklishin commented 7 years ago

@hvishwanath so something is trying to contact an AWS-specific instance metadata resource even though you are using etcd. Thanks, that's enough for us to investigate.

dcorbacho commented 7 years ago

This is https://github.com/gmr/rabbitmq-aws/issues/2

michaelklishin commented 7 years ago

The underlying issue turned out to be rabbitmq/rabbitmq-aws#3. Thank you for providing the details, @hvishwanath!

michaelklishin commented 7 years ago

@hvishwanath can you please give https://github.com/rabbitmq/rabbitmq-autocluster/releases/tag/0.8.0-milestone2 a try?

hvishwanath commented 7 years ago

@michaelklishin @dcorbacho Awesome! Thanks - works fine.

Any idea when I can expect a 0.8.0 general release? Can I use the current milestone-2 release in production?

michaelklishin commented 7 years ago

@hvishwanath in a week or two, depending on how soon #22 lands. We can't tell you whether you should use this milestone, but we consider it worth trying, so give it a shot in your QA environment and see how it goes.