rabbitmq / rabbitmq-peer-discovery-aws

AWS-based peer discovery backend for RabbitMQ 3.7.0+

RabbitMQ dies when nodes inside ASG cannot properly communicate #9

Closed chuckyz closed 6 years ago

chuckyz commented 6 years ago

Hello,

RabbitMQ dies with the following error when a node cannot properly communicate with other RabbitMQ nodes inside an ASG.

2017-12-11 23:40:51 =CRASH REPORT====
  crasher:
    initial call: application_master:init/4
    pid: <0.192.0>
    registered_name: []
    exception exit: {{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{timeout,{gen_server,call,[rabbitmq_aws,{request,"autoscaling",get,[],"/?Action=DescribeAutoScalingInstances&Version=2011-01-01",[],[],undefined}]}}}}},[{application_master,init,4,[{file,"application_master.erl"},{line,134}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}
    ancestors: [<0.191.0>]
    message_queue_len: 1
    messages: [{'EXIT',<0.193.0>,normal}]
    links: [<0.191.0>,<0.33.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 987
    stack_size: 27
    reductions: 157
  neighbours:
2017-12-11 23:40:55 =SUPERVISOR REPORT====
     Supervisor: {local,httpc_handler_sup}
     Context:    shutdown_error
     Reason:     killed
     Offender:   [{nb_children,1},{id,undefined},{mfargs,{httpc_handler,start_link,[]}},{restart_type,temporary},{shutdown,4000},{child_type,worker}]

This can be resolved by correcting security group/subnet/VPC-networking issues; however, instances should probably throw errors instead of outright crashing.
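
One way to confirm the networking side before starting the node is to check that the instance can open a TCP connection to the regional Auto Scaling endpoint. The following is a minimal sketch, not part of the plugin: the function name `can_connect` is hypothetical, and the hostname assumes the standard regional Auto Scaling endpoint for the us-west-2 region used in the config further down, reached over HTTPS on port 443.

```python
import socket


def can_connect(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Assumed endpoint for the region in the config; adjust for other regions.
    endpoint = "autoscaling.us-west-2.amazonaws.com"
    status = "reachable" if can_connect(endpoint) else "NOT reachable"
    print(f"{endpoint}:443 is {status}")
```

A successful connection only shows that the network path (security groups, subnet routing, DNS) is open. It says nothing about whether the instance's IAM permissions allow the API call itself.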

This occurs with the following versions:

RPMs:

"https://dl.bintray.com/rabbitmq/all/rabbitmq-server/3.7.0/rabbitmq-server-3.7.0-1.el7.noarch.rpm"
"https://dl.bintray.com/rabbitmq/rpm/erlang/20/el/7/x86_64/erlang-20.1.7-1.el7.centos.x86_64.rpm"

RabbitMQ:

{running_applications,
     [{rabbitmq_peer_discovery_aws,
          "AWS-based RabbitMQ peer discovery backend","3.7.0"},
      {rabbitmq_peer_discovery_common,
          "Modules shared by various peer discovery backends","3.7.0"},
      {rabbitmq_management,"RabbitMQ Management Console","3.7.0"},
      {amqp_client,"RabbitMQ AMQP Client","3.7.0"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.7.0"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.7.0"},
      {rabbit,"RabbitMQ","3.7.0"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.7.0"},
      {recon,"Diagnostic tools for production use","2.3.2"},
      {ranch_proxy_protocol,"Ranch Proxy Protocol Transport","1.4.2"},
      {cowboy,"Small, fast, modern HTTP server.","2.0.0"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.4.0"},
      {rabbitmq_aws,
          "A minimalistic AWS API interface used by rabbitmq-autocluster (3.6.x) and other RabbitMQ plugins",
          "3.7.0"},
      {ssl,"Erlang/OTP SSL application","8.2.2"},
      {public_key,"Public key infrastructure","1.5.1"},
      {asn1,"The Erlang ASN1 compiler version 5.0.3","5.0.3"},
      {cowlib,"Support library for manipulating Web protocols.","2.0.0"},
      {crypto,"CRYPTO","4.1"},
      {xmerl,"XML parser","1.3.15"},
      {mnesia,"MNESIA  CXC 138 12","4.15.1"},
      {inets,"INETS  CXC 138 49","6.4.4"},
      {jsx,"a streaming, evented json parsing toolkit","2.8.2"},
      {os_mon,"CPO  CXC 138 46","2.4.3"},
      {lager,"Erlang logging framework","3.5.1"},
      {goldrush,"Erlang event stream processor","0.1.9"},
      {compiler,"ERTS  CXC 138 10","7.1.3"},
      {syntax_tools,"Syntax tools","2.1.3"},
      {sasl,"SASL  CXC 138 11","3.1"},
      {stdlib,"ERTS  CXC 138 10","3.4.2"},
      {kernel,"ERTS  CXC 138 10","5.4"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 20 [erts-9.1.5] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:64] [hipe] [kernel-poll:true]\n"},

The config file is as follows:

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_aws
cluster_formation.aws.region = us-west-2
cluster_formation.aws.use_autoscaling_group = true
cluster_formation.aws.use_private_ip = true
cluster_formation.node_cleanup.only_log_warning = false

management.load_definitions = /etc/rabbitmq/definitions.json

michaelklishin commented 6 years ago

Inability to contact critically important EC2 endpoints is considered to be fatal by this plugin. While the plugin could handle it and log something more sensible, consider this to be by design.
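
For illustration only, the more lenient behaviour mentioned above (log and retry rather than exit on the first failure) could look roughly like the sketch below. This is not how the plugin behaves and is not its API: `call_with_retries`, the attempt count, and the delay are all hypothetical, and the code is Python rather than the plugin's Erlang.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("discovery_sketch")


def call_with_retries(request: Callable[[], T], attempts: int = 3, delay: float = 1.0) -> T:
    """Call request(); on failure log a warning and retry, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except Exception as exc:
            if attempt == attempts:
                # Out of attempts: log clearly, then let the error propagate.
                log.error("request failed after %d attempts: %s", attempts, exc)
                raise
            log.warning("attempt %d/%d failed: %s; retrying in %.1fs",
                        attempt, attempts, exc, delay)
            time.sleep(delay)
    # Only reached when attempts < 1, so the loop body never ran.
    raise ValueError("attempts must be at least 1")
```

Even in this sketch the final failure still propagates, so the node would stop if the endpoint stays unreachable. The difference is only that the log shows each failed attempt before it does.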