rabbit_epmd_monitor:check_epmd/1 fails with {error,nxdomain}

aldobongio commented 4 years ago

In a cluster of 3 instances our Grafana/Prometheus monitoring system detected all the nodes down for 3-4 minutes, in two distinct moments. In such intervals client applications (SpringBoot apps) had issues connecting to RabbitMQ. The cluster was created and launched 26 hours before the error events and worked without problems until the events. After each error event the clustered recovered automatically without any manual intervention, but we would like to understand what happened and the causes.

The monitoring system reported the following during the downtime:

{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}
{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:12Z"}
{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:18Z"}
{"error":"Get http://rabbitmq3:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:35Z"}
{"error":"Get http://rabbitmq1:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:42Z"}
{"error":"Get http://rabbitmq2:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:48Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:05Z"}
{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:12Z"}
{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:18Z"}
{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:35Z"}
{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:42Z"}
{"error":"Get http://rabbitmq2:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:48Z"}

{"error":"Get http://rabbitmq3:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:09Z"}
{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:10Z"}
{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:18Z"}
{"error":"Get http://rabbitmq3:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:39Z"}
{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:40Z"}
{"error":"Get http://rabbitmq2:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:48Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:09Z"}
{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:10Z"}
{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:18Z"}
{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:39Z"}
{"error":"Get http://rabbitmq1:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:40Z"}
{"error":"Get http://rabbitmq2:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:48Z"}
{"error":"Get http://rabbitmq3:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:09Z"}
{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:10Z"}
{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:18Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:39Z"}
{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:40Z"}
{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:48Z"}
{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:29:09Z"}

The logs of nodes 1 and 2 for the entire day were empty, whereas the node 3 had the following logs:

2019-10-02 21:05:52.587 [error] <0.1301.0> ** Generic server rabbit_epmd_monitor terminating
** Last message in was check
** When Server state == {state,{erlang,#Ref<0.933050920.3712483331.253384>},erl_epmd,"rabbitmq","rabbitmq3",25672}
** Reason for termination ==
** {{case_clause,{error,nxdomain}},[{rabbit_epmd_monitor,check_epmd,1,[{file,"src/rabbit_epmd_monitor.erl"},{line,100}]},{rabbit_epmd_monitor,handle_info,2,[{file,"src/rabbit_epmd_monitor.erl"},{line,79}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-10-02 21:05:52.587 [error] <0.1301.0> CRASH REPORT Process rabbit_epmd_monitor with 0 neighbours crashed with reason: no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100
2019-10-02 21:05:52.587 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.1301.0> exit with reason no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100 in context child_terminated
2019-10-02 21:06:00.588 [error] <0.15921.1> CRASH REPORT Process <0.15921.1> with 0 neighbours crashed with reason: no function clause matching rabbit_epmd_monitor:init_handle_port_please({error,nxdomain}, erl_epmd, "rabbitmq", "rabbitmq3") line 59
2019-10-02 21:06:00.589 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.1301.0> exit with reason no function clause matching rabbit_epmd_monitor:init_handle_port_please({error,nxdomain}, erl_epmd, "rabbitmq", "rabbitmq3") line 59 in context start_error

2019-10-02 23:29:03.236 [error] <0.15923.1> ** Generic server rabbit_epmd_monitor terminating
** Last message in was check
** When Server state == {state,{erlang,#Ref<0.933050920.3710386178.8766>},erl_epmd,"rabbitmq","rabbitmq3",25672}
** Reason for termination ==
** {{case_clause,{error,nxdomain}},[{rabbit_epmd_monitor,check_epmd,1,[{file,"src/rabbit_epmd_monitor.erl"},{line,100}]},{rabbit_epmd_monitor,handle_info,2,[{file,"src/rabbit_epmd_monitor.erl"},{line,79}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-10-02 23:29:03.236 [error] <0.15923.1> CRASH REPORT Process rabbit_epmd_monitor with 0 neighbours crashed with reason: no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100
2019-10-02 23:29:03.237 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.15923.1> exit with reason no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100 in context child_terminated

Nodes are running in a Docker Swarm using the latest Docker official version (3.7.18 at the time of this writing). Versions are the following:

 {running_applications,
     [{rabbitmq_management,"RabbitMQ Management Console","3.7.18"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.7.18"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.7.18"},
      {rabbit,"RabbitMQ","3.7.18"},
      {amqp_client,"RabbitMQ AMQP Client","3.7.18"},
      {rabbit_common,
          "Modules shared by rabbitmq-server and rabbitmq-erlang-client",
          "3.7.18"},
      {cowboy,"Small, fast, modern HTTP server.","2.6.1"},
      {cowlib,"Support library for manipulating Web protocols.","2.7.0"},
      {mnesia,"MNESIA  CXC 138 12","4.16.1"},
      {os_mon,"CPO  CXC 138 46","2.5.1"},
      {sysmon_handler,"Rate-limiting system_monitor event handler","1.1.0"},
      {lager,"Erlang logging framework","3.8.0"},
      {ranch,"Socket acceptor pool for TCP protocols.","1.7.1"},
      {ssl,"Erlang/OTP SSL application","9.4"},
      {public_key,"Public key infrastructure","1.7"},
      {credentials_obfuscation,
          "Helper library that obfuscates sensitive values in process state",
          "1.1.0"},
      {crypto,"CRYPTO","4.6"},
      {asn1,"The Erlang ASN1 compiler version 5.0.9","5.0.9"},
      {tools,"DEVTOOLS  CXC 138 16","3.2.1"},
      {observer_cli,"Visualize Erlang Nodes On The Command Line","1.5.2"},
      {recon,"Diagnostic tools for production use","2.5.0"},
      {stdout_formatter,
          "Tools to format paragraphs, lists and tables as plain text",
          "0.2.2"},
      {xmerl,"XML parser","1.3.22"},
      {jsx,"a streaming, evented json parsing toolkit","2.9.0"},
      {inets,"INETS  CXC 138 49","7.1"},
      {goldrush,"Erlang event stream processor","0.1.9"},
      {compiler,"ERTS  CXC 138 10","7.4.6"},
      {syntax_tools,"Syntax tools","2.2.1"},
      {sasl,"SASL  CXC 138 11","3.4.1"},
      {stdlib,"ERTS  CXC 138 10","3.10"},
      {kernel,"ERTS  CXC 138 10","6.5"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 22 [erts-10.5.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:64]\n"},

Results of rabbitmq-collect-env of the 3 nodes are attached. rabbitmq-env.zip

michaelklishin commented 4 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team).

We get at least a dozen of questions through various venues every single day, often light on details. At that rate GitHub issues can very quickly turn into a something impossible to navigate and make sense of even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal/noise ratio of issues is something we care about a lot.

Please post this to rabbitmq-users.

Thank you.

michaelklishin commented 4 years ago

According to this stack trace an internal component responsible for continuous re-registration of the node with epmd fails with nxdomain. nxdomain is UNIX speak for "domain cannot be resolved". Something on this host prevents the node from resolving its own domain.

michaelklishin commented 4 years ago

I don't see any evidence that this exception could affect your applications. You are welcome to share more logs on the mailing list. This component is a period background operation that is important for CLI tool and inter-node communication but is not at all related to client connections.

Most likely hostname resolution failed for the entire system, both epmd and clients, which is why they could not connect. Full logs of all nodes likely can confirm this hypothesis or at least provide extra clues.

michaelklishin commented 4 years ago

rabbitmq-users thread.

rabbitmq / rabbitmq-management

rabbit_epmd_monitor:check_epmd/1 fails with {error,nxdomain} #744