Closed aldobongio closed 4 years ago
Thank you for your time.
Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team).
We get at least a dozen questions through various venues every single day, often light on details. At that rate GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal/noise ratio of issues is something we care about a lot.
Please post this to rabbitmq-users.
Thank you.
According to this stack trace, an internal component responsible for continuous re-registration of the node with epmd fails with nxdomain. nxdomain is UNIX speak for "domain cannot be resolved". Something on this host prevents the node from resolving its own domain.
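To check this from the affected host or container, something like the following could be used. It is only a minimal sketch using Python's standard library, not anything RabbitMQ ships: the helper name resolve_host is made up for illustration, and it asks the system resolver, which may not behave exactly like the Erlang runtime's resolver.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses a hostname resolves to, or [] on failure.

    resolve_host is a hypothetical helper name used only for this sketch.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The resolver could not map the name to an address,
        # which is the situation nxdomain describes.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    name = socket.gethostname()
    addresses = resolve_host(name)
    if addresses:
        print(f"{name} resolves to: {', '.join(addresses)}")
    else:
        print(f"{name} does not resolve on this host")
```

Running this in a loop during one of the outage windows would show whether the node's own hostname stops resolving at the same time the errors appear.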
I don't see any evidence that this exception could affect your applications. You are welcome to share more logs on the mailing list. This component is a periodic background operation that is important for CLI tools and inter-node communication but is not at all related to client connections.
Most likely hostname resolution failed for the entire system, both epmd and clients, which is why they could not connect. Full logs of all nodes can likely confirm this hypothesis or at least provide extra clues.
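One way to gather evidence for this hypothesis is to probe each cluster host from the place where the clients run. The sketch below is an assumption-laden illustration rather than a RabbitMQ tool: check_peer is a hypothetical helper, the hostnames are whatever you pass on the command line, and 4369 is the default epmd port, which may differ if it was overridden in your environment.

```python
import socket
import sys

EPMD_PORT = 4369  # default epmd port; adjust if your deployment overrides it


def check_peer(hostname, port=EPMD_PORT, timeout=2.0):
    """Return 'unresolvable', 'unreachable', or 'ok' for one host.

    check_peer is a hypothetical helper name used only for this sketch.
    """
    try:
        # Step 1: can the name be resolved at all?
        socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        return "unresolvable"
    try:
        # Step 2: does anything accept TCP connections on the port?
        with socket.create_connection((hostname, port), timeout=timeout):
            return "ok"
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return "unreachable"


if __name__ == "__main__":
    # Hostnames are supplied by the caller, e.g. the three node hostnames.
    for host in sys.argv[1:]:
        print(f"{host}: {check_peer(host)}")
```

If every host reports "unresolvable" during an outage and "ok" afterwards, that points to name resolution on the Docker network rather than to RabbitMQ itself.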
In a cluster of 3 instances, our Grafana/Prometheus monitoring system detected all the nodes as down for 3-4 minutes on two separate occasions. During those intervals, client applications (Spring Boot apps) had issues connecting to RabbitMQ. The cluster was created and launched 26 hours before the error events and worked without problems until then. After each error event the cluster recovered automatically without any manual intervention, but we would like to understand what happened and why.
The monitoring system reported the following during the downtime:
The logs of nodes 1 and 2 for the entire day were empty, whereas node 3 had the following logs:
Nodes are running in a Docker Swarm using the latest official Docker image (3.7.18 at the time of this writing). Versions are the following:
Results of rabbitmq-collect-env of the 3 nodes are attached. rabbitmq-env.zip