rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Nodes terminate without any clues in the log #1750

Closed jaco-terbraak closed 6 years ago

jaco-terbraak commented 6 years ago

Whenever I start my client, which creates 100 queues, 100 exchanges, 100 consumers, and 100 publishers, one or more nodes in the RabbitMQ cluster will inevitably crash within minutes. There seems to be nothing in the logs that points to what the problem is.
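For reference, the topology the client declares is roughly equivalent to the following rabbitmqadmin sketch (names and counts are illustrative; the actual client application is attached below as rabbitmq-test.zip, and it also runs the consumers and publishers):

```shell
# Declare 100 direct exchanges, 100 durable queues, and bind each pair.
# This only approximates the declaration phase of the attached client.
for i in $(seq 1 100); do
  rabbitmqadmin declare exchange name="ex-$i" type=direct
  rabbitmqadmin declare queue name="q-$i" durable=true
  rabbitmqadmin declare binding source="ex-$i" destination="q-$i" routing_key="q-$i"
done
```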

It happens on Google Kubernetes (screenshot attached: Screenshot 2018-10-26 at 12.07.54).

It also happens locally on my MacBook Pro development machine.

I have configured the cluster with a mirroring policy of ha-mode: exactly with ha-params: 2, and queue-master-locator: min-masters.
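For completeness, the equivalent policy, if declared via rabbitmqctl, would look roughly like this (the policy name and pattern are illustrative):

```shell
rabbitmqctl set_policy ha-exactly-two "^" \
  '{"ha-mode":"exactly","ha-params":2,"queue-master-locator":"min-masters"}' \
  --apply-to queues
```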

RabbitMQ server: 3.7.8 (dockerized; rabbitmq:3-management) Erlang: 20.3.8.5; Erlang/OTP 20 [erts-9.3.3.3] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]

Operating system version: macOS High Sierra and Mojave; Google Kubernetes platform

All client libraries used:

RabbitMQ plugins:
[ ] rabbitmq_amqp1_0 3.7.8
[ ] rabbitmq_auth_backend_cache 3.7.8
[ ] rabbitmq_auth_backend_http 3.7.8
[ ] rabbitmq_auth_backend_ldap 3.7.8
[ ] rabbitmq_auth_mechanism_ssl 3.7.8
[ ] rabbitmq_consistent_hash_exchange 3.7.8
[ ] rabbitmq_event_exchange 3.7.8
[ ] rabbitmq_federation 3.7.8
[ ] rabbitmq_federation_management 3.7.8
[ ] rabbitmq_jms_topic_exchange 3.7.8
[E] rabbitmq_management 3.7.8
[e] rabbitmq_management_agent 3.7.8
[ ] rabbitmq_mqtt 3.7.8
[ ] rabbitmq_peer_discovery_aws 3.7.8
[ ] rabbitmq_peer_discovery_common 3.7.8
[ ] rabbitmq_peer_discovery_consul 3.7.8
[ ] rabbitmq_peer_discovery_etcd 3.7.8
[ ] rabbitmq_peer_discovery_k8s 3.7.8
[ ] rabbitmq_random_exchange 3.7.8
[ ] rabbitmq_recent_history_exchange 3.7.8
[ ] rabbitmq_sharding 3.7.8
[ ] rabbitmq_shovel 3.7.8
[ ] rabbitmq_shovel_management 3.7.8
[ ] rabbitmq_stomp 3.7.8
[ ] rabbitmq_top 3.7.8
[ ] rabbitmq_tracing 3.7.8
[ ] rabbitmq_trust_store 3.7.8
[e*] rabbitmq_web_dispatch 3.7.8
[ ] rabbitmq_web_mqtt 3.7.8
[ ] rabbitmq_web_mqtt_examples 3.7.8
[ ] rabbitmq_web_stomp 3.7.8
[ ] rabbitmq_web_stomp_examples 3.7.8

Client logs: client.log.txt, client.producer.log.txt

Server log: server.log.txt (/var/log/rabbitmq/log/crash.log is empty)

Crash dump (too large for Github): https://files.fm/f/psz5pjdn

I don't seem to have the rabbitmq-collect-env script available in the Docker image. However, here is the output of rabbitmq-diagnostics report, taken after the crashed node was restarted: diag-report.txt
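For reference, the report can be collected from a dockerized or Kubernetes-deployed node along these lines (the container and pod names are placeholders for my setup):

```shell
# Dockerized node on the local machine:
docker exec rabbitmq rabbitmq-diagnostics report > diag-report.txt

# Node running on the Kubernetes cluster:
kubectl exec rabbitmq-0 -- rabbitmq-diagnostics report > diag-report.txt
```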

Here is an archive of the client application I used: rabbitmq-test.zip

Please let me know how I can provide additional information.

michaelklishin commented 6 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team).

We get at least a dozen questions through various venues every single day, often light on details. At that rate, GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because GitHub is a tool our team uses heavily nearly every day, the signal/noise ratio of issues is something we care about a lot.

Please post this to rabbitmq-users.

Thank you.

michaelklishin commented 6 years ago

rabbitmq-collect-env is in a separate repo.

Monitoring and inspecting system logs should be your next steps; there is not much else to go on in the RabbitMQ log.

There's a good chance this is nothing other than the out-of-memory (OOM) killer or something similar (note that there are also no alarms in the logs).
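One way to confirm or rule that out (a sketch; the pod name is a placeholder):

```shell
# On the host or node: look for OOM killer activity in the kernel log
dmesg -T | grep -iE "out of memory|oom-killer|killed process"

# On Kubernetes: check whether the container was terminated with OOMKilled
kubectl describe pod <rabbitmq-pod> | grep -i -A 3 "last state"
```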

michaelklishin commented 6 years ago

Your Erlang build has HiPE enabled but not kernel polling. That's an unusual combination, at least for Linux. Consider using OTP 21 (where kernel polling cannot be disabled, since the I/O subsystem is significantly different and assumes kernel polling API availability) and disabling HiPE. HiPE routinely caused obscure runtime segfaults in the pre-18.x days, for example.
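Both can be verified on a running node, e.g. (a sketch; requires rabbitmqctl access to the node):

```shell
# Whether the runtime uses kernel polling (true/false)
rabbitmqctl eval 'erlang:system_info(kernel_poll).'

# Whether the emulator was built with HiPE support
# (returns an architecture such as amd64, or 'undefined')
rabbitmqctl eval 'erlang:system_info(hipe_architecture).'
```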

kitsirota commented 6 years ago

@jaco-terbraak we had to set a 10-minute minimum wait before checking whether a node is healthy in 5-node clusters with HiPE enabled. Typically HiPE adds about 5-8 minutes to boot times for us. The process will be running, and in the BOSH release the monit job will show running, but in the logs you should see a progress bar for the HiPE compilation.

Once HiPE compilation is done, it can take another 1-2 minutes for the node to sync up and join the cluster (larger deployments can take up to 15 minutes to sync up on the last node).

In practice, any kind of maintenance on our clusters goes like this: BOSH (the PCF instance orchestration layer) takes one node down and waits 10 minutes after the process is running before checking whether it's healthy and proceeding to the next node.

If you're monitoring a specific vhost, you can also have your automation curl <rmq-worker-ip>:15672/api/vhosts/<vhost-name> and grep for "failed" (I don't remember whether it's "failed" or "down"). To be sure your deployment is complete, you'll want to either wait for every vhost supervisor to be running, or manually trigger a vhost supervisor restart on a node via the HTTP API.
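Something along these lines (host, vhost, and credentials are placeholders, and the exact status string may differ between versions):

```shell
# Poll the management API for the vhost and flag any non-running state
curl -s -u guest:guest "http://<rmq-worker-ip>:15672/api/vhosts/<vhost-name>" \
  | grep -iE "failed|down|stopped"
```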

michaelklishin commented 6 years ago

That's a good point. If HiPE is enabled for RabbitMQ, it also incurs a several-minute startup penalty, which is spent compiling most RabbitMQ modules with HiPE.
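For reference, whether RabbitMQ compiles its modules with HiPE is controlled by a single config key (shown here for the new-style rabbitmq.conf; it defaults to false):

```
# rabbitmq.conf: skip HiPE compilation of RabbitMQ modules at boot
hipe_compile = false
```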

Let's continue on rabbitmq-users.

jaco-terbraak commented 6 years ago

Thanks all for taking the time to look into this. I've posted a continuation thread here: https://groups.google.com/forum/#!topic/rabbitmq-users/XqHRTxfVVe0