rabbitmq / rabbitmq-federation

RabbitMQ Federation plugin
https://www.rabbitmq.com/

Federation links that fail to connect with a timeout leak direct connection and channel processes #119

Closed Ayanda-D closed 4 years ago

Ayanda-D commented 4 years ago

The Federation Plugin has a process leak which manifests on failed upstream connection attempts due to AMQP client timeouts on the connecting/downstream node.

On long-running nodes with high uptime (months or years), these leaked processes can eventually take down a node once it reaches the node's Erlang process limit.

Very easy to reproduce:

  1. Set up a federation link across two nodes
  2. Add an iptables rule to block the downstream node from the upstream
  3. Wait and observe the process count, downstream connections and channels increase every minute
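
To put numbers on step 3, the sampled process counts can be turned into a growth rate and a rough time-to-exhaustion. This is only an illustrative sketch: the sample values are made up, the function names are hypothetical, and the process limit of 1,048,576 is an assumption. The node's real values can be read with `rabbitmqctl eval 'erlang:system_info(process_count).'` and `rabbitmqctl eval 'erlang:system_info(process_limit).'`.

```python
def growth_per_minute(samples):
    """Average growth per minute between the first and last (seconds, count) sample."""
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (c1 - c0) * 60.0 / (t1 - t0)


def minutes_until_limit(current, rate_per_minute, limit):
    """Minutes until `current` reaches `limit` at a constant rate; None if not growing."""
    if rate_per_minute <= 0:
        return None
    return max(limit - current, 0) / rate_per_minute


# Hypothetical samples: (seconds since start, Erlang process count).
samples = [(0, 1000), (60, 1012), (120, 1024), (180, 1036)]

# Assumption: the node's Erlang process limit; replace with the real value.
ASSUMED_PROCESS_LIMIT = 1_048_576

rate = growth_per_minute(samples)
eta = minutes_until_limit(samples[-1][1], rate, ASSUMED_PROCESS_LIMIT)
print(f"{rate:.1f} processes/minute, ~{eta / (60 * 24):.1f} days to the assumed limit")
```

A steady positive rate while the upstream is blocked, with no drop once the link retries, is the signature of the leak.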

What's occurring is that, on start-up, the federation link process makes an AMQP client call to connect to the upstream. The call times out after 60s and throws an exception, which currently goes uncaught. By that point the link has already created a local downstream connection and channel, which it never closes. Each failed attempt therefore adds to the totals, giving a periodic rise in connections (1 per minute), channels (1 per minute) and Erlang process count (approx. 12 per minute). The lower the AMQP client timeout, the faster the Erlang process count can exhaust the node's process limit, which can ultimately lead to a complete node crash. The default AMQP client call timeout is 60s.
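
The pattern can be modelled in a few lines. This is a simplified Python sketch, not the plugin's Erlang code and not the actual fix: `FakeBroker`, `connect_upstream` and the `start_link_*` functions are hypothetical stand-ins. It only shows why resources opened before a failing call pile up unless the failure path releases them.

```python
class FakeBroker:
    """Stand-in for the local (downstream) node; counts open resources."""

    def __init__(self):
        self.open_resources = 0

    def open(self):
        self.open_resources += 1

    def close(self):
        self.open_resources -= 1


def connect_upstream():
    """Stand-in for the upstream connect call that times out."""
    raise TimeoutError("upstream connection attempt timed out")


def start_link_leaky(broker):
    broker.open()  # local connection
    broker.open()  # local channel
    connect_upstream()  # raises; the two resources above are never closed


def start_link_safe(broker):
    broker.open()  # local connection
    broker.open()  # local channel
    try:
        connect_upstream()
    except TimeoutError:
        # Release what was opened before the failure, then re-raise.
        broker.close()
        broker.close()
        raise


def run_attempts(start_link, attempts):
    """Start the link repeatedly, as happens when each attempt times out."""
    broker = FakeBroker()
    for _ in range(attempts):
        try:
            start_link(broker)
        except TimeoutError:
            pass  # the link is started again after each failure
    return broker.open_resources


leaky = run_attempts(start_link_leaky, 10)
safe = run_attempts(start_link_safe, 10)
print(f"leaky: {leaky} open resources, safe: {safe} open resources")
```

In the leaky variant the count grows by two per attempt (one connection, one channel), matching the one-per-minute growth of each described above; in the safe variant it stays flat.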

This problem was also reported about a year ago: https://groups.google.com/g/rabbitmq-users/c/VmMnp2pIBvE/m/KEGnfIA8AgAJ

Timeouts in the Erlang AMQP client have been around for a while now, so this issue has been present for a couple of years (federation plugin users will probably need to upgrade).

Connection, channel and process leaks manifest as follows in tests:

1. Leaking processes (~25k)

image

2. Leaking connections (~2k)

image

3. Leaking channels (~2k)

image

michaelklishin commented 3 years ago

Backported to v3.8.x.