The Federation Plugin leaks processes when upstream connection attempts
fail because of AMQP client timeouts on the connecting (downstream) node.
On long-running nodes (uptime of months or years), this leak can
eventually take down a node once the process count reaches the node's
Erlang process limit.
Very easy to reproduce:
1. Set up a federation link across 2 nodes
2. Add an iptables rule that blocks the downstream node from reaching the upstream
3. Wait and observe the process count, downstream connections and
channels increase every minute
What's occurring: on start-up, the federation link process makes an
AMQP client call to connect to the upstream. The call times out after
60s and throws an exception, which currently goes uncaught. By then,
the link has already created a local downstream connection and channel,
which it never closes. Repeated attempts lead to a periodic rise
in connections (1 per minute), channels (1 per minute) and Erlang process
count (approx. 12 per minute). The lower the AMQP client timeout, the
faster the Erlang process count can exhaust the node's process limit,
which can ultimately lead to a complete node crash. The default AMQP
client call timeout is 60s.
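The leak pattern described above can be illustrated with a small, self-contained Python model. This is only a sketch of the mechanism, not the plugin's Erlang code: `Node`, `UpstreamTimeout`, `connect_upstream`, `start_link_leaky`, `start_link_fixed` and `run_attempts` are all hypothetical names, and the retry loop stands in for whatever restarts the link after each failure.

```python
class UpstreamTimeout(Exception):
    """Stand-in for the AMQP client call timeout (hypothetical name)."""


class Node:
    """Toy downstream node that only counts its open local resources."""

    def __init__(self):
        self.connections = 0
        self.channels = 0

    def open_local(self):
        # The link opens a local (downstream) connection and channel.
        self.connections += 1
        self.channels += 1

    def close_local(self):
        self.connections -= 1
        self.channels -= 1


def connect_upstream():
    # Assumption: the upstream is blocked, so every attempt times out.
    raise UpstreamTimeout("upstream connect timed out")


def start_link_leaky(node):
    node.open_local()
    connect_upstream()  # raises; the local resources are never closed


def start_link_fixed(node):
    node.open_local()
    try:
        connect_upstream()
    except UpstreamTimeout:
        node.close_local()  # release local resources before failing
        raise


def run_attempts(start_link, attempts):
    """Retry the link `attempts` times and report what is left open."""
    node = Node()
    for _ in range(attempts):
        try:
            start_link(node)
        except UpstreamTimeout:
            pass  # the link is restarted and tries again
    return node.connections, node.channels


if __name__ == "__main__":
    print("leaky:", run_attempts(start_link_leaky, 5))
    print("fixed:", run_attempts(start_link_fixed, 5))
```

In this model the leaky variant accumulates one connection and one channel per failed attempt, while the variant that handles the timeout leaves nothing open.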
Timeouts in the Erlang AMQP client have been around for a while now,
so this issue has been present for a couple of years (federation plugin
users will probably need to upgrade).
Connection, channel and process leaks manifest as follows in tests:
1. Leaking Processes (~25k)
2. Leaking Connections (~2k)
3. Leaking Channels (~2k)
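As a rough cross-check, the observed counts can be related to the leak rates described earlier (1 connection, 1 channel and approx. 12 processes per failed attempt, one attempt per timeout). The helper below is a hypothetical back-of-the-envelope calculation; it assumes the rates are constant and that nothing else on the node creates or frees these resources.

```python
def hours_to_reach(count, leaked_per_attempt, timeout_s=60.0):
    """Hours of failed attempts needed to leak `count` resources.

    Assumes one failed attempt per `timeout_s` seconds and a constant
    number of resources leaked per attempt.
    """
    attempts = count / leaked_per_attempt
    return attempts * timeout_s / 3600.0


if __name__ == "__main__":
    # Figures taken from the test observations above.
    print("processes:   %.1f h" % hours_to_reach(25_000, 12))
    print("connections: %.1f h" % hours_to_reach(2_000, 1))
    print("channels:    %.1f h" % hours_to_reach(2_000, 1))
```

Under these assumptions, ~25k processes and ~2k connections both correspond to roughly a day and a half of continuous failed attempts, so the three figures are consistent with each other. Passing a node's actual process limit as `count` gives an estimate of how long a blocked link could run before exhausting it, and halving `timeout_s` halves that time.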
Types of Changes
What types of changes does your code introduce to this project?
Put an x in the boxes that apply
[x] Bugfix (non-breaking change which fixes issue #NNNN)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Documentation (correction or otherwise)
[ ] Cosmetics (whitespace, appearance)
Checklist
Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.
[ ] I have added tests that prove my fix is effective or that my feature works
[ ] I have added necessary documentation (if appropriate)
[ ] Any dependent changes have been merged and published in related repositories
Further Comments
If this is a relatively large or complex change, kick off the discussion by
explaining why you chose the solution you did and what alternatives you
considered, etc.
This problem was also reported about a year ago: https://groups.google.com/g/rabbitmq-users/c/VmMnp2pIBvE/m/KEGnfIA8AgAJ