rabbitmq / rabbitmq-federation

RabbitMQ Federation plugin
https://www.rabbitmq.com/

Direct connections on downstream not closing when upstream connection is closed #76

Closed ricardobcl closed 6 years ago

ricardobcl commented 6 years ago

The issue happens with federated exchanges, where there is one "normal" (aka TCP) connection on the upstream node and one "direct" (aka Erlang message-passing) connection on the downstream node (per federation link).

If I do a "Force Close" in the management UI on the connection on the upstream node, the connection quickly recovers and comes back. But on the downstream, there are now 2 direct connections.

This happens every time. If I have 1000 federated exchanges, there are 1000 connections on the upstream node and another 1000 on the downstream. After killing all connections on the upstream, I have 2000 direct connections on the downstream. The number of processes also climbs quickly: from my quick observations, every direct connection introduced 10-20 new processes. Memory usage increases as well.
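
A rough way to watch that growth on the downstream node is plain Erlang/OTP introspection, e.g. run via rabbitmqctl eval (this is just a generic sketch; both calls are standard OTP, nothing federation-specific):

%% Sample the node's Erlang process count and total memory use.
{erlang:system_info(process_count), erlang:memory(total)}.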

This thread seems to describe essentially the same problem:
https://groups.google.com/forum/#!searchin/rabbitmq-users/erlang$20processes%7Csort:date/rabbitmq-users/jfQN4XmhDes/csRuOiqdCwAJ
I also know of at least one other case where this happens (it eventually causes the node to hit the process count limit).

The cause seems to be that the process handling the federation link on the downstream node crashes while terminating, before it closes the direct connection. I have a PR ready that solves this issue, according to my local testing.

I did my testing locally, setting up two Docker containers using the image rabbitmq:3.7.5-management. The issue is pretty easy to reproduce, but if you need any more information, let me know.

michaelklishin commented 6 years ago

Thank you for the PR. It looks reasonable. How can we reproduce it? By simply cutting off traffic between the downstream and upstream nodes, e.g. using iptables?

ricardobcl commented 6 years ago

The simplest case I tested was just closing the connection with "Force Close" in the upstream's management UI.

I also tested closing 100 and then 1000 federated exchange links using a plugin that kills connections (killing them almost simultaneously on the upstream node).
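
For anyone wanting to approximate that bulk force-close without a dedicated plugin, something like the following sketch should be close (run via rabbitmqctl eval on the upstream node; rabbit_networking:connections/0 and rabbit_networking:close_connection/2 are internal broker APIs, so treat this as illustrative rather than supported):

%% Close every network connection on the node almost simultaneously.
[rabbit_networking:close_connection(Pid, "bulk force-close for testing") ||
    Pid <- rabbit_networking:connections()].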

Another test I did was with "docker stop" and "docker kill" on the upstream node; both caused the same issue: the original downstream direct connection survived, and when the upstream node came back, an additional direct connection was created.

I didn't test with iptables, but I assume something similar would happen.

To be clear, if I close the downstream connection (direct connection) first, everything is fine (no duplicate/zombie connections).

I think the issue was in terminate/2 in rabbit_federation_exchange_link.erl: it tried to clean up some state before closing the connections, and that code path blew up trying to open a channel on a closing/closed connection, so the direct connection was never closed in terminate/2.

Also, because the direct connection is not a TCP connection, the TCP timeout settings used by RabbitMQ and the OS didn't apply to it, so it survived forever.
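
To make the failure mode concrete, here is a minimal Erlang sketch of a crash-safe terminate/2. This is not the plugin's actual code: the state record and clean_up_bindings/1 are hypothetical stand-ins, and the point is only the shape of the fix, i.e. moving the connection close into an after clause so it runs even when cleanup crashes.

-module(link_terminate_sketch).
-export([terminate/2]).

%% Hypothetical record standing in for the link's real state.
-record(state, {downstream_connection}).

terminate(_Reason, #state{downstream_connection = DConn} = State) ->
    try
        %% Previously, cleanup like this ran first and could crash by
        %% opening a channel on a closing/closed connection, so the
        %% close below was never reached.
        clean_up_bindings(State)
    catch
        _:_ -> ok
    after
        %% amqp_connection:close/1 is the Erlang AMQP client's close
        %% call; running it in `after` guarantees the direct connection
        %% is released even if cleanup blows up.
        catch amqp_connection:close(DConn)
    end,
    ok.

%% Hypothetical stand-in for the state cleanup that crashed.
clean_up_bindings(_State) ->
    ok.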

michaelklishin commented 6 years ago

Your hypothesis is plausible, but some of the comments can be misleading. Direct connections rely on inter-node communication links, which are TCP connections with a separate peer unavailability detection mechanism. However, a direct connection can also be established to the node where the link itself is running, in which case it is not a TCP connection at all (the processes communicate directly).

We will release a 3.7.7-rc.1 and then QA this PR, so it has a chance of getting into 3.7.7 final. Thanks again!

ricardobcl commented 6 years ago

Yes, you're right. I was focused on the single-node test I was running (where the direct connection is obviously local and doesn't use TCP), but in a cluster that's not always true.

Thanks for clarifying so that other people reading this do not get the wrong idea.

michaelklishin commented 6 years ago

I could reproduce the issue:

  1. Start two independent 3.7.6 or 3.7.7-beta.2 nodes with the rabbitmq_management, rabbitmq_federation and rabbitmq_federation_management plugins enabled, e.g.

# node 1
./sbin/rabbitmq-server

# node 2
RABBITMQ_ALLOW_INPUT=1 RABBITMQ_NODE_PORT=5673 RABBITMQ_NODENAME="hare@warp10" RABBITMQ_SERVER_START_ARGS="-rabbitmq_management listener [{port,15673}]" ./sbin/rabbitmq-server

# then enable the plugins as usual

  2. Set up the upstream and a policy for exchange federation:

./sbin/rabbitmqctl set_parameter federation-upstream up1 '{"uri":"amqp://localhost:5673"}'

./sbin/rabbitmqctl set_policy --apply-to exchanges x-federation "federated.*" '{"federation-upstream":"up1"}'

  3. Declare an exchange of any type named federated.x1
  4. Open the management UI on both nodes in two tabs and observe the connections on each
  5. On the upstream node (port 5673), force close the network connection
  6. Observe its recovery
  7. On the downstream node (port 5672), observe that the number of direct connections grows every time you repeat step 5 and the connection recovers (step 6 is performed automatically by the plugin)
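
To make step 7 easier to verify with many federated exchanges, the direct connections on the downstream can also be counted from the shell. A sketch, assuming rabbit_direct:list/0, an internal broker API that lists direct connections (the management UI shows the same information):

%% Run via rabbitmqctl eval on the downstream node (port 5672).
length(rabbit_direct:list()).
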
michaelklishin commented 6 years ago

The proposed changes in #77 are reasonable and effective according to my tests 👍👍.

selimt commented 6 years ago

We are still seeing connections that are not closed after a network outage. See the screenshot below:

[screenshot: management UI connection list showing unclosed connections]

michaelklishin commented 6 years ago

@selimt federation links restart after network failures. Sorry, but that's not evidence that this specific issue is still present. Several users plus the reporter have confirmed that it is fixed. This is mailing list material.

selimt commented 6 years ago

Apologies, I'll post my issue there. -Selim

michaelklishin commented 6 years ago

Follow-up rabbitmq-users thread.