ricardobcl closed this issue 6 years ago
Thank you for the PR. It looks reasonable. How can we reproduce it? Simply cut traffic off between the downstream and upstream nodes, e.g. using iptables?
The simplest case I tested was to close the connection in the upstream's management UI with "Force Close".
I also tested closing 100 and then 1000 federated exchange links with a plugin that kills connections (killing them almost simultaneously on the upstream node).
Another test was with "docker stop" and "docker kill" on the upstream; both caused the same issue: the original direct connection on the downstream survived, and when the upstream node came back, an additional direct connection was created.
I didn't test with iptables, but I assume that something similar will happen.
To be clear, if I close the downstream connection (direct connection) first, everything is fine (no duplicate/zombie connections).
I think the issue was in terminate/2 in rabbit_federation_exchange_link.erl: it tried to clean up some state before closing the TCP connection, and that code path blew up trying to open a channel on a closing/closed connection, so the direct connection was never closed by terminate/2.
Also, because the direct connection is not a TCP connection, the TCP timeout settings in RabbitMQ and the OS didn't apply to it, so it survived forever.
Your hypothesis is plausible, but some comments can be misleading. Direct connections rely on the inter-node communication connections between nodes, which are TCP connections with a separate peer unavailability detection mechanism. However, a direct connection can also be established to the node where the link is running, in which case it is not a TCP connection (the processes communicate directly).
We will release a 3.7.7-rc.1 and then QA this PR, so it has a chance of getting into 3.7.7 final. Thanks again!
Yes, you're right. I was focused on the single-node test I was running, where the direct connection is obviously local and doesn't use TCP, but in a cluster that's not always true.
Thanks for clarifying so that other people reading this do not get the wrong idea.
I could reproduce the issue with the rabbitmq_management, rabbitmq_federation and rabbitmq_federation_management plugins enabled, e.g.

# node 1
./sbin/rabbitmq-server
# node 2
RABBITMQ_ALLOW_INPUT=1 RABBITMQ_NODE_PORT=5673 RABBITMQ_NODENAME="hare@warp10" RABBITMQ_SERVER_START_ARGS="-rabbitmq_management listener [{port,15673}]" ./sbin/rabbitmq-server
# then enable the plugins as usual
./sbin/rabbitmqctl set_parameter federation-upstream up1 '{"uri":"amqp://localhost:5673"}'
./sbin/rabbitmqctl set_policy --apply-to exchanges x-federation "federated.*" '{"federation-upstream":"up1"}'
Then declare an exchange matching the policy, e.g. federated.x1, on the downstream node.
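The declaration and the force-close can also be done from the CLI; this is just a sketch, assuming rabbitmqadmin has been downloaded from the downstream node's management UI (the "Force Close" button in the upstream's management UI works just as well):

# declare a downstream exchange that matches the policy pattern
rabbitmqadmin declare exchange name=federated.x1 type=topic
# force-close the federation link's connection on the upstream node
./sbin/rabbitmqctl -n hare@warp10 close_all_connections "reproducing the stale direct connection"

Before the fix, the downstream then shows an extra direct connection per federated exchange once the link recovers.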
The proposed changes in #77 are reasonable and effective according to my tests. 👍👍
We are still seeing connections that are not closed after we had a network outage. See the screenshot below:
@selimt Federation links restart after network failures. Sorry, but that's not evidence that this specific issue is still present. Several users, plus the reporter, have confirmed that it is fixed. This is mailing list material.
Apologies, I'll post my issue there. -Selim
Follow-up rabbitmq-users thread.
The issue happens with federated exchanges, where there is one "normal" (aka TCP) connection on the upstream node and one "direct" (aka Erlang message-passing) connection on the downstream node (per federation link).
If I do a "Force Close" in the management UI on the connection on the upstream node, the connection quickly recovers and comes back. But on the downstream, there are now 2 direct connections.
This happens every time. If I have 1000 federated exchanges, I have 1000 connections on the upstream node and another 1000 direct connections on the downstream node. After killing all connections on the upstream, I have 2000 direct connections on the downstream. Also, the number of processes quickly goes up; from my quick observations, every direct connection introduced 10-20 new processes. Memory usage also increases.
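For what it's worth, the growth is easy to quantify with standard CLI tools on the downstream node; a small sketch (the column list is just an example):

# count connections (including direct ones) and Erlang processes on the downstream node
rabbitmqctl list_connections name user state | wc -l
rabbitmqctl eval 'erlang:system_info(process_count).'

Running these before and after killing the upstream connections shows the connection count doubling and the process count climbing.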
This thread seems to describe the same problem: https://groups.google.com/forum/#!searchin/rabbitmq-users/erlang$20processes%7Csort:date/rabbitmq-users/jfQN4XmhDes/csRuOiqdCwAJ I know of at least one other case where this happens (it eventually causes the node to hit the process count limit).
The cause seems to be that the process handling the federation link on the downstream node crashes while terminating, before it closes the direct connection. I have a PR ready that solves this issue according to my local testing.
I did my testing locally, setting up two docker containers using the rabbitmq:3.7.5-management image. The issue is pretty easy to reproduce, but if you need any more information, let me know.
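For reference, a two-container setup along those lines could look like the following. This is only a sketch: the container names, the network name and the federation user are made up for illustration, and a dedicated user is needed because the default guest user can only connect from localhost.

docker network create rabbitnet
docker run -d --name upstream --network rabbitnet -p 15673:15672 rabbitmq:3.7.5-management
docker run -d --name downstream --network rabbitnet -p 15672:15672 rabbitmq:3.7.5-management
# wait for both nodes to boot, then create a user for the link on the upstream
docker exec upstream rabbitmqctl add_user federation federation
docker exec upstream rabbitmqctl set_permissions federation ".*" ".*" ".*"
# enable federation on the downstream and point it at the upstream
docker exec downstream rabbitmq-plugins enable rabbitmq_federation rabbitmq_federation_management
docker exec downstream rabbitmqctl set_parameter federation-upstream up1 '{"uri":"amqp://federation:federation@upstream"}'
docker exec downstream rabbitmqctl set_policy --apply-to exchanges x-federation "federated.*" '{"federation-upstream":"up1"}'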