Closed dch closed 5 years ago
Thank you for the detailed analysis. You haven't specified what version of RabbitMQ was used but my guess is that it was 3.7.8 at most. One part of your findings — the server not always using an interval computed from the negotiated value — seems to be addressed by https://github.com/rabbitmq/rabbitmq-common/pull/278, shipped in 3.7.9 two days ago.
The other part in this client matches https://github.com/rabbitmq/rabbitmq-java-client/blame/master/src/main/java/com/rabbitmq/client/impl/AMQConnection.java#L579, which has been around for 11 years and we are not going to change.
In most environments (CloudAMQP is an exception), users who set heartbeats to 0
do not do so intentionally or do not necessarily understand the implications. If you use TCP keepalives then you have two options:
0
.I am updating the docs to clarify the negotiation process and how heartbeats can be disabled. This helped me go through the negotiation code one more time and I agree that the client code never worked as the docs (as of 3.0) claim it to be. That sucks but breaking client changes are a painful process to go through since there are so many of them and RabbitMQ users have come to expect any reasonably recent client version to work with any reasonably recent server version. So our compatibility story with clients is both a blessing and a curse :(
Given that most users do not want heartbeats disabled, I still lean towards "we won't change the negotiation process". It's just not a commonly enough reported issue.
If you use a dedicated plan CloudAMQP support likely can change the server config for you.
I'll ask other members of our team to take a look and share their opinion.
thanks Michael. For reference the client here is the erlang one and I’ve tested this against both 3.6.12, .14 and my 3.7.8 install. I agree that changing the server side is not an option and I’m fine with heartbeats in erlang and other clients that can support it. They’re there for a good reason.
To clarify: the server doesn't seem to be the problematic bit to me. It's the client logic that we tried to unify across all clients (the Java client is used here because it is the reference implementation for us) that negotiates to a max or min value, and never 0 unless both values are 0.
As far as the protocol goes, in AMQP 0-9-1 it's the client that has the final call.
never 0 unless both values are 0
👍 to this
issue
An erlang client (or elixir that depends on this module) cannot be configured to run with disabled amqp-level heartbeats, and always accepts the default sent by the server - e.g. as recommended by cloudamqp. NB whether that is a good choice is not the intended purpose of this issue ;-)
In addition, I speculate that this same bug in the erlang client, causes the heartbeat sender on the server side to be disabled (as the client requested), but the erlang gen_server on the client side is incorrectly waiting the negotiated 2xHB duration for heartbeats inside rabbitmq-common, that cascades a timeout & subsequent socket termination, as the awaited heartbeat never comes.
rabbitmq docs
expected behaviour
Connecting to
amqp://..?heartbeat=37
does show expected behaviour - client & server agree on the heartbeat frequency - as seen via tcpdump & validated in clientamqp_connection:info(Conn, [heartbeat]).
for client, and server in connection web ui.Connecting to
amqp://..?heartbeat=0
should end up with heartbeat=0 on both client & in rabbitmq web ui.unexpected behaviour
Actually,
amqp://..?heartbeat=0
does not end up with a configured HB=0, but only inherits whatever the server has recommended.In addition, despite the wire protocol negotiation being correct, the actual heartbeats are not sent at the expected 1/2 heartbeat time, (presumably as the requested 0 HB setting is propagated into rabbitmq-common's heartbeat timer on the server side, even though this doesn't match what has been agreed on the wire).
Thus, from the client's perspective, after the server has not sent any heartbeats, the gen_server hits a timeout, and the erlang client then cleans up, requesting to close the TCP connection and the server responds in kind.
possible bug
https://github.com/rabbitmq/rabbitmq-erlang-client/blob/master/src/amqp_network_connection.erl#L236-L243 looks like it makes incorrect comparisons in this specific case:
I think we should see
[0,0,0]
here, for example.speculation
As mentioned above, I think this is therefore wrong on both server and client for RabbitMQ as the library is shared in both systems. In addition, I think this incorrect setting propagates further, into
rabbit_common:start_timer
, causing the negotiated AMQP heartbeats (which we didn't request!) not to be send from the server (as the amqp params has heartbeat=0 which doesn't match what is agreed on the wire, as confirmed by tcpdump), but the client is still expecting to receive a heartbeat, and when it doesn't receive one, the TCP connection is reset, which causes the entire connection to fail. I think this hits codepath here https://github.com/rabbitmq/rabbitmq-common/blob/master/src/rabbit_heartbeat.erl#L109-L113 with the 0 pattern match.This ultimately causes the client to send a FIN/ACK requesting to close the channel as the connection has not received the expected heartbeats from the server -- this is the "matching bug" on the server end causing problems, I think.
Anyway, I have a tcpdump that I can share privately if needed, email/irc is fine, but it's easy enough to reproduce as follows:
test setup
gmake deps app shell
tcpdump of succesful tune negotiation
tcpdump of final RST conversation at end of stream
client->server
server->client
erlang shell session showing bug and timeout
These are logs from client side of reproduced issue (with connection loss after 60s as cloudamqp has a 120s heartbeat default).