Closed dougbarth closed 12 years ago
"The first issue is that our consumers do not resubscribe on their second reconnection": yes, and this is by design. Automatic recovery may eventually materialize, but the recovery process is so application-dependent that it is very likely to do more harm than good.
You are looking at AMQP 1.0. RabbitMQ and the amqp gem implement AMQP 0.9.1 (in the case of amqp gem 0.7.1, AMQP 0.8).
That said, the 0.9.1 spec says the same thing:
Any sent octet is a valid substitute for a heartbeat, thus heartbeats only have to be sent if no non-heartbeat AMQP traffic
is sent for longer than one heartbeat interval. If a peer detects no incoming traffic (i.e. received octets) for two heartbeat
intervals or longer, it should close the connection without following the Connection.Close/Close-Ok handshaking, and
log an error.
(Section 4.2.7).
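The two rules in the quoted spec text can be sketched as a small traffic tracker. This is a minimal illustration, not code from the amqp gem: the `HeartbeatTimer` class, its method names, and the injectable `clock` parameter are all hypothetical, and the monotonic clock is just one assumed way to read the time.

```ruby
# Hypothetical helper, NOT part of the amqp gem. It applies the two rules
# from the AMQP 0.9.1 text quoted above to timestamps of socket traffic.
class HeartbeatTimer
  # interval: negotiated heartbeat interval in seconds.
  # clock:    callable returning the current time in seconds (assumption:
  #           a monotonic clock by default; injectable for testing).
  def initialize(interval, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @interval = interval
    @clock = clock
    now = @clock.call
    @last_sent = now
    @last_received = now
  end

  # Call whenever any octets are written to the socket.
  def record_sent
    @last_sent = @clock.call
  end

  # Call whenever any octets are read from the socket
  # (any received octet counts, not only heartbeat frames).
  def record_received
    @last_received = @clock.call
  end

  # A heartbeat only has to be sent if nothing else was sent
  # for longer than one interval.
  def heartbeat_due?
    @clock.call - @last_sent > @interval
  end

  # The peer is treated as dead after two intervals or longer
  # without any incoming traffic.
  def peer_dead?
    @clock.call - @last_received >= 2 * @interval
  end
end
```

Because `record_received` is meant to be called for every read, a delivery from the broker resets the dead-peer check just as a heartbeat frame would.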
There is one sad downside to treating every frame as a heartbeat: Time.now is slow. But I have some ideas.
I just posted the error handling & recovery feature plan for amqp gem 0.8.0.RC14 on the mailing list; please join the conversation.
"There is one sad downside to treating every frame as a heartbeat: Time.now is slow. But I have some ideas"
Hey Michael,
Instead of defining the heartbeat in terms of time, could you just define it in terms of missed heartbeats? It looks like the behavior is that 2 missed heartbeats result in a reconnect. When a heartbeat message is sent, you decrement the value. If the value is zero, you reconnect. Upon receiving a message, you reset the heartbeat counter to 2.
Apologies if that's the plan you're already working on.
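For illustration, the counter scheme suggested above might look roughly like this. It is only a sketch of one reading of the suggestion (decrement first, then check for zero); the `MissedHeartbeatCounter` class and its method names are hypothetical and are not part of the amqp gem.

```ruby
# Hypothetical sketch of the missed-heartbeat counter; names are
# illustrative only and do NOT come from the amqp gem.
class MissedHeartbeatCounter
  MAX_MISSED = 2

  attr_reader :remaining

  def initialize(max_missed = MAX_MISSED)
    @max_missed = max_missed
    @remaining = max_missed
  end

  # Call each time a heartbeat frame is sent. Decrements the allowance
  # and returns true once it reaches zero, meaning the caller should
  # reconnect.
  def heartbeat_sent
    @remaining -= 1 if @remaining > 0
    reconnect_needed?
  end

  # Call on any frame received from the broker; this resets the
  # allowance, so no clock reads are needed on the receive path.
  def frame_received
    @remaining = @max_missed
  end

  def reconnect_needed?
    @remaining.zero?
  end
end
```

With this shape, the only time-based piece left is the periodic timer that sends heartbeats; receiving a frame is a plain integer assignment.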
Doug,
Good idea. Your suggestion is close to what I was thinking. Thanks for the feedback.
One more update: it looks like we have a pretty good solution for automatic recovery in place now, but it needs a lot of testing.
Hi,
We're fighting with some weird connection issues in our production environment. During the investigation, we found two issues that I think are bugs in the amqp library. I know the library is in a state of flux for the 0.8 release, but I thought I'd report these either way.
We're using amqp 0.7.1 and RabbitMQ 2.4.0.
The first issue is that our consumers do not resubscribe on their second reconnection. After reconnecting the first time, they subscribe to the queues they were subscribed to, but the second time, they just establish their connection and sit idle.
The second issue is that with heartbeating enabled, they misreport a dead connection if they're actually doing work. When looking at the verbose connection logging, I see a heartbeat frame being sent, but no frame being returned. When the connection is idle, a heartbeat frame is returned immediately. I ran across a FAQ on the AMQP site that mentioned that a client should consider any frame received as a heartbeat frame. It appears that RabbitMQ is following this behavior and skipping sending a response if a delivery is being sent.
http://www.amqp.org/confluence/display/AMQP/AMQP+1.0+Implementation+FAQ#AMQP1.0ImplementationFAQ-Q%3ADoheartbeatsalwayshavetobeonchannelzero%3F