ruby-amqp / amqp

EventMachine-based RabbitMQ client. Prefer Bunny: http://rubybunny.info. See documentation guides at http://ruby-amqp.github.io/amqp/.

Treat every incoming frame as a heartbeat #84

Closed dougbarth closed 12 years ago

dougbarth commented 13 years ago

Hi,

We're fighting with some weird connection issues in our production environment. During the investigation, we found two issues that I think are bugs in the amqp library. I know the library is in a state of flux for the 0.8 release, but I thought I'd report these either way.

We're using amqp 0.7.1 and RabbitMQ 2.4.0.

The first issue is that our consumers do not resubscribe on their second reconnection. After reconnecting the first time, they subscribe to the queues they were subscribed to, but the second time, they just establish their connection and sit idle.

The second issue is that with heartbeats enabled, they misreport a dead connection when they're actually doing work. In the verbose connection logging, I see a heartbeat frame being sent but no frame being returned. When the connection is idle, a heartbeat frame is returned immediately. I ran across a FAQ on the AMQP site that says a client should treat any frame it receives as a heartbeat frame. It appears that RabbitMQ follows this behavior and skips sending a heartbeat response if a delivery is being sent.

http://www.amqp.org/confluence/display/AMQP/AMQP+1.0+Implementation+FAQ#AMQP1.0ImplementationFAQ-Q%3ADoheartbeatsalwayshavetobeonchannelzero%3F

michaelklishin commented 13 years ago

"The first issue is that our consumers do not resubscribe on their second reconnection": yes, and this is by design. Automatic recovery may eventually materialize, but the recovery process is so application-dependent that it would very likely do more harm than good.

michaelklishin commented 13 years ago

You are looking at AMQP 1.0. RabbitMQ and the amqp gem implement AMQP 0.9.1 (amqp gem 0.7.1 implements AMQP 0.8).

That said, the 0.9.1 spec says the same thing:

Any sent octet is a valid substitute for a heartbeat, thus heartbeats only have to be sent if no non-heartbeat AMQP traffic 
is sent for longer than one heartbeat interval. If a peer detects no incoming traffic (i.e. received octets) for two heartbeat 
intervals or longer, it should close the connection without following the Connection.Close/Close-Ok handshaking, and 
log an error.

(Section 4.2.7).

There is one sad downside to treating every frame as a heartbeat: Time.now is slow. But I have some ideas.
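For illustration, here is a minimal Ruby sketch of the time-based reading of the spec text quoted above: record a timestamp whenever anything arrives, and treat the connection as dead once two heartbeat intervals pass with no traffic. The class name `TrafficLivenessMonitor`, its methods, and the injectable `clock` are hypothetical and are not part of the amqp gem. Note that the default clock calls `Time.now` on every frame, which is the cost mentioned above.

```ruby
# Hypothetical helper for illustration only; not part of the amqp gem.
class TrafficLivenessMonitor
  # heartbeat_interval: negotiated heartbeat interval, in seconds.
  # clock: any callable returning the current time; injectable so the
  #        logic can be exercised without real waiting (an assumption
  #        made for this sketch).
  def initialize(heartbeat_interval, clock = -> { Time.now })
    @heartbeat_interval = heartbeat_interval
    @clock = clock
    @last_received_at = @clock.call
  end

  # Call for every incoming frame, heartbeat or not.
  def frame_received
    @last_received_at = @clock.call
  end

  # True when nothing has been received for two heartbeat intervals or longer.
  def dead?
    (@clock.call - @last_received_at) >= 2 * @heartbeat_interval
  end
end
```

In an EventMachine application, `dead?` would typically be polled from a periodic timer, with `frame_received` called from the frame-handling path.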

michaelklishin commented 13 years ago

I just posted the error handling & recovery feature plan for amqp gem 0.8.0.RC14 on the mailing list. Please join the conversation.

dougbarth commented 13 years ago

"There is one sad downside to treating every frame as a heartbeat: Time.now is slow. But I have some ideas"

Hey Michael,

Instead of defining the heartbeat in terms of time, could you define it in terms of missed heartbeats? It looks like the current behavior is that 2 missed heartbeats result in a reconnect. When a heartbeat message is sent, you decrement the value. If the value is zero, you reconnect. Upon receiving a message, you reset the heartbeat counter to 2.

Apologies if that's the plan you're already working on.
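A minimal Ruby sketch of the counter idea described in this comment, under stated assumptions: the class name `MissedHeartbeatCounter` and its methods are hypothetical, not part of the amqp gem, and the order of operations (decrement first, then check for zero) is one literal reading of the description.

```ruby
# Hypothetical helper for illustration only; not part of the amqp gem.
class MissedHeartbeatCounter
  DEFAULT_MAX_MISSED = 2

  attr_reader :remaining

  def initialize(max_missed = DEFAULT_MAX_MISSED)
    @max_missed = max_missed
    @remaining = max_missed
  end

  # Call whenever any frame arrives; every frame counts as a heartbeat.
  def frame_received
    @remaining = @max_missed
  end

  # Call each time the client sends its own heartbeat frame.
  # Decrements the counter and returns true when it reaches zero,
  # meaning the caller should reconnect.
  def heartbeat_sent
    @remaining -= 1 if @remaining > 0
    @remaining.zero?
  end
end
```

This avoids reading the clock on every frame, since resetting an integer is all that happens on the receive path. Checking for zero before decrementing instead would tolerate one more silent interval before reconnecting.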

michaelklishin commented 13 years ago

Doug,

Good idea. Your suggestion is close to what I was thinking. Thanks for the feedback.

michaelklishin commented 13 years ago

One more update: it looks like we have a pretty good solution for automatic recovery in place now, but it needs a lot of testing.