ruby-amqp / bunny

Bunny is a popular, easy-to-use, mature Ruby client for RabbitMQ.

HeartbeatSender raises Errno::EPIPE on main thread when pipe is closed #559

Closed by sbonebrake 6 years ago

sbonebrake commented 6 years ago

Reproduced with RabbitMQ 3.7.6 and Bunny 2.11.0.

This happens when Bunny is created with { automatically_recover: false, threaded: false }. After the connection is open, if it drops at any point during a heartbeat write, the exception is raised on the thread that initialized Bunny.

Code for repro here: https://github.com/sbonebrake/bunny_repro/blob/master/heartbeat_failure_app.rb
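
A condensed sketch of what the linked repro does (assumptions on my part: a Toxiproxy proxy matching /rabbit/ that forwards local port 56722 to RabbitMQ, plus an illustrative iteration count; the linked file is authoritative):

require "bunny"
require "toxiproxy"

conn = Bunny.new("amqp://guest:guest@localhost:56722",
                 heartbeat_timeout: 3,
                 automatically_recover: false,
                 threaded: false)
conn.start
ch = conn.create_channel
ch.queue("Bob")

# Cut the TCP connection while the heartbeat sender is running; the next
# heartbeat write hits the closed pipe, and Errno::EPIPE surfaces as
# Bunny::NetworkFailure on this (the main) thread.
Toxiproxy[/rabbit/].down do
  5.times do |i|
    puts "I'm an app working ##{i}"
    sleep 1
  end
end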

D, [2018-08-10T16:17:43.231675 #72686] DEBUG -- : Sent protocol preamble
D, [2018-08-10T16:17:43.243196 #72686] DEBUG -- : Sent connection.start-ok
D, [2018-08-10T16:17:43.244105 #72686] DEBUG -- : Heartbeat interval negotiation: client = 3, server = 60, result = 3
I, [2018-08-10T16:17:43.244145 #72686]  INFO -- : Heartbeat interval used (in seconds): 3
D, [2018-08-10T16:17:43.244184 #72686] DEBUG -- : Will use socket read timeout of 6.6000000000000005
D, [2018-08-10T16:17:43.244243 #72686] DEBUG -- : Initializing channel ID allocator with channel_max = 2047
D, [2018-08-10T16:17:43.244370 #72686] DEBUG -- : Sent connection.tune-ok with heartbeat interval = 3, frame_max = 131072, channel_max = 2047
D, [2018-08-10T16:17:43.244441 #72686] DEBUG -- : Sent connection.open with vhost = /
D, [2018-08-10T16:17:43.245591 #72686] DEBUG -- : Initializing heartbeat sender...
D, [2018-08-10T16:17:43.245743 #72686] DEBUG -- : Allocated channel id: 1
D, [2018-08-10T16:17:43.247834 #72686] DEBUG -- : Session#handle_frame on 1: #<AMQ::Protocol::Channel::OpenOk:0x00007f9c9f28a8a0 @channel_id="">
I, [2018-08-10T16:17:43.247900 #72686]  INFO -- : Declaring Bob with ch.queue('Bob')...
D, [2018-08-10T16:17:43.248776 #72686] DEBUG -- : Session#handle_frame on 1: #<AMQ::Protocol::Queue::DeclareOk:0x00007f9c9f289950 @queue="Bob", @message_count=0, @consumer_count=0>
D, [2018-08-10T16:17:43.248835 #72686] DEBUG -- : Channel#handle_frame on channel 1: #<AMQ::Protocol::Queue::DeclareOk:0x00007f9c9f289950 @queue="Bob", @message_count=0, @consumer_count=0>
I, [2018-08-10T16:17:43.250230 #72686]  INFO -- : I'm and app working #0
D, [2018-08-10T16:17:43.649363 #72686] DEBUG -- : Sending a heartbeat, last activity time: 2018-08-10 16:17:43 -0700, interval (s): 0.4
D, [2018-08-10T16:17:44.050402 #72686] DEBUG -- : Sending a heartbeat, last activity time: 2018-08-10 16:17:43 -0700, interval (s): 0.4
E, [2018-08-10T16:17:44.052167 #72686] ERROR -- : Exception that killed this process: Bunny::NetworkFailure detected a network failure: Broken pipe
    heartbeat_failure_app.rb:37:in `sleep'
    heartbeat_failure_app.rb:37:in `block (2 levels) in <main>'
    heartbeat_failure_app.rb:29:in `each'
    heartbeat_failure_app.rb:29:in `block in <main>'
    /.rvm/gems/ruby-2.3.5/gems/toxiproxy-1.0.3/lib/toxiproxy.rb:141:in `down'
    /.rvm/gems/ruby-2.3.5/gems/toxiproxy-1.0.3/lib/toxiproxy/proxy_collection.rb:31:in `block (2 levels) in down'
    /.rvm/gems/ruby-2.3.5/gems/toxiproxy-1.0.3/lib/toxiproxy/proxy_collection.rb:32:in `down'
    heartbeat_failure_app.rb:28:in `<main>'
E, [2018-08-10T16:17:44.052261 #72686] ERROR -- : Exception cause that killed this process: Errno::EPIPE Broken pipe Broken pipe
    /.rvm/gems/ruby-2.3.5/gems/bunny-2.11.0/lib/bunny/transport.rb:162:in `write'
    /.rvm/gems/ruby-2.3.5/gems/bunny-2.11.0/lib/bunny/transport.rb:162:in `block in write_without_timeout'
    /.rvm/rubies/ruby-2.3.5/lib/ruby/2.3.0/monitor.rb:214:in `mon_synchronize'
    /.rvm/gems/ruby-2.3.5/gems/bunny-2.11.0/lib/bunny/transport.rb:162:in `write_without_timeout'
    /.rvm/gems/ruby-2.3.5/gems/bunny-2.11.0/lib/bunny/heartbeat_sender.rb:66:in `beat'
    /.rvm/gems/ruby-2.3.5/gems/bunny-2.11.0/lib/bunny/heartbeat_sender.rb:48:in `block in run'
    /.rvm/gems/ruby-2.3.5/gems/bunny-2.11.0/lib/bunny/heartbeat_sender.rb:47:in `loop'
    /.rvm/gems/ruby-2.3.5/gems/bunny-2.11.0/lib/bunny/heartbeat_sender.rb:47:in `run'

I'm not 100% certain what is intended to happen when a heartbeat fails. It seems the heartbeat thread probably should not use the Transport#write_without_timeout method, which raises exceptions on the session thread.

michaelklishin commented 6 years ago

If you run with threaded: false, you will get exceptions on the main thread. If you also use automatically_recover: false, there will be no attempts to recover from I/O exceptions. So with certain timing you can get a low-level I/O exception. I am not sure what would be expected here, but users who use threaded: false have explicitly opted in to getting exceptions on the main thread.

michaelklishin commented 6 years ago

At first I thought I had misinterpreted the issue and that an I/O exception is thrown even when a connection is closed cleanly. That doesn't seem to be the case:

2.5.1 :001 > c = Bunny.new(threaded: false, automatically_recover: false); c.start
 => #<Bunny::Session:0x7fa19d9f3a58 guest@127.0.0.1:5672, vhost=/, addresses=[127.0.0.1:5672]>
2.5.1 :002 > c.close
 => true

Because of that, and because the example uses Toxiproxy, I suspect the exception only occurs when RabbitMQ's heartbeat mechanism detects the timeout and closes the TCP connection, as it should. In that case it is expected that Bunny might get an I/O exception on a heartbeat (or any other) frame write, and with the settings used by the OP it is expected that there will be an exception on the main thread (or rather, the thread that instantiated the Bunny::Session). That's what the person who contributed threaded: false support intended to happen.

sbonebrake commented 6 years ago

The threaded option doesn't affect which thread the heartbeat thread raises its exception on. If this example were run with threaded: true, the reader loop would error first and likewise raise an exception on the main session thread. It seems that when automatically_recover is off, the reader loop or the heartbeat sender raises exceptions on the session thread whenever I/O errors occur.

It looks like HeartbeatSender#run will never be able to rescue IOError/Exception, because Transport#write_without_timeout either rescues and handles them itself (when automatic recovery is enabled) or raises them on the session thread.
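
Roughly, the control flow being described looks like this (a paraphrase with simplified and partly hypothetical names, not the verbatim Bunny source):

def write_without_timeout(frame)
  @writes_mutex.synchronize { @socket.write(frame.encode) }
rescue SystemCallError, IOError => e
  if automatically_recover?                      # hypothetical predicate
    @session.handle_network_failure(e)           # kicks off recovery
  else
    # Delivered to the thread that instantiated the session, not raised here.
    @session_thread.raise(Bunny::NetworkFailure.new(e.message, @session))
  end
end

# ...which leaves any rescue clause in HeartbeatSender#run unreachable:
def run
  loop do
    sleep @interval
    beat
  end
rescue IOError => _e
  # Dead code: the write path above never lets an IOError propagate here.
end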

michaelklishin commented 6 years ago

If automatic recovery is disabled, what else can Bunny do but throw an exception? It was explicitly told not to recover.

sbonebrake commented 6 years ago

At most, HeartbeatSender should mark the session as closed when there is an issue. Then the next attempt to use Bunny would fail with a ConnectionClosedError. How can it be by design to have unreachable code in HeartbeatSender?

It seems the Java client's HeartbeatSender doesn't even close the connection; it just swallows all IOExceptions in HeartBeatSender.java. I feel that's what the Ruby client should do too: HeartbeatSender should make a best effort to keep the connection active at a regular interval and swallow any exceptions that occur along the way. It looks like that's exactly what HeartbeatSender initially did: https://github.com/ruby-amqp/bunny/commit/55d57e38a31ae787d8693fe6b6acbe0ccd515653

michaelklishin commented 6 years ago

The purpose of heartbeats is not to keep the connection alive, it is to detect unavailable peers quicker.

michaelklishin commented 6 years ago

If automatic recovery is disabled, throwing exceptions is the best option I could think of, and I haven't changed my mind in quite a few years. If recovery is enabled, attempting to recover is the best course of action.

Heartbeat write failures could first close the connection. With automatic recovery enabled, that would still kick off a recovery attempt. When recovery is disabled, I don't know that it would be meaningfully different whether the next frame write produces an I/O exception or a ConnectionClosedError. The user has, in a sense, opted into handling any exceptions that may arise.

@acogoluegnes @lukebakken WDYT?

sbonebrake commented 6 years ago

I was referring to this section, http://www.rabbitmq.com/heartbeats.html#tcp-proxies, where it states that heartbeats can be used to generate network traffic to help with proxies like HAProxy. But you're right, there are several uses.

I'm unsure how the user "opted in" to handling exceptions here. Because automatically_recover is false? As implemented right now, if automatically_recover: false is used, Bunny can interrupt your application with an exception at any time after the connection is started. AFAIK there is no way to handle these exceptions on the main thread, so the only safe approach is to put Bunny in its own thread.

conn = Bunny.new("amqp://guest:guest@localhost:56722",
  heartbeat_timeout: 10,
  automatically_recover: false,
  threaded: false)
conn.start
# Now any heartbeat IO issues can cause exceptions and interrupt your application
MyApp.do_non_rabbit_important_tasks()

automatically_recover: true isn't always desirable. Let's say I have an application that publishes messages to RabbitMQ and there is a backup RabbitMQ server it can publish to in case of failure. In that case it would be ideal for a publish to the downed server to fail as quickly as possible so the message can be published to the backup server instead. If automatically_recover is enabled, the first publish will sleep @network_recovery_interval and then attempt to reopen the connection. It seems reasonable that we should be able to accomplish this with automatically_recover: false without "opting in" to exceptions in the main application from the heartbeat or reader_loop threads.
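
For illustration, the failover described above could look like this (a sketch; publish_with_failover, the hostnames, and the routing key are hypothetical):

require "bunny"

# Fail fast on the primary broker, then fall back to the backup.
def publish_with_failover(primary, backup, payload)
  primary.create_channel.default_exchange.publish(payload, routing_key: "messages")
rescue Bunny::NetworkFailure, Bunny::ConnectionClosedError
  backup.create_channel.default_exchange.publish(payload, routing_key: "messages")
end

primary = Bunny.new(host: "rabbit-primary", automatically_recover: false).start
backup  = Bunny.new(host: "rabbit-backup", automatically_recover: false).start
publish_with_failover(primary, backup, "hello")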

sbonebrake commented 6 years ago

If this issue is accepted, this would be a simple approach to fixing it, IMO: https://github.com/ruby-amqp/bunny/compare/master...sbonebrake:safe_heartbeats. Heartbeats would just make a best effort to send, and starting and stopping would be controlled only from the session. I believe this brings it closer to the Java implementation. The reader_loop could also be modified to check @last_activity_time, similar to how Java's NioLoop does. This change would ensure the HeartbeatSender never throws an exception on the session thread, which would make { automatically_recover: false, threaded: false } safe options to use. This discussion could be moved to the mailing list if that's more appropriate.
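
The idea in the linked branch, roughly (a sketch of the approach, not the actual diff):

# Best-effort heartbeat loop: write failures are logged and swallowed so
# they never surface on the session thread; only the session starts and
# stops this loop.
def run
  loop do
    sleep @interval
    begin
      beat
    rescue StandardError => e
      @logger.warn("Heartbeat write failed: #{e.message}") if @logger
      # Swallow the error; detecting the dead peer is left to the reader
      # loop and to the server's own heartbeat timeout.
    end
  end
end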

michaelklishin commented 6 years ago

What I don't understand is how it would be different from any other exception thrown on the "session thread" (the thread the session was instantiated in?). If you don't want recovery, exceptions will be thrown on the caller thread (IIRC that's the term we used in the docs). Is an I/O exception not indicative enough?

The Java client carries I/O exceptions as the shutdown reason, so in that sense they are not hidden from the user.

sbonebrake commented 6 years ago

If I'm running threaded: false, the only way to get an exception on the session thread should be an explicit call to a session method, which would allow me to set up basic exception handling.

Let's say this is a Rails app that publishes messages when it gets requests.

class MessageController < ActionController::Base
  def bunny
    return @bunny if @bunny
    @bunny = Bunny.new(host: "something",
                       automatically_recover: false,
                       threaded: false,
                       heartbeat_interval: 30)
    @bunny.start
  end

  def create
    # A channel is needed to declare queues; Session#start returns the session.
    bunny.create_channel.queue("messages").publish(params[:message_content])
  rescue Bunny::NetworkFailure => e
    logger.error(e)
    render :nothing => true, :status => :service_unavailable
    @bunny = nil
  end

  def index
    # Nothing Bunny related
    Messages.last(10)
  end
end

If I get a request to MessageController#create, it will start the Bunny session. If the heartbeat fails at any time after the create action, it will interrupt my Rails app, either in the running Unicorn process loop or possibly during MessageController#index. Using threaded: true has the same issue, but I can avoid that option, so I'm not as concerned. This is not a theoretical issue; it caused really big problems when we tested bringing RabbitMQ nodes down in our staging environment:

E, [2018-08-07T05:08:04.211313 #15112] ERROR -- : listen loop error: detected a network failure: closed stream (Bunny::NetworkFailure)
E, [2018-08-07T05:08:04.211480 #15112] ERROR -- : /lib/unicorn/http_server.rb:678:in `select'
E, [2018-08-07T05:08:04.211524 #15112] ERROR -- : /lib/unicorn/http_server.rb:678:in `worker_loop'
E, [2018-08-07T05:08:04.211559 #15112] ERROR -- : /lib/unicorn/http_server.rb:508:in `spawn_missing_workers'
E, [2018-08-07T05:08:04.211592 #15112] ERROR -- : /lib/unicorn/http_server.rb:132:in `start'
E, [2018-08-07T05:08:04.211622 #15112] ERROR -- : /bin/unicorn:126:in `<top (required)>'
E, [2018-08-07T05:08:04.211654 #15112] ERROR -- : /bin/unicorn:22:in `load'
E, [2018-08-07T05:08:04.211683 #15112] ERROR -- : /bin/unicorn:22:in `<top (required)>'

Even if I wrapped the whole Rails application in a rescue Bunny::NetworkFailure, the exception could surface in the non-Bunny MessageController#index and I would be forced to return a 403 or 500 there as well. As is, my only options are to close the connection after every publish or to disable heartbeats. Am I missing some other option combination that makes it safe to run like this?
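
For concreteness, those two workarounds would look something like this (a sketch; a heartbeat value of 0 opts out of heartbeats during negotiation):

# Workaround 1: disable heartbeats entirely.
conn = Bunny.new(host: "something",
                 automatically_recover: false,
                 threaded: false,
                 heartbeat_timeout: 0)
conn.start

# Workaround 2: open and close a connection around every publish so that
# no heartbeat thread outlives the request.
conn = Bunny.new(host: "something", automatically_recover: false, threaded: false)
conn.start
conn.create_channel.queue("messages").publish("payload")
conn.close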

sbonebrake commented 6 years ago

Have you had any time to think about this further? Maybe @acogoluegnes @lukebakken have thoughts on this?

lukebakken commented 6 years ago

I have been working on higher-priority items. When I get a couple hours free to really think about this issue, I'll check it out.

michaelklishin commented 6 years ago

@sbonebrake OK, I see what you are unhappy about: potentially unexpected exceptions. I think we can make Bunny's heartbeat sender do what its Java client counterpart does. I cannot guess how many people would prefer the current behavior, but we have already double-checked that the connection is marked as closed, so there is a reasonable degree of failure observability in place.

sbonebrake commented 6 years ago

Thanks @michaelklishin. The diff suggested above is a "least effort" fix and isn't the cleanest, but if you'd like me to turn it into a PR with tests, I can. I'm unsure whether you had other ideas about how to address it.

acogoluegnes commented 6 years ago

I confirm that the Java client's heartbeat implementation is quite passive on the sending side: it sends heartbeat frames in the background and indeed ignores write errors. Missed heartbeats from the server are counted and checked only when a read operation times out. I think this implementation is consistent with the heartbeat spec. It could be more aggressive in the way it detects failures, but failure detection is tricky in practice.
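
In Ruby terms, that read-side detection is roughly the following (a sketch with hypothetical names such as read_next_frame and handle_frame, not actual client code):

require "timeout"

missed_heartbeats = 0
loop do
  begin
    handle_frame(read_next_frame(timeout: heartbeat_interval))
    missed_heartbeats = 0   # any inbound traffic counts as a heartbeat
  rescue Timeout::Error
    missed_heartbeats += 1
    # Roughly two intervals with no inbound traffic: presume the peer dead.
    raise "missed heartbeats from server" if missed_heartbeats >= 2
  end
end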

I'm not sure that heartbeat sending errors should bubble up to application code, though.

michaelklishin commented 6 years ago

@sbonebrake please submit a PR and we'll discuss a specific solution there. I don't know if there can be a neat way to silence some exceptions ;)

sbonebrake commented 6 years ago

This can be closed now. You can also hit a similar problem if you run the spec "when the connection detects missed heartbeats without automatic recovery" with threaded: true, but in that case the exception comes from the reader loop. I'm not sure whether that's intended, for the same reasons listed above. It can be worked around with threaded: false, so I'm happy.

michaelklishin commented 6 years ago

threaded: true doesn't make much sense for the DIY recovery crowd. The logic is not very different: if you don't want recovery, Bunny probably shouldn't swallow all exceptions, at least not in the I/O loop.