Sporadic disconnects (close events) at start up

tikh commented 8 years ago

I have an issue with starting up multiple bot clients after application start up. Sometimes, "CLOSE" event is being fired for many clients and regular reconnection logic doesn't help (while it works fine in event of temporary network disconnect). "CLOSE" events have "code" value 1006. It doesn't happen every time, and it have something to do with time spent in callbacks, I suppose.

A snippet used to reproduce the issue - https://gist.github.com/tikh/50d5cf34481330759a736aa351e3ee2e. Simply adding a "sleep" to "hello" callback makes the issue happen. But even without sleep this happens sometimes.

In the example, faye-websocket gem was used and 300 connections to different Slack teams.

dblock commented 8 years ago

This is consistent with the behavior observed with faye-websocket. We try to open a lot of connections, and the client can't process them fast enough, causing these close events, a reconnect and finally a success. With many connections it happens a lot more often, and can throw a service with 300 teams in a loop. It's more frequent if you make the initial connection under EM.defer, but that hits https://github.com/faye/websocket-driver-ruby/issues/38.

You definitely want to handle the close event and try to reconnect, like https://github.com/dblock/slack-ruby-bot-server/blob/8ba27e827cfcc75bf6d015e5117c12c72fa33096/lib/slack-ruby-bot-server/server.rb#L15.

I had switched to using celluloid-io for all of my projects, which doesn't do the rtm.start connection in parallel making it a little slower to start (see #87), but works very consistently.

tikh commented 8 years ago

The start up issue was resolved after switching to celluloid-io gem. However, I wanted to test a reconnection logic and found out that close event never happens (at least in a few first minutes). I tried two things - turning off wi-fi on my laptop and deleting an integration in Slack. I expected to have close event earlier as WebSockets ping interval is 30 seconds by default. Also, I tried to do active periodic pinging and got this exception:

[2016-06-06 12:03:13.276][PID=21374] ERROR Actor crashed!
Errno::ETIMEDOUT: Operation timed out
    /Users/tikh/.rvm/rubies/ruby-2.3.0/lib/ruby/2.3.0/openssl/buffering.rb:178:in `sysread_nonblock'
    /Users/tikh/.rvm/rubies/ruby-2.3.0/lib/ruby/2.3.0/openssl/buffering.rb:178:in `read_nonblock'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:42:in `block in sysread'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:390:in `synchronize'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:40:in `sysread'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:129:in `readpartial'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/slack-ruby-client-0.7.2/lib/slack/real_time/concurrency/celluloid.rb:42:in `read'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/slack-ruby-client-0.7.2/lib/slack/real_time/concurrency/celluloid.rb:36:in `block in run_loop'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/slack-ruby-client-0.7.2/lib/slack/real_time/concurrency/celluloid.rb:36:in `loop'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/slack-ruby-client-0.7.2/lib/slack/real_time/concurrency/celluloid.rb:36:in `run_loop'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/slack-ruby-client-0.7.2/lib/slack/real_time/concurrency/celluloid.rb:29:in `connect!'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/call/sync.rb:16:in `dispatch'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
    /Users/tikh/.rvm/gems/ruby-2.3.0/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'

dblock commented 8 years ago

Hmmm... check out https://github.com/celluloid/celluloid-io/issues/176 and https://github.com/dblock/slack-ruby-client/issues/90, does this explain any of this?

tikh commented 8 years ago

Actually if Slack integration is deleted, 'close' event happens correctly. When I was trying to simulate connection problems by turning off Wi-Fi and had periodic pings, I got exception above. Probably, Errno::ETIMEDOUT exceptions should be added to the rescue block where you already handle EOFError (https://github.com/dblock/slack-ruby-client/blob/master/lib/slack/real_time/concurrency/celluloid.rb#L38). Does it sound right? Or erroneous exceptions situations are supposed to be handle in different places?

dblock commented 8 years ago

Yes, that sounds just right! Maybe it should be a whole other range of exceptions, would appreciate some research too in other network conditions. Would love a PR!

dblock commented 8 years ago

I think another one is Errno:EPIPE and Errno::ECONNRESET:

E, [2016-06-07T19:05:15.066577 #154] ERROR -- : Actor crashed!
Errno::EPIPE: Broken pipe
    /app/vendor/ruby-2.2.4/lib/ruby/2.2.0/openssl/buffering.rb:383:in `syswrite_nonblock'
    /app/vendor/ruby-2.2.4/lib/ruby/2.2.0/openssl/buffering.rb:383:in `write_nonblock'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:62:in `block in syswrite'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:390:in `synchronize'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:59:in `syswrite'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:356:in `do_write'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-io-0.17.3/lib/celluloid/io/stream.rb:250:in `write'
    /app/vendor/bundle/ruby/2.2.0/gems/slack-ruby-client-0.7.3/lib/slack/real_time/concurrency/celluloid.rb:56:in `write'
    /app/vendor/bundle/ruby/2.2.0/gems/websocket-driver-0.6.3/lib/websocket/driver/hybi.rb:232:in `send_frame'
    /app/vendor/bundle/ruby/2.2.0/gems/websocket-driver-0.6.3/lib/websocket/driver/hybi.rb:193:in `frame'
    /app/vendor/bundle/ruby/2.2.0/gems/websocket-driver-0.6.3/lib/websocket/driver/hybi.rb:156:in `close'
    /app/vendor/bundle/ruby/2.2.0/gems/slack-ruby-client-0.7.3/lib/slack/real_time/concurrency/celluloid.rb:45:in `close'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `public_send'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/calls.rb:28:in `dispatch'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/call/sync.rb:16:in `dispatch'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:50:in `block in dispatch'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/cell.rb:76:in `block in task'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/actor.rb:339:in `block in task'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/task.rb:44:in `block in initialize'
    /app/vendor/bundle/ruby/2.2.0/gems/celluloid-0.17.3/lib/celluloid/task/fibered.rb:14:in `block in create'

dblock commented 8 years ago

For easy repro cases, https://gist.github.com/dblock/1250a5497ff3c7af81c05964c459ff86.

dblock commented 8 years ago

Closed the wrong issue, reopening ;)

slack-ruby / slack-ruby-client

Sporadic disconnects (close events) at start up #96