Kyrremann closed this issue 5 years ago.
I am seeing the same thing on larger production bots. I am going to guess this is Slack-side, but it also looks like we're not seeing/handling the disconnect properly somehow. I would say first let's understand what we expect on a disconnect like this. An exception?
For celluloid we want to emit a :close, like here. So I would add debug code around there to find out how this is caused.
Thanks for the reply. We have a little over 1100 users in our team, but I don't think the problem started before more of the users began using my app. In the beginning it was only 10 users, but now we are just below 50 users.
One strange thing is that the bot goes down in the evening, when no one is at work.
About what we expect: I guess an exception with a reason would be a good start. I will also reach out to Slack and hear what they say.
I stuck a restart into my bots (eg. https://github.com/dblock/slack-strava/commit/bc293248bae678dd9299b1d2888443adedae4da8) and I am seeing about a dozen of these every hour on a bot system with hundreds of connections.
Did you hear back from Slack? Either way, we definitely don't have correct handling code here and the client thinks the connection is still open - maybe we can reproduce this, or at least attempt to handle this based on that data being received? What that data is, is a good question.
hey, there! i'm from slack and i'd like to help you folks get this issue resolved. while i do have some expertise in our RTM API and Websockets (i helped build our node implementation), my Ruby is a little rusty. as a first step, i'd like to reproduce this with minimal steps, in my own app/workspace. if anyone has any shortcuts for me, please share. otherwise i'll update here as i find things.
@dblock @Kyrremann I can also confirm the strange disconnects, but I'm not seeing any issues restarting via the latest master code.
We're not running close to hundreds of teams yet, and we get to the restart! handler just fine: https://github.com/slack-ruby/slack-ruby-bot-server/blob/master/lib/slack-ruby-bot-server/server.rb#L19
Attaching a log from Prod with names redacted to protect the innocent. The disconnects start happening around 10 PM UTC.
Oh whoops, just realized we're in the slack-ruby-client project, not slack-ruby-bot-server. Well, I'll leave my comment in case it's helpful.
Hi @aoberoi and thanks for jumping in!
I think the easiest thing is to run a full-blown bot. Alternatively we can write some code to open hundreds of RTM connections and see what happens. Here's a full bot:
Clone https://github.com/slack-ruby/slack-shellbot.
Create a Slack app with Slack and create .env with the following.
SLACK_CLIENT_ID=...
SLACK_CLIENT_SECRET=...
Run bundle install, make sure you have a local MongoDB, then foreman start. This will start a local instance of the bot. Navigate to http://localhost:5000 and you can install the bot with any team. Try DMing ls or help to the test bot and it should respond.
Registration creates a Team object. I think this happens more often with many connections, so I would just add this to app.rb:
team = Team.first
100.times do
  SlackShellbot::Service.instance.start!(team)
end
You should get 100 instances of the bot running.
Here's a recent down log from a bot with 297 teams.
W, [2018-06-13T22:08:38.463423 #212] WARN -- : DOWN: game=pong, name=, id=T02BTFT45
W, [2018-06-13T22:08:39.509848 #212] WARN -- : DOWN: game=pong, name=, id=T5ATQ5K6W
W, [2018-06-13T22:08:47.103501 #212] WARN -- : DOWN: game=pong, name=, id=T04LD63KU
W, [2018-06-13T22:08:47.571911 #212] WARN -- : DOWN: game=pong, name=, id=T6GF0T8UT
W, [2018-06-13T22:08:54.902196 #212] WARN -- : DOWN: game=pong, name=, id=TAV9ZRUBY
W, [2018-06-13T22:09:04.200495 #212] WARN -- : DOWN: game=tic-tac-toe, name=, id=TB1QA85PH
W, [2018-06-13T22:09:06.035861 #212] WARN -- : DOWN: game=pong, name=, id=T024XAHD7
W, [2018-06-13T22:09:06.789929 #212] WARN -- : DOWN: game=pong, name=, id=T0JT3RRV2
W, [2018-06-13T22:09:13.165823 #212] WARN -- : DOWN: game=pong, name=, id=T06N1SCTB
I, [2018-06-13T22:09:38.589368 #212] INFO -- : RESTART: game=pong, name=, id=T02BTFT45
I, [2018-06-13T22:09:38.589496 #212] INFO -- : Starting team game=pong, name=, id=T02BTFT45.
...
I left team IDs in there for you @aoberoi in case you have logs to look at on your side. These all went offline in the 5 minutes preceding this log.
Note that we also see successful disconnects/restarts.
I, [2018-06-14T00:10:29.391865 #212] INFO -- : T02990GNF: socket closed, restarting ...
I, [2018-06-14T00:10:29.586615 #212] INFO -- : Successfully connected team (T02990GNF) to https://....slack.com.
If you want to add more logging to the code that handles disconnects, put it here.
@dblock I forgot to contact the Slack team, but since @aoberoi joined the issue, I'll wait.
I haven't had time to work on this any more, but I have now had the bot running since Monday. Also, I'm only running one team.
Any luck with this? Our bot disconnects daily. It's an important piece of our infrastructure that we use to coordinate deployments.
My bot disconnects (silently) about once every three days, on a team with only a handful of users. Not sure if that helps, but at least it's a data point.
I sent a support ticket to the Slack team this morning. I'll let you know when I hear something. I'm afraid we won't get much help, as it's kinda outside their scope to support us (maybe?).
If someone understands what's being sent on the wire, or wants to try to find out, that would be helpful. In the end the websocket library simply doesn't see the disconnect.
I'm not really able to help with this issue directly, but there have been some rumblings about using async for concurrency, so I thought I'd pass along what I've done in the past to make this work.
https://gist.github.com/ioquatix/c24f107e2cc7f48e571a37e8e93b0cda?ts=2
Just FYI. Yes, I know it's a different gem, but it can't be all that different.
I'm sorry I've been absent from this issue for so long. I was traveling over the last 2 weeks for Slack and hadn't had any time to focus on what was going on here.
I tracked down logs on some of the specific disconnects that @dblock provided. From our logs, there was nothing abnormal.
In that sample, we observed 2-3 disconnects a day per bot user, which is slightly more than usual, but not extraordinary. RTM clients are expected to gracefully handle disconnects by reconnecting since the server may choose to close connections at any time when it's load balancing or updating.
On the days this was reported, Slack was experiencing some connectivity issues due to upstream AWS problems, but no degraded service. This is a potential explanation for the increased need for disconnects.
For some of the disconnects, the server recorded a cause=inactivity, which is the result of our servers not receiving a pong message for a ping it sent to the websocket client.
I'll keep trying to reproduce this issue on my side (still have not observed it). I'd like to get a better understanding of how this library deals with routine disconnects (as described above).
I get the impression that the underlying library is unaware of a disconnect until it attempts to send a message and the send fails; this is less than ideal. I'd like to help improve that situation if possible. Even if we cannot know until a message is sent, it would be better to queue the message that needs to be sent, initiate a reconnect, sync upon reconnect, and then resume sending the message.
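To make that concrete, here is a rough sketch of the send path being described (hedged: send_with_recovery, connection_open?, reconnect!, and deliver are hypothetical names for illustration, not slack-ruby-client internals):

def send_with_recovery(payload)
  # Hedged sketch of "queue the message, reconnect, then resume sending"
  # instead of letting the send fail outright. All names here are hypothetical.
  @outbox ||= []
  @outbox << payload
  reconnect! unless connection_open?           # hypothetical liveness check + reconnect
  deliver(@outbox.shift) until @outbox.empty?  # drain the queue once connected again
end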
I'd also like to verify the ping/pong behavior, since there's a non-zero number of inactivity disconnects.
Is anyone experiencing a much larger volume of disconnects (more than 5 a day per bot user)?
Are many disconnects (either across bot users or with the same bot user) occurring in small windows of time? For example, three disconnects that happen at the exact same time and then no disconnects for the entire day after that.
If you can answer yes to any of the questions above, or have any new information, please send me the timestamp of your disconnects and the bot user ID and team ID you are connecting to. If you don't want to share that information here, you can email me at ankur@slack-corp.com
If you can assist with any of the action items above, that would speed things up. I'm not a ruby expert so this would take me some time.
I can't answer yes to any of the questions, but I was curious if other people were also running on Google Cloud Engine?
My bots are all on DigitalOcean.
I also can't answer yes, unfortunately; my bot runs locally here on-site (Ubuntu).
Just wanted to chime in and say I'm having the same problem, and am on Heroku.
My logs show that the bot is still logging conversations in the slack channel, but the Slack desktop client does not show its presence and the bot does not respond to commands.
Restarting the bot does fix the issue.
Not sure if this is a step forward, but I added a ping function that would try to stop and restart my client when it was offline from Slack.
Sadly that didn't work, as I ended up with a DeadActorError from Celluloid. The line that crashed was instance.stop!, which makes me guess that Celluloid already knows the bot is dead. Going to change the code to remove the .stop! and see if it works with only start_async.
The code I used is this:
get '/bot_presence' do
  client = ::Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])
  status = client.users_getPresence(user: ENV['SLACK_BOT_ID'])
  if status['presence'] == 'away'
    instance = Standbot::Bot.instance
    instance.stop!
    instance.start_async
    json(message: 'Restarting bot successfully')
    return
  end
  json(message: status)
end
And the stack trace is this:
2018-07-18 06:31:00 default[20180716t093850] "GET /api/v2/bot_presence" 500
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | 2018-07-18 06:36:00 - Celluloid::DeadActorError - attempted to call a dead actor: connected?:
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/celluloid-0.17.3/lib/celluloid/proxy/sync.rb:9:in `method_missing'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:68:in `started?'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:63:in `stop!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-bot-0.11.1/lib/slack-ruby-bot/server.rb:48:in `stop!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/standweb/routes/api.rb:92:in `block (3 levels) in <class:Web>'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `block in compile!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (3 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1011:in `route_eval'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (2 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1040:in `block in process_route'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `catch'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `process_route'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:990:in `block in route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `each'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1097:in `block in dispatch!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1094:in `dispatch!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `block in call!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `call!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:913:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/xss_header.rb:18:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/path_traversal.rb:16:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/json_csrf.rb:26:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/frame_options.rb:31:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:232:in `context'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:226:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/logger.rb:15:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/common_logger.rb:33:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:231:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:224:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/head.rb:12:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:194:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1958:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/configuration.rb:225:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:632:in `handle_request'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:446:in `process_client'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:306:in `block in run'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/thread_pool.rb:120:in `block in spawn_thread'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | 10.0.0.1, 10.0.0.1, 2002:ae8:f509:: - - [18/Jul/2018:06:36:00 +0000] "GET /api/v2/bot_presence HTTP/1.1" 500 30 0.1545
I failed to find how to "see" this in celluloid-io and it's apparently no longer maintained. Maybe someone can add support for the most-bestest async io library du jour? https://github.com/slack-ruby/slack-ruby-client/issues/210
My bot is also hosted on DigitalOcean. I had a restart once a day, which seemed to fix it for quite a while, but recently the problem has been getting worse. I just restarted the bot in the afternoon only to find it offline in the evening. Is there no way we can take an active approach to determining whether the connection is still alive by sending our own PING to the Slack server every XX minutes and expecting some response back? If there is no response, just restart the connection? Wouldn't that be a workaround if the library is not properly discovering a disconnect?
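For what it's worth, here is a rough sketch of that idea (hedged: it assumes the real-time client exposes the RTM-level ping method and dispatches the server's pong reply as a :pong event; restart_bot! is a hypothetical helper and the intervals are arbitrary):

last_pong = Time.now

client.on :pong do |_data|
  last_pong = Time.now                          # server answered our RTM-level ping
end

Thread.new do
  loop do
    sleep 60
    begin
      client.ping                               # assumed: sends {"type":"ping"} over the websocket
    rescue StandardError
      restart_bot!                              # hypothetical: sending failed outright, restart
    end
    restart_bot! if Time.now - last_pong > 180  # hypothetical: no pong seen for 3 minutes, restart
  end
end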
I ended up doing it like this, and just ping this endpoint every 5 minutes.
get '/bot_presence' do
  client = ::Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])
  # Ask the Web API whether the bot user currently shows as present.
  status = client.users_getPresence(user: ENV['SLACK_BOT_ID'])
  if status['presence'] == 'away'
    # The bot has silently dropped off: start a fresh async client
    # (no stop! here, since that raised DeadActorError above).
    instance = Standbot::Bot.instance
    instance.start_async
    json(message: 'Restarting bot successfully')
    return
  end
  json(message: status)
end
@Kyrremann Seems like that may work, but there will still be instances of the object that are still "alive," but not connected to Slack, right?
Over the past week, I've seen my bot be down for some time and magically reconnect. There "seems" to be some improvement.
You should take a look at the workarounds that were introduced in https://github.com/slack-ruby/slack-ruby-bot/issues/107, i.e. mine: https://github.com/slack-ruby/slack-ruby-bot/issues/107#issuecomment-398120946.
It would be most helpful if someone took a stab at #210.
I have spent some time on this but have not made any progress. My questions and comments so far are in https://github.com/celluloid/celluloid-io/issues/187#issuecomment-414131467.
Also opened https://github.com/faye/websocket-driver-ruby/issues/61; I am thinking the driver may be the problem here.
Debugging socket data with:
require 'slack/real_time/concurrency/celluloid'
require 'celluloid/internals/logger'

module Slack
  module RealTime
    module Concurrency
      module Celluloid
        class Socket < Slack::RealTime::Socket
          include ::Celluloid::Internals::Logger

          def log_info(message)
            if message == @message
              @count += 1
            else
              if @message && @message.is_a?(Array) && @count && @count > 1
                logger.info(@message.concat(["repeated #{@count} times"]))
              end
              @count = 1
              @message = message
              logger.info(message)
            end
          end

          def build_driver
            @logger = Logger.new(STDOUT)
            @logger.level = Logger::INFO
            ::WebSocket::Driver.client(self).tap do |ws|
              ws.on :open do
                log_info [ws.object_id, :server_open]
              end
              ws.on :message do |message|
                log_info [ws.object_id, :server_message, message.data]
              end
              ws.on :close do |close|
                log_info [ws.object_id, :server_close, close.code, close.reason]
              end
              ws.on :error do |error|
                log_info [ws.object_id, :server_error, error.message]
              end
              ws.on :ping do |ping|
                log_info [ws.object_id, :server_ping, ping.data]
              end
              ws.on :pong do |pong|
                log_info [ws.object_id, :server_pong, pong.data]
              end
            end
          end
        end
      end
    end
  end
end
If you're using slack-ruby-bot-server I have a PR with a ping worker that lives within the bot and avoids spawning new server instances in https://github.com/slack-ruby/slack-ruby-bot-server/pull/74. Please give it a try.
So I implemented async-websocket support with some help from @ioquatix in https://github.com/slack-ruby/slack-ruby-client/pull/219 and I'm still seeing the same behavior where the bot doesn't notice it has been disconnected, so I think something else is going on here beyond finger-pointing Celluloid :(
Can you tell me how to repro the issue and I will debug it on my end.
@dblock thanks for all the work on this!
Looking through the backtraces, it seems like current_actor from Celluloid hasn't been "recycled"/restarted. Perhaps rescuing this exception raised by Celluloid and recursively calling the run_loop could work?
Maybe the same logic could work with async-websocket as well.
Sorry, other than this I don't have any concrete ideas of how to fix it 🤔
MAX_RETRIES = 100

def run_loop(retries = 0)
  raise unless retries < MAX_RETRIES
  @closing = false
  @socket = build_socket
  @connected = @socket.connect
  driver.start
  loop { read } if socket
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE => e
  logger.debug("#{self.class}##{__method__}") { e }
  driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection')) unless @closing
ensure
  begin
    current_actor.terminate if current_actor.alive?
  rescue Celluloid::DeadActorError
    run_loop(retries + 1)
  rescue StandardError
    nil
  end
end
@ioquatix https://github.com/slack-ruby/slack-ruby-client/issues/208#issuecomment-397131196 is probably the simplest way to reproduce
@dblock is there any way we can make an integration test which reproduces the behaviour? Even if we just have to leave it running for a few days?
Probably. I would start multiple instances of the client and send messages (DMs?) via Slack, making sure each instance can see them. Or just make sure you're getting server pings after a certain duration. The issue is that we're just seeing a bunch of disconnected clients, for no apparent reason, that don't signal their death in any way; they just sit there.
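A rough sketch of what such a long-running check could look like (hedged: the client count, the silence threshold, and the idea of treating prolonged silence as a suspected silent disconnect are assumptions, not an existing test):

require 'slack-ruby-client'

Slack.configure { |config| config.token = ENV['SLACK_API_TOKEN'] }

# Open several RTM clients and record the last time each one saw any event.
last_seen = {}

clients = Array.new(5) do |i|
  last_seen[i] = Time.now
  Slack::RealTime::Client.new.tap do |client|
    client.on :hello do
      puts "client #{i} connected"
      last_seen[i] = Time.now
    end
    client.on :message do |_data|
      last_seen[i] = Time.now
    end
    client.start_async
  end
end

# Keep references to the clients alive and flag any that have gone
# completely silent for 10 minutes (assumed threshold).
loop do
  sleep 60
  last_seen.each do |i, seen_at|
    silent_for = Time.now - seen_at
    warn "client #{i} silent for #{silent_for.to_i}s" if silent_for > 600
  end
end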
The disconnect behavior stopped across all my bots a couple of days ago at the same time (exact last timestamps look very similar: 2018-09-01T06:09:00.720693, 2018-09-01T06:07:52.146347, UTC).
Similar scenario. Haven't needed daily restart since: 2018-08-31T15:40:00
Perhaps this was on the Slack side?
Started happening again this AM EST.
I'm really bummed about this issue.
My plan to work around it is to 1) tell my team that when the chatbot is silent they need to restart it, and 2) watch this GH issue; when it closes, inform my team that we shouldn't need to restart our bots anymore. I decided to do that instead of adding automatic "restart process every N minutes" code because I have the perception that this is a real problem that Slack is eventually going to address (hopefully in the short term).
I wanted to see if others, especially Slack, shared that view.
It sounds like Slack (the company) has acknowledged the presence of this Slack-side bug via @aoberoi's contributions to this thread. Ankur, is that accurate? Is there another public place where this issue is being tracked, or is this the best one to date?
To add confidence that this is in fact a Slack-side issue, as opposed to something specific to the "slack-ruby" gems like slack-ruby-client, I played around with a really minimal old gem that talks to the Slack Real Time API, slack-rtmapi, to see if I could reproduce the problem. I did reproduce the problem: it went silent after a couple of days too.
🤞 that Slack/Ankur has some feedback on this to help me and others here understand when we might be able to expect a server-side fix that will resolve this issue (and likely many others spread across other GitHub projects that depend on the Slack Real Time API).
Thanks!
We're in the exact same boat as you are @joegoggins.
There's a really thoughtful discussion on this in https://github.com/faye/websocket-driver-ruby/issues/61#issuecomment-419256695. The slack-ruby-client library will help via https://github.com/slack-ruby/slack-ruby-client/issues/223 (volunteers to implement welcome), but in the meantime you should/could add a status monitoring thread from https://github.com/slack-ruby/slack-ruby-bot-server/blob/master/lib/slack-ruby-bot-server/ping.rb if you're not using that library.
Awesome! Thanks for the update @dblock, I appreciate the expanded list of technical options here and added context about the nature of this problem. I will stick with my manual work-around for now, but watch these issues closely and likely adopt new tooling when it's in beta or merged/released. 👌
We recently worked around this bug by implementing a ping as suggested above.
Basically we start the Slack real-time client asynchronously, and then separately use the Slack web client to periodically check whether our bot is still online. If the bot ever goes offline, we shut down the real-time client and reconnect it.
Over the past week our bot has gone offline three times and been successfully reconnected by this technique. It's not ideal but it seems to be an effective workaround for us.
Code is here: https://github.com/carbonfive/socrates/pull/4/files
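For readers not using that repository, a minimal sketch of the same watchdog idea (hedged: the interval, the environment variable names, and restart_realtime_client! are assumptions, not the linked implementation):

require 'slack-ruby-client'

web = Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])

# Periodically check the bot's presence via the Web API; if it shows as
# away, tear down and restart the async real-time client.
Thread.new do
  loop do
    sleep 300
    begin
      presence = web.users_getPresence(user: ENV['SLACK_BOT_ID'])
      restart_realtime_client! if presence['presence'] == 'away' # hypothetical helper
    rescue Slack::Web::Api::Errors::SlackError => e
      warn "presence check failed: #{e.message}"
    end
  end
end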
Our current workaround for this is to fire off a dyno restart on Heroku. It goes off every hour, so we have about 15 seconds of downtime every hour, and we have yet to see the disconnects since doing this. Known downtime vs. unknown is always better for us. For larger, more interactive teams this may not be ideal. We'll be looking into @mattbrictson's solution over the coming weeks.
FYI the ping worker in slack-ruby-bot-server is stable and production tested.
If you want to experiment try https://github.com/slack-ruby/slack-ruby-client/pull/226 that is being coded by someone working at Slack, which will eventually get merged and is implementing a proper ping in slack-ruby-client. Would love more people testing that implementation before we release anything.
I've merged a lower level ping implementation for async-websocket from https://github.com/slack-ruby/slack-ruby-client/pull/226, please give it a try. Celluloid and EM should be fairly straightforward from there if anyone wants to give it a shot.
@dblock I've tried out the code on master, but it doesn't quite solve the problem. I'm still having my bot crash with
Slack::RealTime::Client::ClientNotStartedError
/app/vendor/bundle/ruby/2.3.0/bundler/gems/slack-ruby-client-8702c27c749f/lib/slack/real_time/client.rb:108:in `run_ping!'
/app/vendor/bundle/ruby/2.3.0/bundler/gems/slack-ruby-client-8702c27c749f/lib/slack/real_time/concurrency/async.rb:23:in `block (2 levels) in start_async'
/app/vendor/bundle/ruby/2.3.0/gems/timers-4.1.2/lib/timers/timer.rb:98:in `fire'
I'm guessing that started? is false for whatever reason here, so it's re-raising the error. I don't know a ton about this code -- restart_async certainly sounds like it would be asynchronous; perhaps line 112 is firing before the socket has finished restarting/connecting? (Which the tests would not catch, since they just mock the socket.)
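If that guess is right, one band-aid would be to treat ClientNotStartedError during a ping tick as "restart still in flight" and skip that tick rather than re-raising (a hedged sketch against the names in the trace above; run_ping_guarded! is hypothetical, not a proposed patch):

def run_ping_guarded!
  run_ping! # the library method from the trace above
rescue Slack::RealTime::Client::ClientNotStartedError
  # Hypothetical: assume the socket simply hasn't finished restarting yet
  # and try again on the next timer tick instead of crashing the timer.
  logger.debug('Skipping ping: client not (re)started yet.')
end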
In the last week my Slack bot has started to disconnect from our Slack team. There are no logs of the disconnect, and no stack traces. I've started running the bot in debug mode, and the last thing it logged before it disconnected was the following.
I'm not sure how to debug this myself, and I have no idea what is wrong.
My setup is running on Google App Engine, with one process for a Sinatra web app and a different process running the Slack bot. Neither of the processes stops, but the bot is offline in Slack.
The code for the project is available here: https://github.com/navikt/standbot
Any help would be welcome!