Kyrremann closed this issue 5 years ago.
I am seeing the same thing on larger production bots. I am going to guess this is Slack-side, but it also looks like we're not seeing/handling the disconnect properly somehow. I would say first let's understand what we expect on a disconnect like this. An exception?
For celluloid we want to emit a :close, like here. So I would add debug code around there to find out how this is caused.
Thanks for the reply. We have a little over 1100 users in our team, but I don't think the problem started before more of the users began using my app. In the beginning it was only 10 users, but now we are just below 50 users.
One strange thing is that the bot goes down in the evening, when no one is at work.
About what we expect: I guess an exception with a reason would be a good start. I will also reach out to Slack and hear what they say.
I stuck a restart into my bots (eg. https://github.com/dblock/slack-strava/commit/bc293248bae678dd9299b1d2888443adedae4da8) and I am seeing about a dozen of these every hour on a bot system with hundreds of connections.
Did you hear back from Slack? Either way, we definitely don't have correct handling code here and the client thinks the connection is still open - maybe we can reproduce this, or at least attempt to handle this based on that data being received? What that data is, is a good question.
hey, there! i'm from slack and i'd like to help you folks get this issue resolved. while i do have some expertise in our RTM API and Websockets (i helped build our node implementation), my Ruby is a little rusty. as a first step, i'd like to reproduce this with minimal steps, in my own app/workspace. if anyone has any shortcuts for me, please share. otherwise i'll update here as i find things.
@dblock @Kyrremann I can also confirm the strange disconnects, but I'm not seeing any issues restarting via the latest master code.
We're not running close to hundreds of teams yet, and we get to the restart! handler just fine: https://github.com/slack-ruby/slack-ruby-bot-server/blob/master/lib/slack-ruby-bot-server/server.rb#L19
Attaching a log from Prod with names redacted to protect the innocent. The disconnects start happening around 10 PM UTC.
Oh whoops, just realized we're in the slack-ruby-client project, not slack-ruby-bot-server. Well, I'll leave my comment in case it's helpful.
Hi @aoberoi and thanks for jumping in!
I think the easiest thing is to run a full-blown bot. Alternatively we can write some code to open hundreds of RTM connections and see what happens. Here's a full bot:
Clone https://github.com/slack-ruby/slack-shellbot.
Create a Slack app with Slack and create .env with the following.
SLACK_CLIENT_ID=...
SLACK_CLIENT_SECRET=...
Run bundle install, make sure you have a local MongoDB, then foreman start. This will start a local instance of the bot. Navigate to http://localhost:5000 and you can install the bot with any team. Try DMing ls or help to the test bot and it should respond.
Registration creates a Team object. I think this happens more often with many connections, so I would just add this to app.rb:
team = Team.first
100.times do
  SlackShellbot::Service.instance.start!(team)
end
You should get 100 instances of the bot running.
Here's a recent down log from a bot with 297 teams.
W, [2018-06-13T22:08:38.463423 #212] WARN -- : DOWN: game=pong, name=, id=T02BTFT45
W, [2018-06-13T22:08:39.509848 #212] WARN -- : DOWN: game=pong, name=, id=T5ATQ5K6W
W, [2018-06-13T22:08:47.103501 #212] WARN -- : DOWN: game=pong, name=, id=T04LD63KU
W, [2018-06-13T22:08:47.571911 #212] WARN -- : DOWN: game=pong, name=, id=T6GF0T8UT
W, [2018-06-13T22:08:54.902196 #212] WARN -- : DOWN: game=pong, name=, id=TAV9ZRUBY
W, [2018-06-13T22:09:04.200495 #212] WARN -- : DOWN: game=tic-tac-toe, name=, id=TB1QA85PH
W, [2018-06-13T22:09:06.035861 #212] WARN -- : DOWN: game=pong, name=, id=T024XAHD7
W, [2018-06-13T22:09:06.789929 #212] WARN -- : DOWN: game=pong, name=, id=T0JT3RRV2
W, [2018-06-13T22:09:13.165823 #212] WARN -- : DOWN: game=pong, name=, id=T06N1SCTB
I, [2018-06-13T22:09:38.589368 #212] INFO -- : RESTART: game=pong, name=, id=T02BTFT45
I, [2018-06-13T22:09:38.589496 #212] INFO -- : Starting team game=pong, name=, id=T02BTFT45.
...
I left team IDs in there for you @aoberoi in case you have logs to look at on your side. These all went offline in the 5 minutes preceding this log.
Note that we also see successful disconnects/restarts.
I, [2018-06-14T00:10:29.391865 #212] INFO -- : T02990GNF: socket closed, restarting ...
I, [2018-06-14T00:10:29.586615 #212] INFO -- : Successfully connected team (T02990GNF) to https://....slack.com.
If you want to add more logging to the code that handles disconnects, put it here.
@dblock I forgot to contact the Slack team, but since @aoberoi joined the issue, I'll wait.
I haven't had time to work on this any more, but I have now had the bot running since Monday. Also, I'm only running one team.
Any luck with this? Our bot disconnects daily. It's an important piece of our infrastructure that we use to coordinate deployments.
My bot disconnects (silently) about once every three days, on a team with only a handful of users. Not sure if that helps, but at least it's a data point.
I sent a support ticket to the Slack team this morning. I'll let you know when I hear something. I'm afraid we won't get much help, as it's kinda outside their scope to support us (maybe?).
If someone understands what's being sent on the wire, or wants to try to find out, that would be helpful. In the end the websocket library simply doesn't see the disconnect.
I'm not really able to help with this issue directly, but there have been some rumblings about using async for concurrency, so I thought I'd pass along what I've done in the past to make this work.
https://gist.github.com/ioquatix/c24f107e2cc7f48e571a37e8e93b0cda?ts=2
Just FYI. Yes, I know it's a different gem, but it can't be all that different.
I'm sorry I've been absent from this issue for so long. I was traveling over the last 2 weeks for Slack and hadn't had any time to focus on what was going on here.
I tracked down logs on some of the specific disconnects that @dblock provided. From our logs, there was nothing abnormal.
In that sample, we observed 2-3 disconnects a day per bot user, which is slightly more than usual, but not extraordinary. RTM clients are expected to gracefully handle disconnects by reconnecting since the server may choose to close connections at any time when it's load balancing or updating.
On the days this was reported, Slack was experiencing some connectivity issues due to upstream AWS problems, but no degraded service. This is a potential explanation for the increased need for disconnects.
For some of the disconnects, the server recorded a cause=inactivity, which is the result of our servers not receiving a pong message for a ping it sent to the websocket client.
I'll keep trying to reproduce this issue on my side (still have not observed it). I'd like to get a better understanding of how this library deals with routine disconnects (as described above).
I get the impression that the underlying library is unaware of a disconnect until it attempts to send a message and the send fails; this is less than ideal. I'd like to help improve that situation if possible. Even if we cannot know until a message is sent, it would be better to queue the message that needs to be sent, initiate a reconnect, sync upon reconnect, and then resume sending the message.
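To make that concrete, here is a rough sketch of the send path being described (hedged: send_with_recovery, connection_open?, reconnect!, and deliver are hypothetical names for illustration, not slack-ruby-client internals):

def send_with_recovery(payload)
  # Hedged sketch of "queue the message, reconnect, then resume sending"
  # instead of letting the send fail outright. All names here are hypothetical.
  @outbox ||= []
  @outbox << payload
  reconnect! unless connection_open?           # hypothetical liveness check + reconnect
  deliver(@outbox.shift) until @outbox.empty?  # drain the queue once connected again
end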
I'd also like to verify the ping/pong behavior, since there's a non-zero number of inactivity disconnects.
Is anyone experiencing a much larger volume of disconnects (more than 5 a day per bot user)?
Are many disconnects (either across bot users or with the same bot user) occurring in small windows of time? For example, three disconnects that happen at the exact same time and then no disconnects for the entire day after that.
If you can answer yes to any of the questions above, or have any new information, please send me the timestamp of your disconnects and the bot user ID and team ID you are connecting to. If you don't want to share that information here, you can email me at ankur@slack-corp.com
If you can assist with any of the action items above, that would speed things up. I'm not a ruby expert so this would take me some time.
I can't answer yes to any of the questions, but I was curious if other people were also running on Google Cloud Engine?
My bots are all on DigitalOcean.
I also can't answer yes, unfortunately; my bot runs locally here on-site (Ubuntu).
Just wanted to chime in and say I'm having the same problem, and am on Heroku.
My logs show that the bot is still logging conversations in the slack channel, but the Slack desktop client does not show its presence and the bot does not respond to commands.
Restarting the bot does fix the issue.
Not sure if this is a step forward, but I added a ping function that would try to stop and restart my client when it was offline from Slack.
Sadly that didn't work, as I ended up with a DeadActorError from Celluloid. The line that crashed was instance.stop!, which makes me guess that Celluloid already knows the bot is dead. Going to change the code to remove the .stop! and see if it works with only start_async.
The code I used is this:
get '/bot_presence' do
  client = ::Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])
  status = client.users_getPresence(user: ENV['SLACK_BOT_ID'])
  if status['presence'] == 'away'
    instance = Standbot::Bot.instance
    instance.stop!
    instance.start_async
    json(message: 'Restarting bot successfully')
    return
  end
  json(message: status)
end
And the stack trace is this:
2018-07-18 06:31:00 default[20180716t093850] "GET /api/v2/bot_presence" 500
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | 2018-07-18 06:36:00 - Celluloid::DeadActorError - attempted to call a dead actor: connected?:
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/celluloid-0.17.3/lib/celluloid/proxy/sync.rb:9:in `method_missing'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:68:in `started?'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-client-0.11.1/lib/slack/real_time/client.rb:63:in `stop!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/slack-ruby-bot-0.11.1/lib/slack-ruby-bot/server.rb:48:in `stop!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/standweb/routes/api.rb:92:in `block (3 levels) in <class:Web>'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1635:in `block in compile!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (3 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1011:in `route_eval'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:992:in `block (2 levels) in route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1040:in `block in process_route'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `catch'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1038:in `process_route'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:990:in `block in route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `each'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:989:in `route!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1097:in `block in dispatch!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1094:in `dispatch!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `block in call!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `block in invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `catch'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1076:in `invoke'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:924:in `call!'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:913:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/xss_header.rb:18:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/path_traversal.rb:16:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/json_csrf.rb:26:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/base.rb:50:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-protection-2.0.3/lib/rack/protection/frame_options.rb:31:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:232:in `context'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/session/abstract/id.rb:226:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/logger.rb:15:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/common_logger.rb:33:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:231:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:224:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/rack-2.0.5/lib/rack/head.rb:12:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:194:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/sinatra-2.0.3/lib/sinatra/base.rb:1958:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/configuration.rb:225:in `call'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:632:in `handle_request'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:446:in `process_client'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/server.rb:306:in `block in run'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | /app/vendor/bundle/ruby/2.5.0/gems/puma-3.11.4/lib/puma/thread_pool.rb:120:in `block in spawn_thread'
2018-07-18 06:36:00 default[20180716t093850] 06:36:00 web.1 | 10.0.0.1, 10.0.0.1, 2002:ae8:f509:: - - [18/Jul/2018:06:36:00 +0000] "GET /api/v2/bot_presence HTTP/1.1" 500 30 0.1545
I failed to find how to "see" this in celluloid-io and it's apparently no longer maintained. Maybe someone can add support for the most-bestest async io library du jour? https://github.com/slack-ruby/slack-ruby-client/issues/210
My bot is also hosted on DigitalOcean. I had a restart once a day, which seemed to fix it for quite a while, but recently the problem has been getting worse. I just restarted the bot in the afternoon only to find it offline in the evening. Is there no way we can take an active approach to determining whether the connection is still alive by sending our own PING to the Slack server every XX minutes and expecting some response back? If there is no response, just restart the connection? Wouldn't that be a workaround if the library is not properly discovering a disconnect?
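For what it's worth, here is a rough sketch of that idea (hedged: it assumes the real-time client exposes the RTM-level ping method and dispatches the server's pong reply as a :pong event; restart_bot! is a hypothetical helper and the intervals are arbitrary):

last_pong = Time.now

client.on :pong do |_data|
  last_pong = Time.now                          # server answered our RTM-level ping
end

Thread.new do
  loop do
    sleep 60
    begin
      client.ping                               # assumed: sends {"type":"ping"} over the websocket
    rescue StandardError
      restart_bot!                              # hypothetical: sending failed outright, restart
    end
    restart_bot! if Time.now - last_pong > 180  # hypothetical: no pong seen for 3 minutes, restart
  end
end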
I ended up doing it like this, and just ping this endpoint every 5 minutes.
get '/bot_presence' do
  client = ::Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])
  # Ask the Web API whether the bot user currently shows as present.
  status = client.users_getPresence(user: ENV['SLACK_BOT_ID'])
  if status['presence'] == 'away'
    # The bot has silently dropped off: start a fresh async client
    # (no stop! here, since that raised DeadActorError above).
    instance = Standbot::Bot.instance
    instance.start_async
    json(message: 'Restarting bot successfully')
    return
  end
  json(message: status)
end
@Kyrremann Seems like that may work, but there will still be instances of the object that are still "alive," but not connected to Slack, right?
Over the past week, I've seen my bot be down for some time and magically reconnect. There "seems" to be some improvement.
You should take a look at the workarounds that were introduced in https://github.com/slack-ruby/slack-ruby-bot/issues/107, i.e. mine: https://github.com/slack-ruby/slack-ruby-bot/issues/107#issuecomment-398120946.
It would be most helpful if someone took a stab at #210.
I have spent some time on this but have not made any progress. My questions and comments so far are in https://github.com/celluloid/celluloid-io/issues/187#issuecomment-414131467.
Also opened https://github.com/faye/websocket-driver-ruby/issues/61; I am thinking the driver may be the problem here.
Debugging socket data with:
require 'slack/real_time/concurrency/celluloid'
require 'celluloid/internals/logger'

module Slack
  module RealTime
    module Concurrency
      module Celluloid
        class Socket < Slack::RealTime::Socket
          include ::Celluloid::Internals::Logger

          def log_info(message)
            if message == @message
              @count += 1
            else
              if @message && @message.is_a?(Array) && @count && @count > 1
                logger.info(@message.concat(["repeated #{@count} times"]))
              end
              @count = 1
              @message = message
              logger.info(message)
            end
          end

          def build_driver
            @logger = Logger.new(STDOUT)
            @logger.level = Logger::INFO
            ::WebSocket::Driver.client(self).tap do |ws|
              ws.on :open do
                log_info [ws.object_id, :server_open]
              end
              ws.on :message do |message|
                log_info [ws.object_id, :server_message, message.data]
              end
              ws.on :close do |close|
                log_info [ws.object_id, :server_close, close.code, close.reason]
              end
              ws.on :error do |error|
                log_info [ws.object_id, :server_error, error.message]
              end
              ws.on :ping do |ping|
                log_info [ws.object_id, :server_ping, ping.data]
              end
              ws.on :pong do |pong|
                log_info [ws.object_id, :server_pong, pong.data]
              end
            end
          end
        end
      end
    end
  end
end
If you're using slack-ruby-bot-server I have a PR with a ping worker that lives within the bot and avoids spawning new server instances in https://github.com/slack-ruby/slack-ruby-bot-server/pull/74. Please give it a try.
So I implemented async-websocket support with some help from @ioquatix in https://github.com/slack-ruby/slack-ruby-client/pull/219 and I'm still seeing the same behavior where the bot doesn't notice it has been disconnected, so I think something else is going on here beyond finger-pointing Celluloid :(
Can you tell me how to repro the issue and I will debug it on my end.
@dblock thanks for all the work on this!
Looking through the backtraces, it seems like current_actor from Celluloid hasn't been "recycled"/restarted. Perhaps rescuing this exception raised by Celluloid and recursively calling the run_loop could work?
Maybe the same logic could work with async-websocket as well.
Sorry, other than this I don't have any concrete ideas of how to fix it 🤔
MAX_RETRIES = 100

def run_loop(retries = 0)
  raise unless retries < MAX_RETRIES
  @closing = false
  @socket = build_socket
  @connected = @socket.connect
  driver.start
  loop { read } if socket
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE => e
  logger.debug("#{self.class}##{__method__}") { e }
  driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection')) unless @closing
ensure
  begin
    current_actor.terminate if current_actor.alive?
  rescue Celluloid::DeadActorError
    run_loop(retries + 1)
  rescue StandardError
    nil
  end
end
@ioquatix https://github.com/slack-ruby/slack-ruby-client/issues/208#issuecomment-397131196 is probably the simplest way to reproduce
@dblock is there any way we can make an integration test which reproduces the behaviour? Even if we just have to leave it running for a few days?
Probably. I would start multiple instances of the client and send messages (DMs?) via Slack, making sure each instance can see them. Or just make sure you're getting server pings after a certain duration. The issue is that we're just seeing a bunch of disconnected clients, for no apparent reason, that don't signal their death in any way; they just sit there.
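A rough sketch of what such a long-running check could look like (hedged: the client count, the silence threshold, and the idea of treating prolonged silence as a suspected silent disconnect are assumptions, not an existing test):

require 'slack-ruby-client'

Slack.configure { |config| config.token = ENV['SLACK_API_TOKEN'] }

# Open several RTM clients and record the last time each one saw any event.
last_seen = {}

clients = Array.new(5) do |i|
  last_seen[i] = Time.now
  Slack::RealTime::Client.new.tap do |client|
    client.on :hello do
      puts "client #{i} connected"
      last_seen[i] = Time.now
    end
    client.on :message do |_data|
      last_seen[i] = Time.now
    end
    client.start_async
  end
end

# Keep references to the clients alive and flag any that have gone
# completely silent for 10 minutes (assumed threshold).
loop do
  sleep 60
  last_seen.each do |i, seen_at|
    silent_for = Time.now - seen_at
    warn "client #{i} silent for #{silent_for.to_i}s" if silent_for > 600
  end
end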
The disconnect behavior stopped across all my bots a couple of days ago at the same time (exact last timestamps look very similar: 2018-09-01T06:09:00.720693, 2018-09-01T06:07:52.146347, UTC).
Similar scenario. Haven't needed daily restart since: 2018-08-31T15:40:00
Perhaps this was on the Slack side?
Started happening again this AM EST.
I'm really bummed about this issue.
My plan to work around it is to 1) tell my team that when the chatbot is silent they need to restart it, and 2) watch this GH issue; when it closes, inform my team that we shouldn't need to restart our bots anymore. I decided to do that instead of adding automatic "restart process every N minutes" code because I have the perception that this is a real problem that Slack is eventually going to address (hopefully in the short term).
I wanted to see if others, especially Slack, shared that view.
It sounds like Slack (the company) has acknowledged the presence of this Slack-side bug via @aoberoi's contributions to this thread. Ankur, is that accurate? Is there another public place where this issue is being tracked, or is this the best one to date?
To add confidence that this is in fact a Slack-side issue, as opposed to something specific to the "slack-ruby" gems like slack-ruby-client, I played around with a really minimal old gem that talks to the Slack Real Time API, slack-rtmapi, to see if I could reproduce the problem. I did reproduce the problem: it went silent after a couple of days too.
🤞 that Slack/Ankur has some feedback on this to help me and others here understand when we might be able to expect a server-side fix that will resolve this issue (and likely many others spread across other GitHub projects that depend on the Slack Real Time API).
Thanks!
We're in the exact same boat as you are @joegoggins.
There's a really thoughtful discussion on this in https://github.com/faye/websocket-driver-ruby/issues/61#issuecomment-419256695. The slack-ruby-client library will help via https://github.com/slack-ruby/slack-ruby-client/issues/223 (volunteers to implement welcome), but in the meantime you should/could add a status monitoring thread from https://github.com/slack-ruby/slack-ruby-bot-server/blob/master/lib/slack-ruby-bot-server/ping.rb if you're not using that library.
Awesome! Thanks for the update @dblock, I appreciate the expanded list of technical options here and added context about the nature of this problem. I will stick with my manual work-around for now, but watch these issues closely and likely adopt new tooling when it's in beta or merged/released. 👌
We recently worked around this bug by implementing a ping as suggested above.
Basically we start the Slack real-time client asynchronously, and then separately use the Slack web client to periodically check whether our bot is still online. If the bot ever goes offline, we shut down the real-time client and reconnect it.
Over the past week our bot has gone offline three times and been successfully reconnected by this technique. It's not ideal but it seems to be an effective workaround for us.
Code is here: https://github.com/carbonfive/socrates/pull/4/files
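For readers not using that repository, a minimal sketch of the same watchdog idea (hedged: the interval, the environment variable names, and restart_realtime_client! are assumptions, not the linked implementation):

require 'slack-ruby-client'

web = Slack::Web::Client.new(token: ENV['SLACK_API_TOKEN'])

# Periodically check the bot's presence via the Web API; if it shows as
# away, tear down and restart the async real-time client.
Thread.new do
  loop do
    sleep 300
    begin
      presence = web.users_getPresence(user: ENV['SLACK_BOT_ID'])
      restart_realtime_client! if presence['presence'] == 'away' # hypothetical helper
    rescue Slack::Web::Api::Errors::SlackError => e
      warn "presence check failed: #{e.message}"
    end
  end
end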
Our current workaround for this is to fire off a dyno restart on Heroku. It goes off every hour, so we have about 15 seconds of downtime every hour, and we have yet to see the disconnects since doing this. Known downtime vs. unknown is always better for us. For larger, more interactive teams this may not be ideal. We'll be looking into @mattbrictson's solution over the coming weeks.
FYI the ping worker in slack-ruby-bot-server is stable and production tested.
If you want to experiment try https://github.com/slack-ruby/slack-ruby-client/pull/226 that is being coded by someone working at Slack, which will eventually get merged and is implementing a proper ping in slack-ruby-client. Would love more people testing that implementation before we release anything.
I've merged a lower level ping implementation for async-websocket from https://github.com/slack-ruby/slack-ruby-client/pull/226, please give it a try. Celluloid and EM should be fairly straightforward from there if anyone wants to give it a shot.
@dblock I've tried out the code on master, but it doesn't quite solve the problem. I'm still having my bot crash with
Slack::RealTime::Client::ClientNotStartedError
/app/vendor/bundle/ruby/2.3.0/bundler/gems/slack-ruby-client-8702c27c749f/lib/slack/real_time/client.rb:108:in `run_ping!'
/app/vendor/bundle/ruby/2.3.0/bundler/gems/slack-ruby-client-8702c27c749f/lib/slack/real_time/concurrency/async.rb:23:in `block (2 levels) in start_async'
/app/vendor/bundle/ruby/2.3.0/gems/timers-4.1.2/lib/timers/timer.rb:98:in `fire'
I'm guessing that started? is false for whatever reason here, so it's re-raising the error. I don't know a ton about this code -- restart_async certainly sounds like it would be asynchronous; perhaps line 112 is firing before the socket has finished restarting/connecting? (Which the tests would not catch, since they just mock the socket.)
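If that guess is right, one band-aid would be to treat ClientNotStartedError during a ping tick as "restart still in flight" and skip that tick rather than re-raising (a hedged sketch against the names in the trace above; run_ping_guarded! is hypothetical, not a proposed patch):

def run_ping_guarded!
  run_ping! # the library method from the trace above
rescue Slack::RealTime::Client::ClientNotStartedError
  # Hypothetical: assume the socket simply hasn't finished restarting yet
  # and try again on the next timer tick instead of crashing the timer.
  logger.debug('Skipping ping: client not (re)started yet.')
end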
In the last week my Slack bot has started to disconnect from our Slack team. There are no logs of the disconnect, and no stack traces. I've started running the bot in debug mode, and the last thing it logged before it disconnected was the following.
I'm not sure how to debug this myself, and I have no idea what is wrong.
My setup is running on Google App Engine, with one process for a Sinatra web app and a different process running the Slack bot. Neither of the processes stops, but the bot is offline in Slack.
The code for the project is available here: https://github.com/navikt/standbot
Any help would be welcome!