gdhardy1 opened 3 weeks ago
Thanks in advance for investigating. I'm happy to provide any additional information. I'd appreciate any temporary workaround ideas in the short term. Being unexpectedly rate limited during business hours would be a huge disruption as the app where I experienced this is in the critical path for users seeking engineering support from my organization.
I found some time to make a reproduction experiment for what I observed.
The issue seems to lie within `@slack/socket-mode`. When `SlackWebSocket` encounters this 409 error, it calls `disconnect` and then `terminate`, which emits a `close` event to the `SocketModeClient`, and the client is started again. I see that the `SocketModeClient` is internal to `SocketModeReceiver`, `autoReconnectEnabled` is `true` by default, and this option is not configurable via `AppOptions` or `SocketModeReceiverOptions`. Additionally, `isRecoverable` is `true` for rate-limited errors.

Given this, it seems that once a rate limit error is encountered, there isn't a way to recover, and the call to `apps.connections.open` will be retried indefinitely.
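To make that loop concrete, here is a simplified model of the behaviour just described. This is not the actual library code; every name, the error handling, and the 5-second delay are illustrative only.

```typescript
// Simplified model of the reconnect behaviour described above. NOT the real
// SocketModeClient implementation; all names and timings are illustrative.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function reconnectLoop(
  openConnection: () => Promise<string>,            // stands in for apps.connections.open
  connectWebSocket: (url: string) => Promise<void>, // may reject with an HTTP 409
): Promise<void> {
  const autoReconnectEnabled = true; // true by default and not overridable via AppOptions
  let consecutiveFailures = 0;

  while (autoReconnectEnabled) {
    try {
      const wssUrl = await openConnection();
      await connectWebSocket(wssUrl);
      consecutiveFailures = 0; // only reset on success
      return;
    } catch {
      // Rate-limit errors are treated as recoverable, so the loop never
      // gives up; each failure simply schedules another attempt.
      consecutiveFailures += 1;
      await sleep(5_000 * consecutiveFailures);
    }
  }
}
```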
Hey @gdhardy1, thank you so much for the excellent reproduction materials 🙏 It makes it so much easier to try to help when the problem is clearly demonstrated ❤️
I acknowledge that it is frustrating to see this affecting your business app. Let's see if we can figure out a workaround or how we should go about getting this fixed.
A few things that stand out to me immediately:

- The Slack tools, be it `@slack/socket-mode` or `@slack/web-api`, should be honouring any `Retry-After` headers in terms of reconnection attempts or rate limiting - are you not seeing that happen? Do you happen to have any debug logs for the 429 response?
- If you did have access to any of the options you linked to, be it for `socket-mode` or bolt `AppOptions` or `SocketModeReceiverOptions`, would that help you in the short-term? If so, that's an easy fix/feature we can release pretty quickly to, at least, stem the bleeding.

I wonder what the proper behaviour here is. It feels to me like it would be reasonable to have the socket connection attempt to be re-established in a (presumably?) temporary issue like an HTTP 409.

Do you have any other thoughts on what expected behaviour should be in the unfortunate circumstance that the socket endpoint is sending 409s?
> The Slack tools, be it `@slack/socket-mode` or `@slack/web-api`, should be honouring any `Retry-After` headers in terms of reconnection attempts or rate limiting - are you not seeing that happen?
Not exactly. I believe there are two issues that lead to multiple retries being executed simultaneously.
**Issue 1**

When the socket connection fails with the `409` error, both the `error` and `close` events are emitted from the `WebSocket`. This leads to `SocketModeClient` calling `delayReconnectAttempt` twice for a single failure, with delays of `this.clientPingTimeoutMS * this.numOfConsecutiveReconnectionFailures` and `this.clientPingTimeoutMS * this.numOfConsecutiveReconnectionFailures + 1`. This happens because both the `error` and `close` event handlers call `SocketModeClient.disconnect`.

To avoid this, `SocketModeClient` should check if a retry is pending before calling `delayReconnectAttempt`.
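As a rough illustration of that kind of guard (not the actual `SocketModeClient` source; `retryPending` and the class name are hypothetical):

```typescript
// Illustrative sketch of deduplicating reconnect scheduling when both the
// "error" and "close" handlers fire for the same failure. Not the actual
// SocketModeClient implementation; retryPending is a hypothetical flag.
class ReconnectScheduler {
  private retryPending = false;

  // Called from both the error handler and the close handler.
  scheduleReconnect(delayMs: number, reconnect: () => void): void {
    if (this.retryPending) {
      return; // a retry is already scheduled for this failure; do nothing
    }
    this.retryPending = true;
    setTimeout(() => {
      this.retryPending = false;
      reconnect();
    }, delayMs);
  }
}
```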
**Issue 2**

`numOfConsecutiveReconnectionFailures` is reset when `apps.connections.open` succeeds. This can lead to retries being scheduled earlier than they should be.
Consider the scenario where:

- `apps.connections.open` succeeds and `numOfConsecutiveReconnectionFailures` is set to `0`

The retry timing will be:

| Time (ms) | Event |
| --- | --- |
| 0 | `apps.connections.open` succeeds; socket connect fails |
| 5000 | Retry 1 starts; `apps.connections.open` succeeds; socket connection fails |
| 10000 | Retry 2 starts AND Retry 3 starts |
| 15000 | Retry 4 starts |
You can replicate this by running the experiment with `MAX_CALLS=4`.
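To see why the early reset matters, here is a toy calculation of the delay the client would compute in each case. It is purely illustrative; the 5000 ms value and the formula mirror the `clientPingTimeoutMS * numOfConsecutiveReconnectionFailures` expression above, not the library's exact code path.

```typescript
// Toy model: resetting the failure counter whenever apps.connections.open
// succeeds keeps the delay flat, whereas resetting only once the socket is
// actually established lets the backoff grow.
const clientPingTimeoutMS = 5000;
const nextDelay = (consecutiveFailures: number) => clientPingTimeoutMS * consecutiveFailures;

// Counter reset on every successful apps.connections.open (current behaviour):
// each attempt reaches the API, so the counter is back at 1 when the socket
// fails, and the wait is always 5000 ms -- the schedule never backs off.
for (let attempt = 1; attempt <= 4; attempt++) {
  const failures = 1; // reset to 0 by the successful open, then bumped by the socket failure
  console.log(`attempt ${attempt}: wait ${nextDelay(failures)} ms`);
}

// Counter reset only when the socket is established (suggested behaviour):
// the wait grows 5000, 10000, 15000, 20000 ms across consecutive failures.
let failures = 0;
for (let attempt = 1; attempt <= 4; attempt++) {
  failures += 1;
  console.log(`attempt ${attempt}: wait ${nextDelay(failures)} ms`);
}
```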
I suspect that either of the following would be preferable:

- Reset `numOfConsecutiveReconnectionFailures` only when the socket connection is actually established, OR
- Use separate `delayReconnectAttempt` retry mechanisms for the `apps.connections.open` call and the WebSocket connection failure modes

However, that might not be necessary as long as Issue 1 is addressed.
> Do you happen to have any debug logs for the 429 response?
I wasn't able to capture debug-level logs when I encountered this in production, but the reproduction experiment is pretty close, if not identical, I think.
> If you did have access to any of the options you linked to, be it for socket-mode or bolt AppOptions or SocketModeReceiverOptions, would that help you in the short-term?
Actually, after trying, I don't think so. I thought setting `autoReconnectEnabled` to `false` in conjunction with `rejectRateLimitedCalls` on the `WebClient` might help me catch and implement my own retry logic, but that seemed like more effort than just fixing the root problem.
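For anyone considering that route, a rough sketch of what the manual retry could look like follows. It assumes `SocketModeClient` accepts `autoReconnectEnabled` and a `clientOptions` object forwarded to its internal `WebClient`, and that it emits a `disconnected` event; verify these against the installed `@slack/socket-mode` version.

```typescript
import { SocketModeClient } from '@slack/socket-mode';

// Sketch of the manual-retry workaround described above. The event name,
// option names, and backoff values are assumptions, not verified library API.
const socketClient = new SocketModeClient({
  appToken: process.env.SLACK_APP_TOKEN!,
  autoReconnectEnabled: false,
  clientOptions: { rejectRateLimitedCalls: true },
});

let attempts = 0;
const MAX_ATTEMPTS = 5;

socketClient.on('disconnected', async () => {
  if (attempts >= MAX_ATTEMPTS) {
    console.error('Giving up after repeated socket failures');
    return;
  }
  attempts += 1;
  // Back off before reconnecting instead of hammering apps.connections.open.
  await new Promise((resolve) => setTimeout(resolve, 5_000 * attempts));
  await socketClient.start();
});

(async () => {
  await socketClient.start();
  attempts = 0; // connection established; reset the manual counter
})();
```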
Thanks for the analysis. I've moved this issue over to the node-slack-sdk (the home of the `socket-mode` client); will sift through everything you've posted and see if I can capture the issue as a test to help frame the discussion.
Separate issue: I think I will riff on your repro a bit to add `Retry-After` header responses to the `apps.connections.open` endpoint to see if retry handling works as expected. Will see what kind of responses that API raises in a rate limit scenario.
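Something along these lines for the mocked endpoint (a sketch only; the port, path, and response body are placeholders rather than the real Slack API behaviour):

```typescript
import http from 'node:http';

// Sketch of a mock apps.connections.open endpoint that answers with a 429
// and a Retry-After header, for exercising the client's rate-limit handling.
const server = http.createServer((req, res) => {
  if (req.url?.startsWith('/api/apps.connections.open')) {
    res.writeHead(429, {
      'Content-Type': 'application/json',
      'Retry-After': '2', // seconds the client should wait before retrying
    });
    res.end(JSON.stringify({ ok: false, error: 'ratelimited' }));
    return;
  }
  res.writeHead(404).end();
});

server.listen(8080, () => {
  console.log('Mock Slack API listening on http://localhost:8080');
});
```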
Hello, I recently encountered the same issue.
When running two instances of a Bolt app, one of the apps would unexpectedly disconnect from the socket connection. During the reconnection process, it would fail repeatedly, doubling the reconnection attempts each time. This cycle continued until it led to an Out of Memory error, eventually causing the app to crash.
```
[ERROR] socket-mode:SocketModeClient:0 WebSocket error! Error: Failed to send message on websocket: Unexpected server response: 409
[ERROR] socket-mode:SlackWebSocket:1843 WebSocket error occurred: Unexpected server response: 409
[WARN] bolt-app http request failed A rate limit was exceeded (url: apps.connections.open, retry-after: 2)
```
Update here: I have escalated this issue to the real-time server backend team to raise the issue related to unexpected 409 responses from that server.
I am still working on addressing the issue of what this socket-mode library should do in this situation and how to prevent it from getting into a rate-limit situation with the `apps.connections.open` API.
@bitofsky upgrading to `@slack/bolt` `4.1.0` should resolve the Out of Memory error you're experiencing. I experienced the same Out of Memory error running `3.21.4`.
@matthewhenry1 Thanks for the response. I was running on `@slack/bolt` `4.0.1`. I'll upgrade to `4.1.0` and test if the issue persists.
The underlying issue still exists in bolt 4.x; the only thing upgrading to 4.x may help with is if you experience crashes from a class of errors of the form "unhandled event X in state Y", e.g. "Unhandled event 'server explicit disconnect' in state 'connecting'." (as described in this comment).
Just to add some more info: I'm also experiencing this issue in a distributed setup, and I'm also sharing app tokens across environments (local, staging, live), which is maybe a bad idea. I'm consistently getting 409 errors. Looking at the logs, it seemingly starts with a random `409` response, then ends up in an out-of-control (seemingly exponential) chain reaction of reconnections - as others have seen. I might have to remove the integration as it's spamming our logs (and CPU resource) so much.

Given it's probably a distributed problem - could it be some connection limit issue? The original example of this issue uses 5 instances, which is less than the max 10 connection limit shown here: https://api.slack.com/apis/socket-mode#connections. Could this limit be the cause of the `409` errors, and could this limit be wrong in the docs - i.e. lower than it mentions? I'm fairly sure I don't have 10 instances in total at any time though.
Related question: Does anyone know if this limit is per app token or per app?
I'm using the latest version, `4.1.0`.
> @matthewhenry1 Thanks for the response. I was running on `@slack/bolt` `4.0.1`. I'll upgrade to `4.1.0` and test if the issue persists.

Issues are still occurring in version 4.1.0.
This does seem to be an infrastructural issue; from what I understand, the 409 response should be an internal-only signal for the real-time backend system to find another container to serve the connection request. For whatever reason (a recent change or bug), it is bubbling up to the public / to the apps. The infrastructure team is still investigating the cause.
IBM is hitting this issue attempting to upgrade a major Slack application. Our only workaround was to decrease the number of instances of the application to 1. That is an unacceptable workaround. The reason we are upgrading is because of `Unhandled event 'server explicit disconnect'` errors causing frequent, unexpected restarts of all the app instances. How can we proceed if web socket 1.x is a problem and web socket 2.x is a problem? We need fixes or workarounds.
We just experienced another outage today. Luckily, we had alerting set up so that we could mitigate within roughly 10-15 minutes. But honestly, even that amount of downtime is tremendously painful for our users.
In case it's helpful with isolating a change that introduced the bug, we started observing the 409 errors on Oct 28th or 29th. Before that time, these errors virtually never occurred.
What makes this especially painful is that recovery pretty much requires human intervention and downtime is guaranteed:
I also encountered the same issue, because I have two instances of a Bolt app. How can I resolve this problem, or how can I detect the error and disconnect the Bolt app?
```
[ERROR] socket-mode:SocketModeClient:0 WebSocket error! Error: Failed to send message on websocket: Unexpected server response: 409
[ERROR] socket-mode:SlackWebSocket:1843 WebSocket error occurred: Unexpected server response: 409
[WARN] bolt-app http request failed A rate limit was exceeded (url: apps.connections.open, retry-after: 2)
```
Update: the backend team is still investigating the 409 issue. There has been some progress, but it is still being worked on.
I have made some progress on mitigating the double-reconnect logic identified by @gdhardy1 ("Issue 1" in their comment here). Will share a draft PR soon.
Still a few other things need to be untangled to make more meaningful progress: how should the socket library react to a 409 generally speaking? Should reconnects be disabled altogether, even if the consumer specifically turned reconnections on? Additionally, proper handling of WS endpoint reconnection rate limiting needs to be explored.
I have a draft PR that at least depicts the issue in an integration test in this repo: https://github.com/slackapi/node-slack-sdk/pull/2099
Socket mode receiver fails to make a successful connection due to a `WebSocket error occurred: Unexpected server response: 409` error. Retries after this error occurs may eventually lead to a successful connection or, worse, rate limiting from Slack servers on the `apps.connections.open` call.

`@slack/bolt` version: `4.0.1`

Your `App` and Receiver Configuration: Socket Mode `2.0.2`

Node.js runtime version: `v20.17.0`
Steps to reproduce:

This error seems to reproduce only when launching in a distributed environment (i.e. launching 2 or more instances of the app). Nothing special is being done on start up. It is minimally reproducible using the example from the Bolt JS Getting Started docs as a basis (see the sketch below this list).

- Set `process.env.PORT` to a different number in each shell
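For reference, the app used as the basis is essentially the Getting Started example; this is a minimal sketch, assuming `SLACK_BOT_TOKEN`, `SLACK_APP_TOKEN`, and `PORT` are supplied via the environment.

```typescript
import { App } from '@slack/bolt';

// Minimal Socket Mode app, essentially the Bolt JS Getting Started example.
const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  appToken: process.env.SLACK_APP_TOKEN,
  socketMode: true,
});

(async () => {
  // Each instance is started in its own shell with a different PORT value.
  await app.start(Number(process.env.PORT) || 3000);
  console.log('⚡️ Bolt app is running!');
})();
```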
Expected result:

Each instance of the app should start up successfully without the `WebSocket error occurred: Unexpected server response: 409` error.

Actual result:

First we see the Socket Mode client successfully initialized. Then we see the `409` error and the following logs repeat over and over again until:

Additional Notes
It seems that a successfully established connection does not mean the app is in a "safe" state. While running, I've observed both `Primary WebSocket error occurred: Unexpected server response: 409` and `Secondary WebSocket error occurred: Unexpected server response: 409` errors as well. The behavior is basically identical to what is described above.

I've also observed this on `@slack/bolt-js` versions as early as `3.20.0`.