gdhardy1 opened 3 weeks ago
Thanks in advance for investigating. I'm happy to provide any additional information. I'd appreciate any temporary workaround ideas in the short term. Being unexpectedly rate limited during business hours would be a huge disruption as the app where I experienced this is in the critical path for users seeking engineering support from my organization.
I found some time to make a reproduction experiment for what I observed.
The issue seems to lie within `@slack/socket-mode`. When `SlackWebSocket` encounters this 409 error, it calls `disconnect` and then `terminate`, which emits a `close` event to the `SocketModeClient`, and the client is started again. I see that the `SocketModeClient` is internal to `SocketModeReceiver`, `autoReconnectEnabled` is `true` by default, and this option is not configurable via `AppOptions` or `SocketModeReceiverOptions`. Additionally, `isRecoverable` is `true` for rate-limited errors.

Given this, it seems that once a rate limit error is encountered, there isn't a way to recover, and the call to `apps.connections.open` will be retried indefinitely.
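To make that loop concrete, here is a simplified model of the behaviour just described. This is not the actual library code; every name, the error handling, and the 5-second delay are illustrative only.

```typescript
// Simplified model of the reconnect behaviour described above. NOT the real
// SocketModeClient implementation; all names and timings are illustrative.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function reconnectLoop(
  openConnection: () => Promise<string>,            // stands in for apps.connections.open
  connectWebSocket: (url: string) => Promise<void>, // may reject with an HTTP 409
): Promise<void> {
  const autoReconnectEnabled = true; // true by default and not overridable via AppOptions
  let consecutiveFailures = 0;

  while (autoReconnectEnabled) {
    try {
      const wssUrl = await openConnection();
      await connectWebSocket(wssUrl);
      consecutiveFailures = 0; // only reset on success
      return;
    } catch {
      // Rate-limit errors are treated as recoverable, so the loop never
      // gives up; each failure simply schedules another attempt.
      consecutiveFailures += 1;
      await sleep(5_000 * consecutiveFailures);
    }
  }
}
```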
Hey @gdhardy1, thank you so much for the excellent reproduction materials 🙏 It makes it so much easier to try to help when the problem is clearly demonstrated ❤️
I acknowledge that it is frustrating to see this affecting your business app. Let's see if we can figure out a workaround or how we should go about getting this fixed.
A few things that stand out to me immediately:

- The Slack tools, be it `@slack/socket-mode` or `@slack/web-api`, should be honouring any `Retry-After` headers in terms of reconnection attempts or rate limiting - are you not seeing that happen? Do you happen to have any debug logs for the 429 response?
- If you did have access to any of the options you linked to, be it for `socket-mode` or bolt `AppOptions` or `SocketModeReceiverOptions`, would that help you in the short-term? If so, that's an easy fix/feature we can release pretty quickly to, at least, stem the bleeding.

I wonder what the proper behaviour here is. It feels to me like it would be reasonable to have the socket connection attempt to be re-established in a (presumably?) temporary issue like an HTTP 409.

Do you have any other thoughts on what expected behaviour should be in the unfortunate circumstance that the socket endpoint is sending 409s?
> The Slack tools, be it `@slack/socket-mode` or `@slack/web-api`, should be honouring any `Retry-After` headers in terms of reconnection attempts or rate limiting - are you not seeing that happen?
Not exactly. I believe there are two issues that lead to multiple retries being executed simultaneously.
**Issue 1**

When the socket connection fails with the `409` error, both the `error` and `close` events are emitted from the `WebSocket`. This leads to `SocketModeClient` calling `delayReconnectAttempt` twice for a single failure, with delays of `this.clientPingTimeoutMS * this.numOfConsecutiveReconnectionFailures` and `this.clientPingTimeoutMS * this.numOfConsecutiveReconnectionFailures + 1`. This happens because both the `error` and `close` event handlers call `SocketModeClient.disconnect`.

To avoid this, `SocketModeClient` should check if a retry is pending before calling `delayReconnectAttempt`.
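As a rough illustration of that kind of guard (not the actual `SocketModeClient` source; `retryPending` and the class name are hypothetical):

```typescript
// Illustrative sketch of deduplicating reconnect scheduling when both the
// "error" and "close" handlers fire for the same failure. Not the actual
// SocketModeClient implementation; retryPending is a hypothetical flag.
class ReconnectScheduler {
  private retryPending = false;

  // Called from both the error handler and the close handler.
  scheduleReconnect(delayMs: number, reconnect: () => void): void {
    if (this.retryPending) {
      return; // a retry is already scheduled for this failure; do nothing
    }
    this.retryPending = true;
    setTimeout(() => {
      this.retryPending = false;
      reconnect();
    }, delayMs);
  }
}
```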
**Issue 2**

`numOfConsecutiveReconnectionFailures` is reset when `apps.connections.open` succeeds. This can lead to retries being scheduled earlier than they should be.
Consider the scenario where:

- `apps.connections.open` succeeds and `numOfConsecutiveReconnectionFailures` is set to `0`

The retry timing will be:

| Time (ms) | Event |
| --- | --- |
| 0 | `apps.connections.open` succeeds; socket connect fails |
| 5000 | Retry 1 starts; `apps.connections.open` succeeds; socket connection fails |
| 10000 | Retry 2 starts AND Retry 3 starts |
| 15000 | Retry 4 starts |
You can replicate this by running the experiment with `MAX_CALLS=4`.
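To see why the early reset matters, here is a toy calculation of the delay the client would compute in each case. It is purely illustrative; the 5000 ms value and the formula mirror the `clientPingTimeoutMS * numOfConsecutiveReconnectionFailures` expression above, not the library's exact code path.

```typescript
// Toy model: resetting the failure counter whenever apps.connections.open
// succeeds keeps the delay flat, whereas resetting only once the socket is
// actually established lets the backoff grow.
const clientPingTimeoutMS = 5000;
const nextDelay = (consecutiveFailures: number) => clientPingTimeoutMS * consecutiveFailures;

// Counter reset on every successful apps.connections.open (current behaviour):
// each attempt reaches the API, so the counter is back at 1 when the socket
// fails, and the wait is always 5000 ms -- the schedule never backs off.
for (let attempt = 1; attempt <= 4; attempt++) {
  const failures = 1; // reset to 0 by the successful open, then bumped by the socket failure
  console.log(`attempt ${attempt}: wait ${nextDelay(failures)} ms`);
}

// Counter reset only when the socket is established (suggested behaviour):
// the wait grows 5000, 10000, 15000, 20000 ms across consecutive failures.
let failures = 0;
for (let attempt = 1; attempt <= 4; attempt++) {
  failures += 1;
  console.log(`attempt ${attempt}: wait ${nextDelay(failures)} ms`);
}
```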
I suspect that either of the following would be preferable:

- Reset `numOfConsecutiveReconnectionFailures` only when the socket connection is actually established, OR
- Use separate `delayReconnectAttempt` retry mechanisms for the `apps.connections.open` call and the WebSocket connection failure modes

However, that might not be necessary as long as Issue 1 is addressed.
> Do you happen to have any debug logs for the 429 response?
I wasn't able to capture debug-level logs when I encountered this in production, but the reproduction experiment is pretty close, if not identical, I think.
> If you did have access to any of the options you linked to, be it for socket-mode or bolt AppOptions or SocketModeReceiverOptions, would that help you in the short-term?
Actually, after trying, I don't think so. I thought setting `autoReconnectEnabled` to `false` in conjunction with `rejectRateLimitedCalls` on the `WebClient` might help me catch and implement my own retry logic, but that seemed like more effort than just fixing the root problem.
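For anyone considering that route, a rough sketch of what the manual retry could look like follows. It assumes `SocketModeClient` accepts `autoReconnectEnabled` and a `clientOptions` object forwarded to its internal `WebClient`, and that it emits a `disconnected` event; verify these against the installed `@slack/socket-mode` version.

```typescript
import { SocketModeClient } from '@slack/socket-mode';

// Sketch of the manual-retry workaround described above. The event name,
// option names, and backoff values are assumptions, not verified library API.
const socketClient = new SocketModeClient({
  appToken: process.env.SLACK_APP_TOKEN!,
  autoReconnectEnabled: false,
  clientOptions: { rejectRateLimitedCalls: true },
});

let attempts = 0;
const MAX_ATTEMPTS = 5;

socketClient.on('disconnected', async () => {
  if (attempts >= MAX_ATTEMPTS) {
    console.error('Giving up after repeated socket failures');
    return;
  }
  attempts += 1;
  // Back off before reconnecting instead of hammering apps.connections.open.
  await new Promise((resolve) => setTimeout(resolve, 5_000 * attempts));
  await socketClient.start();
});

(async () => {
  await socketClient.start();
  attempts = 0; // connection established; reset the manual counter
})();
```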
Thanks for the analysis. I've moved this issue over to the node-slack-sdk (the home of the `socket-mode` client); will sift through everything you've posted and see if I can capture the issue as a test to help frame the discussion.
Separate issue: I think I will riff on your repro a bit to add `Retry-After` header responses to the `apps.connections.open` endpoint to see if retry handling works as expected. Will see what kind of responses that API raises in a rate limit scenario.
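Something along these lines for the mocked endpoint (a sketch only; the port, path, and response body are placeholders rather than the real Slack API behaviour):

```typescript
import http from 'node:http';

// Sketch of a mock apps.connections.open endpoint that answers with a 429
// and a Retry-After header, for exercising the client's rate-limit handling.
const server = http.createServer((req, res) => {
  if (req.url?.startsWith('/api/apps.connections.open')) {
    res.writeHead(429, {
      'Content-Type': 'application/json',
      'Retry-After': '2', // seconds the client should wait before retrying
    });
    res.end(JSON.stringify({ ok: false, error: 'ratelimited' }));
    return;
  }
  res.writeHead(404).end();
});

server.listen(8080, () => {
  console.log('Mock Slack API listening on http://localhost:8080');
});
```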
Hello, I recently encountered the same issue.
When running two instances of a Bolt app, one of the apps would unexpectedly disconnect from the socket connection. During the reconnection process, it would fail repeatedly, doubling the reconnection attempts each time. This cycle continued until it led to an Out of Memory error, eventually causing the app to crash.
```
[ERROR] socket-mode:SocketModeClient:0 WebSocket error! Error: Failed to send message on websocket: Unexpected server response: 409
[ERROR] socket-mode:SlackWebSocket:1843 WebSocket error occurred: Unexpected server response: 409
[WARN] bolt-app http request failed A rate limit was exceeded (url: apps.connections.open, retry-after: 2)
```
Update here: I have escalated this issue to the real-time server backend team to raise the issue related to unexpected 409 responses from that server.
I am still working on addressing the issue of what this socket-mode library should do in this situation and how to prevent it from getting into a rate-limit situation with the `apps.connections.open` API.
@bitofsky upgrading to `@slack/bolt` `4.1.0` should resolve the Out of Memory error you're experiencing. I experienced the same Out of Memory error running `3.21.4`.
@matthewhenry1 Thanks for the response. I was running on `@slack/bolt` `4.0.1`. I'll upgrade to `4.1.0` and test if the issue persists.
The underlying issue still exists in bolt 4.x; the only thing upgrading to 4.x may help with is if you experience crashes from a class of errors of the form "unhandled event X in state Y", e.g. "Unhandled event 'server explicit disconnect' in state 'connecting'." (as described in this comment).
Just to add some more info: I'm also experiencing this issue in a distributed setup, and I'm also sharing app tokens across environments (local, staging, live), which is maybe a bad idea. I'm consistently getting 409 errors. Looking at the logs, it seemingly starts with a random `409` response, then ends up in an out-of-control (seemingly exponential) chain reaction of reconnections - as others have seen. I might have to remove the integration as it's spamming our logs (and CPU resource) so much.

Given it's probably a distributed problem - could it be some connection limit issue? The original example of this issue uses 5 instances, which is less than the max 10 connection limit shown here: https://api.slack.com/apis/socket-mode#connections. Could this limit be the cause of the `409` errors, and could this limit be wrong in the docs - i.e. lower than it mentions? I'm fairly sure I don't have 10 instances in total at any time though.
Related question: Does anyone know if this limit is per app token or per app?
I'm using the latest version, `4.1.0`.
> @matthewhenry1 Thanks for the response. I was running on `@slack/bolt` `4.0.1`. I'll upgrade to `4.1.0` and test if the issue persists.

Issues are still occurring in version 4.1.0.
This does seem to be an infrastructural issue; from what I understand, the 409 response should be an internal-only signal for the real-time backend system to find another container to serve the connection request. For whatever reason (a recent change or bug), it is bubbling up to the public / to the apps. The infrastructure team is still investigating the cause.
IBM is hitting this issue attempting to upgrade a major Slack application. Our only workaround was to decrease the number of instances of the application to 1. That is an unacceptable workaround. The reason we are upgrading is because of `Unhandled event 'server explicit disconnect'` errors causing frequent, unexpected restarts of all the app instances. How can we proceed if web socket 1.x is a problem and web socket 2.x is a problem? We need fixes or workarounds.
We just experienced another outage today. Luckily, we had alerting set up so that we could mitigate within roughly 10-15 minutes. But honestly, even that amount of downtime is tremendously painful for our users.
In case it's helpful with isolating a change that introduced the bug, we started observing the 409 errors on Oct 28th or 29th. Before that time, these errors virtually never occurred.
What makes this especially painful is that recovery pretty much requires human intervention and downtime is guaranteed:
I also encountered the same issue, because I have two instances of a Bolt app. How can I resolve this problem, or how can I detect the error and disconnect the Bolt app?
```
[ERROR] socket-mode:SocketModeClient:0 WebSocket error! Error: Failed to send message on websocket: Unexpected server response: 409
[ERROR] socket-mode:SlackWebSocket:1843 WebSocket error occurred: Unexpected server response: 409
[WARN] bolt-app http request failed A rate limit was exceeded (url: apps.connections.open, retry-after: 2)
```
Update: the backend team is still investigating the 409 issue. There has been some progress, but it is still being worked on.
I have made some progress on mitigating the double-reconnect logic identified by @gdhardy1 ("Issue 1" in their comment here). Will share a draft PR soon.
Still a few other things need to be untangled to make more meaningful progress: how should the socket library react to a 409 generally speaking? Should reconnects be disabled altogether, even if the consumer specifically turned reconnections on? Additionally, proper handling of WS endpoint reconnection rate limiting needs to be explored.
I have a draft PR that at least depicts the issue in an integration test in this repo: https://github.com/slackapi/node-slack-sdk/pull/2099
Socket mode receiver fails to make a successful connection due to a `WebSocket error occurred: Unexpected server response: 409` error. Retries after this error occurs may eventually lead to a successful connection or, worse, rate limiting from Slack servers on the `apps.connections.open` call.

`@slack/bolt` version: `4.0.1`

Your `App` and Receiver Configuration: Socket Mode `2.0.2`

Node.js runtime version: `v20.17.0`
Steps to reproduce:

This error seems to reproduce only when launching in a distributed environment (i.e. launching 2 or more instances of the app). Nothing special is being done on start up. It is minimally reproducible using the example from the Bolt JS Getting Started docs as a basis (see the sketch below this list).

- Set `process.env.PORT` to a different number in each shell
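For reference, the app used as the basis is essentially the Getting Started example; this is a minimal sketch, assuming `SLACK_BOT_TOKEN`, `SLACK_APP_TOKEN`, and `PORT` are supplied via the environment.

```typescript
import { App } from '@slack/bolt';

// Minimal Socket Mode app, essentially the Bolt JS Getting Started example.
const app = new App({
  token: process.env.SLACK_BOT_TOKEN,
  appToken: process.env.SLACK_APP_TOKEN,
  socketMode: true,
});

(async () => {
  // Each instance is started in its own shell with a different PORT value.
  await app.start(Number(process.env.PORT) || 3000);
  console.log('⚡️ Bolt app is running!');
})();
```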
Expected result:

Each instance of the app should start up successfully without the `WebSocket error occurred: Unexpected server response: 409` error.

Actual result:

First we see the Socket Mode client successfully initialized. Then we see the `409` error and the following logs repeat over and over again until:

Additional Notes
It seems that a successfully established connection does not mean the app is in a "safe" state. While running, I've observed both `Primary WebSocket error occurred: Unexpected server response: 409` and `Secondary WebSocket error occurred: Unexpected server response: 409` errors as well. The behavior is basically identical to what is described above.

I've also observed this on `@slack/bolt-js` versions as early as `3.20.0`.