slackapi / bolt-js

A framework to build Slack apps using JavaScript
https://tools.slack.dev/bolt-js/
MIT License
2.75k stars 393 forks source link

Socket mode: Unhandled event 'server explicit disconnect' in state 'connecting' #2225

Closed okovpashko closed 2 weeks ago

okovpashko commented 2 months ago

@slack/bolt version

3.21.1

Your App and Receiver Configuration

e.g. const myApp = new App({ ... what options are you using? })

const app = new App({
  token: env.SLACK_BOT_TOKEN,
  socketMode: true,
  appToken: env.SLACK_APP_TOKEN,
  logger,
  logLevel: env.LOG_LEVEL,
  customRoutes,
  port: env.APP_PORT,
});

Node.js runtime version

v20.17.0

Steps to reproduce:

N/A

Expected result:

No unexpected disconnection errors

Actual result:

Regular disconnection errors:

Unhandled event 'server explicit disconnect' in state 'connecting'.

Here's the stack trace from Sentry

image

and the latest error history

image

Requirements

After upgrading Bolt.js from 3.19.0 to 3.21.1, we started receiving disconnection errors multiple times daily on various environments (prod and testing) of projects that use the Socket Mode. At least one of those projects didn't have any other logic and/or infrastructure changes.

It looks like there's an error was introduced in one of the latest updates. Could you please confirm the issue or suggest where to dig if it's on our side?

seratch commented 2 months ago

Hi @okovpashko, we're sorry for the disruption during the upgrade and thank you for taking the time to report this.

It seems there is an issue on the underlying @slack/socket-mode library side. To get more detailed information on the situation, could you share the outputs of grep "@slack/" package-lock.json for your project?

We're curious about the versions of the underlying @slack/ npm packages. You're probably using @slack/socket-mode@1.3.x, and indeed the pattern might be unhandled even with the latest version of the 1.3.x series: https://github.com/slackapi/node-slack-sdk/blob/%40slack/socket-mode%401.3.6/packages/socket-mode/src/SocketModeClient.ts#L267-L283

One thing I am wondering is if this issue might be a potential issue even with bolt-js 3.19 because there is no specific code on the bolt-js side to reconnect for Socket Mode disconnections. I'm not sure about the direct trigger that caused the new error pattern for your app yet, but upgrading @slack/socket-mode etc. this time might be a possible cause. To resolve this issue, @slack/socket-mode 1.3.x should make a patch release to enable it functional with this kind of race condition situation.

Aside from that, our team has a long-term plan to rewrite the Socket Mode client for Node.js users to eliminate these kinds of errors, and the underlying module 2.x is already published, although bolt-js does not support it yet. I don't have detailed information about the future release plans at this moment, but @filmaj will be able to share more details on this.

okovpashko commented 2 months ago

@seratch thank you for the rapid response.

Here's my grep output:

         "@slack/bolt": "^3.21.1",
    "node_modules/@slack/bolt": {
      "resolved": "https://registry.npmjs.org/@slack/bolt/-/bolt-3.21.1.tgz",
        "@slack/logger": "^4.0.0",
        "@slack/oauth": "^2.6.3",
        "@slack/socket-mode": "^1.3.6",
        "@slack/types": "^2.11.0",
        "@slack/web-api": "^6.12.1",
    "node_modules/@slack/bolt/node_modules/@slack/logger": {
      "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-4.0.0.tgz",
    "node_modules/@slack/bolt/node_modules/path-to-regexp": {
    "node_modules/@slack/logger": {
      "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-3.0.0.tgz",
    "node_modules/@slack/oauth": {
      "resolved": "https://registry.npmjs.org/@slack/oauth/-/oauth-2.6.3.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "node_modules/@slack/socket-mode": {
      "resolved": "https://registry.npmjs.org/@slack/socket-mode/-/socket-mode-1.3.6.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "node_modules/@slack/types": {
      "resolved": "https://registry.npmjs.org/@slack/types/-/types-2.11.0.tgz",
    "node_modules/@slack/web-api": {
      "resolved": "https://registry.npmjs.org/@slack/web-api/-/web-api-6.12.1.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/types": "^2.11.0",
    "node_modules/@slack/web-api/node_modules/eventemitter3": {
    "node_modules/@slack/web-api/node_modules/is-stream": {
    "@slack/bolt": {
      "resolved": "https://registry.npmjs.org/@slack/bolt/-/bolt-3.21.1.tgz",
        "@slack/logger": "^4.0.0",
        "@slack/oauth": "^2.6.3",
        "@slack/socket-mode": "^1.3.6",
        "@slack/types": "^2.11.0",
        "@slack/web-api": "^6.12.1",
        "@slack/logger": {
          "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-4.0.0.tgz",
    "@slack/logger": {
      "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-3.0.0.tgz",
    "@slack/oauth": {
      "resolved": "https://registry.npmjs.org/@slack/oauth/-/oauth-2.6.3.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "@slack/socket-mode": {
      "resolved": "https://registry.npmjs.org/@slack/socket-mode/-/socket-mode-1.3.6.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "@slack/types": {
      "resolved": "https://registry.npmjs.org/@slack/types/-/types-2.11.0.tgz",
    "@slack/web-api": {
      "resolved": "https://registry.npmjs.org/@slack/web-api/-/web-api-6.12.1.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/types": "^2.11.0",

I took a look at the Git history in our project and noticed that the @slack/socket-mode package was updated from 1.3.3 to 1.3.6 along with the @slack/bolt. Probably it might be the cause.

Anyway, your reply is very helpful as now I understand that it's not the issue on our side, you are aware of it and working on the fix.

filmaj commented 2 months ago

Hello, this is unfortunately a design flaw with socket-mode 1.x. It is a duplicate of https://github.com/slackapi/node-slack-sdk/issues/1787. You can find the details for this problem in that issue thread.

I don't think this is a problem that was introduced between socket-mode 1.3.3 and 1.3.6, but you can try locking to the earlier version to see if the problem goes away on your end. Several customers started reporting issues related to this functionality in May of this year, which prompted a long investigation. Ultimately, socket-mode v2 was released that completely changes the design of the underlying socket-mode module and has addressed the issue for affected customers. You can give socket-mode v2 a try by using bolt@4.0.0-rc.1 - but fair warning, it is still in release candidacy as we are still some distance away from having a 4.0 of bolt-js ready.

If possible, if you have debug logs available leading up to the problem to share, please post them in https://github.com/slackapi/node-slack-sdk/issues/1787 - I am collecting such behaviours and trying to add them to the integration tests in socket-mode to write more realistic test scenarios to better help prevent regressions. Understanding the state of the state machine in socket-mode and the sequence of events sent by Slack's backend would be helpful.

okovpashko commented 2 months ago

@filmaj thank you for the explanation. I posted the debug logs to the issue you mentioned.

okovpashko commented 2 months ago

I've also just noticed the disconnection reason too_many_websockets. I'm wondering if there is an issue with for example not closing the previous connections on our end. We run bots in Kubernetes and I'm sure that we kill all the obsolete containers in the namespace.

filmaj commented 2 months ago

I've also just noticed the disconnection reason too_many_websockets. I'm wondering if there is an issue with for example not closing the previous connections on our end.

Ah yes, definitely leftover / hanging network connections not cleanly closed would contribute to the backend sending too_many_websockets.

Some context: each Slack app ID is allowed a maximum of 10 concurrent websocket (socket-mode) connections. Some customers sometimes create too many app instances (scale their app beyond 10 instances) which can lead to this problem. Possibly, as you mention, not explicitly stopping the server or cleanly closing the network connection could, at least temporarily, cause the error as well (until the network connection becomes severed or the Slack backend realizes the connection is no longer established).

When Slack detects it has more than 10 concurrent websocket connections open with a single app, it will send a type: warning message to the app with reason: too_many_websockets. By default, all socket-mode implementations that we at Slack manage (i.e. node @slack/socket-mode but also the Python and Java versions) will accept this message from the server and try to re-establish a connection with Slack. That is what "event 'server explicit disconnect'" means in the error message: Slack sent a type:warning message to the app, and the app is trying to re-establish a connection. This could lead to an error loop behaviour: Slack sends a type:warning or type:disconnect message, socket-mode tries to re-establish a connection, the app is already at the maximum concurrent number of connections allowed, and so fails.

Ultimately, I suggest looking into the reason that Slack is sending too_many_websockets: that is the actual problem; the 'explicit server disconnect' exception is just a symptom of that problem.

Thank you for sharing logs in the other issue, I will inspect them and see if that can help me re-create a test scenario; once a test scenario is written, then it is much easier to experiment with different solutions. ❤️

filmaj commented 1 month ago

FYI a release candidate with this fix is available in @slack/bolt@4.0.0-rc.2. It is a pre-release, however, so you will need to explicitly install that version, and since it is a breaking change, it is included with other breaking changes. However, there is a migration guide available to help guide you through the changes requiring other modifications to your app.

filmaj commented 2 weeks ago

The fix here is available in the new major bolt v4 version, FYI.