filmaj commented 3 months ago

This PR addresses two, maybe three bugs, some of which are breaking changes (but I want to release a new major version of this library soon, so now is the time):

When calling disconnect() and already disconnected, do not raise an exception.
BREAKING CHANGE: When receiving a too_many_websockets type:disconnect message from Slack, do not raise it as a slack_eventlifecycle event. Also refactor the internal state machine events, combine the reason:warning and reason:refresh_requested type: disconnect messages into one event (since they are handled the same).

Previous Working Notes

Expanding the socket-mode integration tests as I try to help a Salesforce team get to the root of some rare exceptions they are seeing in their socket mode app. It seems if the backend returns "too many websockets" this package may get itself into an unrecoverable loop.

Question: what should Bolt / this package do if it receives a type:disconnect, reason:too_many_websockets message? Reconnecting is probably bad (as we could inadvertently cause a stampede of reconnecting->disconnecting events)?

Answer: probably the same as python Slack SDK and Java Slack SDK, which is to reconnect for all type:disconnect messages received from Slack backend.

I see the Python library has some mentions of too_many_websockets, and specifically calls out to close the previous connection first before opening a new one. I do not think node-slack-sdk socket-mode does this.. it does the opposite: creates a new connection first, and then closes the old one (based on this code).

Here is an excerpt of the production logs that I am trying to reproduce via these tests:

2024-03-19T07:54:32.276559+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"warning","debug_info":{"host":"applink-9"}}
2024-03-19T07:54:32.276617+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received "disconnect" (warning) message - creating the second connection
2024-03-19T07:54:32.276667+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Cancelled the job waiting for ping from Slack
2024-03-19T07:54:32.276717+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Terminated the heart beat job
2024-03-19T07:54:32.276717+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
2024-03-19T07:54:32.276740+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Reconnecting to Slack ...
2024-03-19T07:54:32.276881+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting
2024-03-19T07:54:32.277031+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Going to establish a new connection to Slack ...
2024-03-19T07:54:32.277065+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
2024-03-19T07:54:32.277084+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
2024-03-19T07:54:32.277128+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') start
2024-03-19T07:54:32.277501+00:00 app[app.1]: [DEBUG]  bolt-app http request url: https://slack.com/api/apps.connections.open
2024-03-19T07:54:32.284315+00:00 app[app.1]: [DEBUG]  bolt-app http request body: {}
2024-03-19T07:54:32.284316+00:00 app[app.1]: [DEBUG]  bolt-app http request headers: {}
2024-03-19T07:54:32.313400+00:00 app[app.1]: [DEBUG]  bolt-app http response received
2024-03-19T07:54:32.313527+00:00 app[app.1]: [DEBUG]  bolt-app http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=XXX&app_id=XXX","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
2024-03-19T07:54:32.313528+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') end
2024-03-19T07:54:32.313529+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
2024-03-19T07:54:32.333581+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
2024-03-19T07:54:32.345322+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-1"}}
2024-03-19T07:54:32.346427+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.347037+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))
2024-03-19T07:54:32.347154+00:00 app[app.1]: [DEBUG]   Error details: TypeError: Cannot read properties of undefined (reading 'event'), retry num: undefined, retry reason: undefined
2024-03-19T07:54:32.348089+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
2024-03-19T07:54:32.348144+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Reconnecting to Slack ...
2024-03-19T07:54:32.348213+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting
2024-03-19T07:54:32.348214+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Going to establish a new connection to Slack ...
2024-03-19T07:54:32.348332+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
2024-03-19T07:54:32.348333+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
2024-03-19T07:54:32.348333+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') start
2024-03-19T07:54:32.349712+00:00 app[app.1]: [DEBUG]  bolt-app http request url: https://slack.com/api/apps.connections.open
2024-03-19T07:54:32.349714+00:00 app[app.1]: [DEBUG]  bolt-app http request body: {}
2024-03-19T07:54:32.349714+00:00 app[app.1]: [DEBUG]  bolt-app http request headers: {}
2024-03-19T07:54:32.372319+00:00 app[app.1]: [DEBUG]  bolt-app http response received
2024-03-19T07:54:32.372493+00:00 app[app.1]: [DEBUG]  bolt-app http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=XXX&app_id=XXX","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
2024-03-19T07:54:32.372495+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') end
2024-03-19T07:54:32.372524+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
2024-03-19T07:54:32.388933+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
2024-03-19T07:54:32.401919+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-13"}}
2024-03-19T07:54:32.402230+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.402614+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))
2024-03-19T07:54:32.402697+00:00 app[app.1]: [DEBUG]   Error details: TypeError: Cannot read properties of undefined (reading 'event'), retry num: undefined, retry reason: undefined
2024-03-19T07:54:32.403304+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
2024-03-19T07:54:32.403352+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Reconnecting to Slack ...
2024-03-19T07:54:32.403458+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting
2024-03-19T07:54:32.403528+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Going to establish a new connection to Slack ...
2024-03-19T07:54:32.403610+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
2024-03-19T07:54:32.403661+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
2024-03-19T07:54:32.403725+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') start
2024-03-19T07:54:32.404132+00:00 app[app.1]: [DEBUG]  bolt-app http request url: https://slack.com/api/apps.connections.open
2024-03-19T07:54:32.404196+00:00 app[app.1]: [DEBUG]  bolt-app http request body: {}
2024-03-19T07:54:32.404281+00:00 app[app.1]: [DEBUG]  bolt-app http request headers: {}
2024-03-19T07:54:32.424951+00:00 app[app.1]: [DEBUG]  bolt-app http response received
2024-03-19T07:54:32.425081+00:00 app[app.1]: [DEBUG]  bolt-app http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=XXX&app_id=XXX","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
2024-03-19T07:54:32.425126+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') end
2024-03-19T07:54:32.425230+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
2024-03-19T07:54:32.440719+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
2024-03-19T07:54:32.450793+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-14"}}
2024-03-19T07:54:32.450982+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.451351+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))
2024-03-19T07:54:32.451362+00:00 app[app.1]: [DEBUG]   Error details: TypeError: Cannot read properties of undefined (reading 'event'), retry num: undefined, retry reason: undefined

seratch commented 3 months ago

@filmaj Thanks for thorough investigation on this!

I see the Python library has some mentions of too_many_websockets, and specifically calls out to close the previous connection first before opening a new one

The main reason behind the comment is to prevent errors when an app scales out to 10 instances; the maximum number of allowed connections per app. When 10 app instances are already up and running, it's worth first closing a connection before starting a new one. That's the reason why Python SDK does so.

If the reported situation is exactly the same, aligning with Python SDK could be beneficial. However, if the root cause is different, we might need to apply other improvements as well.

Generally speaking, when an app encounters the too_many_websockets error, the only available resolution is to reduce the number of active connections. It's viable to resolve the situation solely through client-side SDK changes.

If this library does not properly close some connections when swtiching them, we can definitely improve the logic to prevent such occurrences. However, I haven't yet identified such a bug exists. I've refactored some parts of the Node library a while ago, but I didn't modify the reconnecting logic at all then. We may need to improve it to make it even more stable. I will also check it out when I have time for it.

filmaj commented 3 months ago

Thanks for your comments @seratch. I think this library closes connections cleanly, so I am not sure that is something we need to consider.

It makes sense to me to close the connection first before opening a new one if we have a max-connection constraint. Perhaps the fact the node library does not do this is simply an oversight from many years ago. In any case, given what you've said, it makes sense to me to make this change in this library.

I know it is a nervous proposition to make a significant behaviour change like that in this library, but at least the good news is that the library is now ready for a new major release (dropping EOL node versions support), so perhaps now is the time to make such a behaviour change? Additionally, with the existence of the new socket server integration tests, I have more confidence making behaviour changes as we can more fully test the state machine transitions for different, more complex sequences between socket peers.

filmaj commented 3 months ago

One more note here. The following excerpt from the above logs is from bolt-js - not from socket-mode - so the interaction is a bit complex to tease out:

2024-03-19T07:54:32.345322+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-1"}}
2024-03-19T07:54:32.346427+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.347037+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))

My guess is that socket-mode, when it receives a type:disconnect,reason:too_many_websockets actually continues to propagate the event up as a full slack_event that the socket-mode module exposes (see bottom of Lifecycle Events documentation). This is different behaviour from when it receives other kinds of type:disconnect messages; for these messages it will NOT propagate it up as a slack_event. I think this is also a mistake, and leads to the above error through this code path in bolt-js (and in particular through the catch clause), and specifically:

body, retryNum and retryReason are null or undefined when too_many_websockets gets raised as a slack_event. Everything except ack and type here would be null/undefined, I believe.
bolt attempts to process the raised slack_event from the socket mode module
in developer mode (which I'm pretty sure the app from which the above logs I posted come from has enabled), the body gets logged out. Bolt therefore logs out [DEBUG] bolt-app undefined
bolt tries to getTypeAndConversation on the undefined body
this method inspects a body.event subproperty as a first step. Since body is undefined, this leads to an exception...
.. and explains the last log we see above: Cannot read properties of undefined (reading 'event')

filmaj commented 2 months ago

@seratch I got further reports from customers that raising type:disconnect messages as slack_event can cause exceptions in their bolt app (as I described in this comment).

Question for you: do you think preventing the raising of type:disconnect messages as a slack_event would qualify as a major change? Perhaps it would be considered a patch change? I was thinking of reverting the Event enum name changes in this PR and just keeping:

no exception thrown when calling disconnect() and already disconnected, and
do not raise type:disconnect messages as a slack_event

.. and releasing a new patch release in the 1.3.x socket-mode line.

What do you think @seratch ?

I think this would help customers experiencing issues today while giving us space to work on the python-based port for the next major version.

seratch commented 2 months ago

@filmaj I don’t think it’s a major change in this context. So, we can ship it as a patch release!

slackapi / node-slack-sdk

socket-mode: do not throw exception when calling disconnect() and already disconnected; do not raise slack_event in case of type:disconnect messages #1762

Previous Working Notes