Closed filmaj closed 2 months ago
@filmaj Thanks for thorough investigation on this!
I see the Python library has some mentions of too_many_websockets, and specifically calls out to close the previous connection first before opening a new one
The main reason behind the comment is to prevent errors when an app scales out to 10 instances; the maximum number of allowed connections per app. When 10 app instances are already up and running, it's worth first closing a connection before starting a new one. That's the reason why Python SDK does so.
If the reported situation is exactly the same, aligning with Python SDK could be beneficial. However, if the root cause is different, we might need to apply other improvements as well.
Generally speaking, when an app encounters the too_many_websockets error, the only available resolution is to reduce the number of active connections. It's viable to resolve the situation solely through client-side SDK changes.
If this library does not properly close some connections when swtiching them, we can definitely improve the logic to prevent such occurrences. However, I haven't yet identified such a bug exists. I've refactored some parts of the Node library a while ago, but I didn't modify the reconnecting logic at all then. We may need to improve it to make it even more stable. I will also check it out when I have time for it.
Thanks for your comments @seratch. I think this library closes connections cleanly, so I am not sure that is something we need to consider.
It makes sense to me to close the connection first before opening a new one if we have a max-connection constraint. Perhaps the fact the node library does not do this is simply an oversight from many years ago. In any case, given what you've said, it makes sense to me to make this change in this library.
I know it is a nervous proposition to make a significant behaviour change like that in this library, but at least the good news is that the library is now ready for a new major release (dropping EOL node versions support), so perhaps now is the time to make such a behaviour change? Additionally, with the existence of the new socket server integration tests, I have more confidence making behaviour changes as we can more fully test the state machine transitions for different, more complex sequences between socket peers.
One more note here. The following excerpt from the above logs is from bolt-js - not from socket-mode - so the interaction is a bit complex to tease out:
2024-03-19T07:54:32.345322+00:00 app[app.1]: [DEBUG] socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-1"}}
2024-03-19T07:54:32.346427+00:00 app[app.1]: [DEBUG] bolt-app undefined
2024-03-19T07:54:32.347037+00:00 app[app.1]: [ERROR] An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))
My guess is that socket-mode, when it receives a type:disconnect,reason:too_many_websockets
actually continues to propagate the event up as a full slack_event
that the socket-mode module exposes (see bottom of Lifecycle Events documentation). This is different behaviour from when it receives other kinds of type:disconnect
messages; for these messages it will NOT propagate it up as a slack_event
. I think this is also a mistake, and leads to the above error through this code path in bolt-js (and in particular through the catch
clause), and specifically:
body
, retryNum
and retryReason
are null or undefined when too_many_websockets
gets raised as a slack_event
. Everything except ack
and type
here would be null/undefined, I believe.slack_event
from the socket mode modulebody
gets logged out. Bolt therefore logs out [DEBUG] bolt-app undefined
getTypeAndConversation
on the undefined
body
body.event
subproperty as a first step. Since body
is undefined
, this leads to an exception...Cannot read properties of undefined (reading 'event')
@seratch I got further reports from customers that raising type:disconnect
messages as slack_event
can cause exceptions in their bolt app (as I described in this comment).
Question for you: do you think preventing the raising of type:disconnect
messages as a slack_event
would qualify as a major change? Perhaps it would be considered a patch change? I was thinking of reverting the Event
enum name changes in this PR and just keeping:
disconnect()
and already disconnected, andtype:disconnect
messages as a slack_event
.. and releasing a new patch release in the 1.3.x socket-mode line.
What do you think @seratch ?
I think this would help customers experiencing issues today while giving us space to work on the python-based port for the next major version.
@filmaj I don’t think it’s a major change in this context. So, we can ship it as a patch release!
This PR addresses two, maybe three bugs, some of which are breaking changes (but I want to release a new major version of this library soon, so now is the time):
disconnect()
and already disconnected, do not raise an exception.too_many_websockets
type:disconnect
message from Slack, do not raise it as aslack_event
lifecycle event. Also refactor the internal state machine events, combine thereason:warning
andreason:refresh_requested
type: disconnect messages
into one event (since they are handled the same).Previous Working Notes
Expanding the socket-mode integration tests as I try to help a Salesforce team get to the root of some rare exceptions they are seeing in their socket mode app. It seems if the backend returns "too many websockets" this package may get itself into an unrecoverable loop.
Question: what should Bolt / this package do if it receives a
type:disconnect, reason:too_many_websockets
message? Reconnecting is probably bad (as we could inadvertently cause a stampede of reconnecting->disconnecting events)?Answer: probably the same as python Slack SDK and Java Slack SDK, which is to reconnect for all
type:disconnect
messages received from Slack backend.I see the Python library has some mentions of
too_many_websockets
, and specifically calls out to close the previous connection first before opening a new one. I do not think node-slack-sdk socket-mode does this.. it does the opposite: creates a new connection first, and then closes the old one (based on this code).Here is an excerpt of the production logs that I am trying to reproduce via these tests: