Rate-limiting while trying to build RealTime::Stores::Store cache leads to keep_alive? failures

milestruecar commented 1 year ago

https://github.com/slack-ruby/slack-ruby-client/blob/bf5bb4be3e03e9bd74d22942a61d5611f5c1d997/lib/slack/real_time/stores/store.rb#L328-L378

on initial connect, while cycling through users and conversations in this hello method, we're getting rate-limited and receiving 429 responses from slack. it looks like there's retry logic somewhere that's working, but somewhere along the way, before the cache actually completes (but never in a consistent spot), the connection just kinda dies, followed by another rtm.connect and another attempt to build the cache, which leads to more rate limiting, then another disconnect/reconnect, and so on forever.

after messing with the code locally, in this keep_alive? method https://github.com/slack-ruby/slack-ruby-client/blob/bf5bb4be3e03e9bd74d22942a61d5611f5c1d997/lib/slack/real_time/client.rb#L100-L116 while we're getting 429ed, the @socket.time_since_last_message eventually grows beyond (websocket_ping * 2) and knocks everything over. After adding some sleep calls to the cache build to try to mitigate the 429s and then disabling that keep_alive? entirely, the cache finished building and the app finally went into a steady state.

i had some ideas for possible fixes (longer timeout threshold if the last response was a 429 which i'm not entirely sure how to interrogate inside that keep_alive? method, just disabling keep_alive? until after the cache finishes which seems... not ideal, or adding sleeps to the user and conversation loops in hello method which would probably be silly for people with smaller slack footprints) but none of them seem great, and i'm not exactly a socket expert. so i wanted some guidance on what would make sense to y'all before i put together a PR

milestruecar commented 1 year ago

while fiddling with this to try to find some short term solution, i set a higher websocket_ping time but noticed that even when we aren't getting rate-limited, socket.time_since_last_message just keeps going up. sample output from local branch with some extra logging thrown in

[2023-06-26T16:05:30.778-07:00] INFO: POST https://slack.com/api/conversations.list
[2023-06-26T16:05:30.778-07:00] INFO: Accept: "application/json; charset=utf-8"
User-Agent: "Slack Ruby Client/2.1.1"
Authorization: [token]
Content-Type: "application/x-www-form-urlencoded"
[2023-06-26T16:05:30.864-07:00] INFO: @socket connected? true
[2023-06-26T16:05:30.864-07:00] INFO: time since last message: 749.5803269999997
[2023-06-26T16:05:30.864-07:00] INFO: websocket ping: 300
[2023-06-26T16:05:30.864-07:00] WARN: is offline
[2023-06-26T16:05:30.865-07:00] INFO: POST https://slack.com/api/rtm.connect
[2023-06-26T16:05:30.865-07:00] INFO: Accept: "application/json; charset=utf-8"
User-Agent: "Slack Ruby Client/2.1.1"
Authorization: [token]
[2023-06-26T16:05:31.117-07:00] INFO: Status 200

dblock commented 1 year ago

This is quite interesting, thank you for digging deep into this.

The RTM API tracks messages from slack, so socket.time_since_last_message will just keep going up until you get something meaningful from Slack on the RTM connection. Is building the cache blocking those messages? I think that's the real bug and a proper fix would be to ensure that the local cache is built asynchronously and isn't blocking messages. As an experiment throw a very long sleep instead of this code that makes requests and see the same behavior?

dblock commented 1 year ago

I think https://github.com/slack-ruby/slack-ruby-client/issues/237 is related.

milestruecar commented 1 year ago

yeah same behavior with long sleep, cache build is blocking

milestruecar commented 1 year ago

opened a PR, seemed to work out well locally. unblocks the socket.time_since_last_message at least. Not sure if there are any potential side effects but seemed pretty straightforward.

dblock commented 1 year ago

@milestruecar Nice. Let's finish the work in the PR.

dblock commented 11 months ago

Closing via https://github.com/slack-ruby/slack-ruby-client/pull/486

slack-ruby / slack-ruby-client

Rate-limiting while trying to build RealTime::Stores::Store cache leads to keep_alive? failures #485