monzo / response

Monzo's real-time incident response and reporting tool ⚡️
MIT License
1.52k stars 165 forks source link

Rate limiting issues with users_list call #234

Open Limess opened 3 years ago

Limess commented 3 years ago

We're seeing issues starting up services due to rate limiting by slack on the users listing calls.

The current retry behaviour is fairly aggressive and retries 10 times with an initial delay of 0.2 seconds.

Slack send a Retry-After header when rate-limiting and suggest their rate limiting is minutely, but that header is not accessible using the api_call method on the slackclient library (at least in v1, I haven't checked v2).

Ideally this would respect that header and retry after a minute or so.

Logs

On startup:

2020-11-19T13:38:46 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.20s (attempt 1 of 10)
2020-11-19T13:38:46 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.20s (attempt 1 of 10)
2020-11-19T13:38:46 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.40s (attempt 2 of 10)
2020-11-19T13:38:46 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.40s (attempt 2 of 10)
2020-11-19T13:38:47 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.60s (attempt 3 of 10)
2020-11-19T13:38:47 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.60s (attempt 3 of 10)
2020-11-19T13:38:48 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.80s (attempt 4 of 10)
2020-11-19T13:38:48 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 0.80s (attempt 4 of 10)
2020-11-19T13:38:49 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.00s (attempt 5 of 10)
2020-11-19T13:38:50 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.00s (attempt 5 of 10)
2020-11-19T13:38:50 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.20s (attempt 6 of 10)
2020-11-19T13:38:51 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.20s (attempt 6 of 10)
2020-11-19T13:38:51 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.40s (attempt 7 of 10)
2020-11-19T13:38:52 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.40s (attempt 7 of 10)
2020-11-19T13:38:53 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.60s (attempt 8 of 10)
2020-11-19T13:38:55 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 1.80s (attempt 9 of 10)
2020-11-19T13:38:57 - default/backend/050d686b89ce41e98c5aeac3e42fea29 - WARNING:root:Retrying request to users.list after error ratelimited. Backing off 2.00s (attempt 10 of 10)

During this time it seems the server is not available, which with large backoffs also result in our healthcheck killing the starting service after some timeout.

Limess commented 3 years ago

On further investigation, it seems this is only on startup, prior to the server starting, due to INCIDENT_BOT_ID not being set. We've now set this and it has helped alleviate this issue as it avoids that initial lookup.