matrix-org / sygnal

Sygnal: reference Push Gateway for Matrix

Sygnal does not recover cleanly if it builds up a queue of messages #144

Open

michaelkaye opened 4 years ago

michaelkaye commented 4 years ago

By inspection, sygnal runs into problems when it builds a queue of messages (for whatever reason).

We should shed load more aggressively by dropping queued messages, so that this doesn't need manual intervention.

The flow as I see it from an incident today:

t=0 - high response times from the remote server cause a queue to build up
t=10 - response times drop back to normal levels; the queue has built up to > 2k items
t=15 - the queue continues to build even though response times are back at the earlier, normal levels
t=20 - sygnal is restarted - the backlog of requests is dropped; retries cause another spike in load and in response times, but no queue builds up. Once the retried messages are cleared, all metrics return to normal.

It seems to me that the queue itself is causing the performance issues. A request that is more than 60s old is likely to have been thrown away by whatever HTTP request servlets are in the flow, or rejected by a load balancer or proxy, so we should discard such requests rather than attempt to process them.
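A minimal sketch of that idea (hypothetical names and structure, not Sygnal's actual queue handling): check each queued item's age before doing any work on it, and drop anything older than the upstream timeout.

```python
import time

# Hypothetical illustration of age-based load shedding; Sygnal's real
# notification handling lives in its pushkin classes and looks different.
MAX_QUEUE_AGE_SECONDS = 60  # older requests were likely already dropped upstream


class QueuedPush:
    def __init__(self, notification):
        self.notification = notification
        self.enqueued_at = time.monotonic()


def drain_queue(queue, send):
    """Send queued pushes, discarding entries too old to still matter."""
    for item in queue:
        age = time.monotonic() - item.enqueued_at
        if age > MAX_QUEUE_AGE_SECONDS:
            # The caller (load balancer, reverse proxy, or Synapse) has almost
            # certainly timed this request out already, so the work is wasted.
            continue
        send(item.notification)
```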

erikjohnston commented 4 years ago

This is for APNs. Have some graphs:

[Screenshot: Selection_026]

(This is https://grafana.matrix.org/d/rg3tCbWWk/sygnal?orgId=1&from=1596660536006&to=1596661385527&var-bucket_size=$__auto_interval_bucket_size&var-instance=corus.priv.matrix.org:7564 for those with access)

erikjohnston commented 4 years ago

We are running v0.7.1, which is out of date. Let's update it.

michaelkaye commented 4 years ago

Looking better now; we're recovering cleanly from this failure.

It still causes up to 50% of pushes to fail back to synapse during this period:

[Screenshots: only_one_pusher, snappy]

Interestingly, this is only happening on one of the two synapse pushers - perhaps something to investigate there.

michaelkaye commented 4 years ago

(times are off because one is in UTC and the other in BST)

erikjohnston commented 3 years ago

One potential cause to look into is whether we're correctly timing out connections/requests. If we have a fixed connection pool then it's possible for all connections to get stuck for an age.
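As a sketch of that hypothesis (illustrative only, not how Sygnal currently wires up its HTTP client), Twisted lets you bound how long an outbound request's Deferred may stay outstanding, so a stuck connection cannot hold a pool slot forever:

```python
from twisted.internet import reactor

REQUEST_TIMEOUT_SECONDS = 30  # assumed value, purely for illustration


def send_with_timeout(make_request):
    """Wrap an outbound push request so it cannot stay in flight indefinitely.

    `make_request` is a hypothetical callable returning a Deferred for the
    HTTP request (e.g. from treq or twisted.web.client.Agent).
    """
    d = make_request()
    # Cancel the request if it has not completed within the timeout; by
    # default this errbacks with twisted.internet.defer.TimeoutError.
    d.addTimeout(REQUEST_TIMEOUT_SECONDS, reactor)
    return d
```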

michaelkaye commented 3 years ago

This failed again today, and seems to have failed earlier in the week as well.

If we're not going to spend time addressing this issue, could someone clean up this PR so we can at least not page anyone over it:

https://github.com/matrix-org/sygnal/pull/new/michaelkaye/crash_dont_page

There's no point waking people up to restart a service - it should just exit and allow systemd to restart it.
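For illustration, the shape of that workaround might look something like the following (hypothetical thresholds and option names; the actual crash_dont_page change may differ): if the queue stays pathologically large, exit non-zero and let systemd's restart policy bring the process back up.

```python
import logging
import sys

logger = logging.getLogger(__name__)

# Hypothetical values, not options Sygnal actually exposes.
CRASH_ON_OVERLOAD = False          # off by default
OVERLOAD_QUEUE_THRESHOLD = 2000    # roughly the backlog seen in the incident


def maybe_exit_for_restart(queue_length):
    """Exit so systemd (e.g. Restart=on-failure) restarts the process,
    instead of paging a human to do the same restart by hand."""
    if CRASH_ON_OVERLOAD and queue_length > OVERLOAD_QUEUE_THRESHOLD:
        logger.critical(
            "queue length %d exceeds %d; exiting so the service manager can restart us",
            queue_length, OVERLOAD_QUEUE_THRESHOLD,
        )
        sys.exit(1)
```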

callahad commented 3 years ago

This seems like it might take a while to properly diagnose and debug, so we're inclined to merge the crash_dont_page change as a near-term fix, but with it off by default.

clokep commented 3 years ago

FYI, the graphs above seem to be based on the sygnal_active_apns_requests metric. It would be interesting to see sygnal_inflight_request_limit_drop for the pushkin when this happens, since it is supposed to be limited to 512 in-flight requests at once.
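For context, a rough sketch of what such an in-flight cap looks like in general (hypothetical code; the real 512-request limit and the sygnal_inflight_request_limit_drop counter are implemented inside Sygnal's APNs pushkin and will differ):

```python
from twisted.internet import defer

MAX_IN_FLIGHT = 512  # the limit mentioned above


class InflightLimiter:
    """Reject new work once too many requests are already outstanding."""

    def __init__(self, limit=MAX_IN_FLIGHT):
        self._limit = limit
        self._in_flight = 0
        self.dropped = 0  # analogous to an "inflight limit drop" counter

    @defer.inlineCallbacks
    def run(self, make_request):
        if self._in_flight >= self._limit:
            self.dropped += 1
            raise RuntimeError("in-flight request limit reached; dropping push")
        self._in_flight += 1
        try:
            result = yield make_request()
        finally:
            self._in_flight -= 1
        defer.returnValue(result)
```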

callahad commented 3 years ago

@michaelkaye Could you please get the CI to pass on #163 then ping us again to get it reviewed and merged?

michaelkaye commented 3 years ago

@callahad could someone who's a Python developer take it over, then? The PR was opened as an example of the change I'd like to see; I don't know whether there are any edge cases or best practices around Python development here, or whether any Twisted-specific changes are needed.

Please don't block this bugfix on me - I don't want to spend time round-tripping reviews when all I'd be doing is learning Twisted.

michaelkaye commented 3 years ago

(closing PR as it was example code only)