matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Investigate why `/messages` is slow #13356

Open MadLittleMods opened 2 years ago

MadLittleMods commented 2 years ago

Background

For the Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically, /messages seems to be the slow culprit, and I haven't really heard anything about optimizing it the way /join has been optimized. From a clean Synapse cache, /messages often takes 80s - 120s to respond, and many times Synapse itself times out.

Here is the average data for #matrix:matrix.org from the gitter.ems.host homeserver, sampled every 61 minutes to be outside of the state cache expiry, for `/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}`:

  • ?limit=500: 111.81s (492 events / 129 state events)
  • ?limit=100: 119.504s (92 events / 50 state events)
  • ?limit=20: 112.541s (20 events / 11 state events)
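For reference, here is a minimal sketch of how one of these samples can be timed outside the browser; it assumes a valid access token in a `MATRIX_TOKEN` environment variable, and the room ID is the one mapped to #matrix:matrix.org in the timing script further down in this issue.

```python
# Minimal sketch: time a single /messages request like the samples above.
# Assumes a MATRIX_TOKEN environment variable; URL and params are the same
# ones quoted in this issue.
import os
import time

import requests

MATRIX_TOKEN = os.environ["MATRIX_TOKEN"]
ROOM_ID = "!OGEhHVWSdvArJzumhm:matrix.org"  # #matrix:matrix.org (per the script below)

url = (
    f"https://matrix-client.matrix.org/_matrix/client/r0/rooms/{ROOM_ID}/messages"
    "?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500"
    "&filter=%7B%22lazy_load_members%22:true%7D"
)

start = time.monotonic()
res = requests.get(url, headers={"Authorization": f"Bearer {MATRIX_TOKEN}"}, timeout=180)
elapsed = time.monotonic() - start

body = res.json() if res.ok else {}
print(
    f"status={res.status_code} took {elapsed:.2f}s, "
    f"{len(body.get('chunk', []))} events / {len(body.get('state', []))} state events"
)
```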

After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s.

What can we do to make the /messages requests faster? Has any thought already been put into this? Among existing issues, I see that we might be backfilling when we shouldn't, which aligns with the request being slow regardless of how big the limit is: https://github.com/matrix-org/synapse/issues/10742

Caching the response on our end in the archive doesn't make the initial request any faster.

From the matrix.org Grafana, I looked into the RoomMessageListRestServlet timing and it all appears very fast (5ms - 500ms), which is not what I experience. I am guessing it's not good at catching the outliers since the metric seems to just be total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.

-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com

Traces for reference

These include a bunch of the /messages instrumentation added:

Why is it slow?

Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11

This part is WIP. I'm just documenting some of the log diving I've done into why /messages is slow. I still want to do this on some more requests, and hopefully get access to Jaeger to compare and investigate that way too.

1. Backfill linearizer lock takes forever

Notes:

Timing summary from looking at the logs:

  1. 03:42:43.129 (t+0s): Received request
  2. 03:42:43.712 (t+0s) Waiting to acquire linearizer lock 'room_backfill'
  3. 03:43:26.806 (t+43s) Acquired linearizer lock 'room_backfill'
  4. 03:43:26.825 (t+43s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
    • get_oldest_event_ids_with_depth_in_room only took .12s to get the 4.6k events
  5. 03:43:28.716 (t+45s) Asking t2bot.io for backfill
  6. 03:43:28.985 (t+45s) t2bot.io responded with 100 events
  7. 03:43:29.009 (t+46s) Starting to verify content hashes
  8. 03:43:29.105 (t+46s) Done verifying content hashes
  9. 03:43:29.106 (t+46s) Processing the pulled events
  10. 03:43:29.147 (t+46s) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 03:43:29.348 (t+46s) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 03:43:34.026 (t+51s) _get_state_groups_from_groups start
  13. 03:43:38.470 (t+55s) _get_state_groups_from_groups end
  14. 03:43:41.112 (t+58s) Response sent

This isn't a fluke; here is another request (GET-196) I went through in the logs on gitter.ems.host. This time the duration matched the Synapse logs pretty well:

  1. 20:38:38.049 (t+0s) Received request
  2. 20:38:38.062 (t+0s) Waiting to acquire linearizer lock 'room_backfill'
  3. 20:39:23.622 (t+45s) Acquired linearizer lock 'room_backfill'
  4. 20:39:23.640 (t+45s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
  5. 20:39:25.625 (t+47s) Asking t2bot.io for backfill
  6. 20:39:35.262 (t+57s) t2bot.io responded with 100 events
  7. 20:39:35.283 (t+...) Starting to verify content hashes
  8. 20:39:35.382 (t+...) Done verifying content hashes
  9. 20:39:35.383 (t+...) Processing the pulled events
  10. 20:39:35.424 (t+...) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 20:39:35.651 (t+...) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 20:39:43.668 (t+65s) Response sent
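In both requests above, roughly 45 of the ~60s went to just waiting on the 'room_backfill' linearizer lock: backfill attempts for the same room are serialized, so each /messages request queues behind whichever request is currently backfilling. Below is a minimal asyncio sketch of that per-key queueing pattern; it is a generic illustration of the idea only, not Synapse's actual Linearizer code.

```python
# Generic per-key linearizer sketch (illustrative only, not Synapse's
# synapse.util.async_helpers.Linearizer): work for the same key runs one at a
# time, so one slow backfill holds up every later request for that room.
import asyncio
from collections import defaultdict


class Linearizer:
    def __init__(self) -> None:
        # One lock per key; callers for the same key serialize on it.
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def queue(self, key: str) -> asyncio.Lock:
        return self._locks[key]


linearizer = Linearizer()


async def handle_messages_request(room_id: str, request_num: int) -> None:
    print(f"request {request_num}: waiting for 'room_backfill' lock on {room_id}")
    async with linearizer.queue(room_id):
        print(f"request {request_num}: acquired lock, backfilling...")
        await asyncio.sleep(2)  # stand-in for a slow backfill (federation + persistence)
    print(f"request {request_num}: done")


async def main() -> None:
    # Three concurrent requests for the same room: the 2nd and 3rd spend most
    # of their time just waiting on the lock, like the ~45s waits in the logs.
    await asyncio.gather(*(handle_messages_request("!room:example.org", i) for i in range(3)))


asyncio.run(main())
```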

2. Loading tons of events

Notes:

Timing summary from looking at the logs:

  1. 02:09:51.026 (t+0s) Received request
  2. 02:09:51.959 (t+1s) _maybe_backfill_inner pulled 4.6k events out of the database as potential backfill points
  3. 02:09:52.726 (t+2s) synapse.storage.databases.main.events_worker Loading 79277 events (why?)
  4. 02:10:10.462 (t+19s) Also fetching redaction events
  5. 02:10:10.462 (t+19s) Loaded 79277 events
  6. 02:10:23.246 (t+31s) Done redacting 105 events (why?)
  7. 02:10:23.779 (t+31s) Asking t2bot.io for backfill
  8. 02:10:33.290 (t+41s) t2bot.io responded with 100 events
  9. ... (t+...) 2k calls to get_recent_references_for_event and get_aggregation_groups_for_event (why?)
  10. 02:10:54.261 (t+63s) Response sent
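The timing summaries above were pieced together by hand from log timestamps. A small helper like the following computes the `t+Ns` offsets automatically; it is a sketch that assumes each relevant log line begins with an `HH:MM:SS.mmm` timestamp (not Synapse's full log format).

```python
# Sketch: compute "t+Ns" offsets from log lines that begin with an
# HH:MM:SS.mmm timestamp (the exact log prefix format is an assumption here).
import re
from datetime import datetime

TIMESTAMP_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+(.*)$")


def summarize(log_lines: list[str]) -> None:
    start = None
    for line in log_lines:
        match = TIMESTAMP_RE.match(line.strip())
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if start is None:
            start = ts
        offset = (ts - start).total_seconds()
        print(f"(t+{offset:.0f}s) {match.group(2)}")


summarize(
    [
        "02:09:51.026 Received request",
        "02:09:52.726 Loading 79277 events",
        "02:10:54.261 Response sent",
    ]
)
```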

3. Slow /state_ids requests but also really slow processing

  • 524 timeout: 123.26s, 2022-07-22 @4:43
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm`
    https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab
    • 19.24s `/backfill` request
    • 53.45s `/state_ids` request
  • 524 timeout: 117.96s, 2022-07-22 @4:56
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c`
    https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9
    • 921ms `/backfill` request
    • 47.24s `/state_ids` request
  • 524 timeout: 115.64s, 2022-07-22 @5:02:33
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g`
    ```json
    {
      "ok": false,
      "timing": 99185.70000004768,
      "startTime": 1658527353492,
      "endTime": 1658527452680,
      "requestId": "p8c09g"
    }
    ```
    https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae
    • 14.49s `/backfill` request
    • 54.47s `/state_ids` request
  • 200 ok: 83.51s
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw`
    ```json
    {
      "ok": true,
      "status": 200,
      "timing": 83771.90000009537,
      "numEventsReturned": 491,
      "numStateEventsReturned": 129,
      "startTime": 1658527887072,
      "endTime": 1658527970847,
      "requestId": "cjfpw",
      "traceId": "ae7c694e57113282"
    }
    ```
    https://jaeger.proxy.matrix.org/trace/ae7c694e57113282
    • 526ms `/backfill` request
    • 42.77s `/state_ids` request
  • 200 ok: 75.7s, 2022-07-22 @5:27pm
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k`
    ```json
    {
      "ok": true,
      "status": 200,
      "timing": 76131.40000009537,
      "numEventsReturned": 491,
      "numStateEventsReturned": 129,
      "startTime": 1658528812471,
      "endTime": 1658528888604,
      "requestId": "wg6g8k",
      "traceId": "d048d9f59c20555c"
    }
    ```
    https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c
    • 549ms `/backfill` request
    • 32.58s `/state_ids` request

/messages timing script

Every 61 minutes (to be outside of the state cache expiry), this will call and time /messages for each room defined, with ?limit= set to 500, 100, and 20 to see whether the number of messages returned matters for the timing.

Run in the browser. Define let MATRIX_TOKEN = 'xxx'; before running the script.

matrix-messages-timing.js

```js
let resultantTimingData = {};

async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(
      `https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`,
      {
        headers: {
          'Authorization': `Bearer ${MATRIX_TOKEN}`
        }
      }
    );
  } catch (err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
  }

  // `res.headers` is a `Headers` object, so read the header via `.get()`
  const traceId = res?.headers?.get('synapse-trace-id');

  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [500, 100, 20];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has a 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;

  const REPEAT = 5;

  if (!MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];
        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);

        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }
        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);
        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      });
    }
  }
})();
```

```js
// Average the successful timings for a given room and message limit
// (`data1` is the collected timing data, e.g. a pasted copy of `resultantTimingData`)
function getAvg(roomId, numMessages) {
  const timingEntries = data1[roomId][numMessages];
  return timingEntries
    .filter((timingEntry) => timingEntry.ok)
    .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / timingEntries.length;
}

getAvg('matrix:matrix.org', "500")
```
richvdh commented 2 years ago

possibly related: https://github.com/matrix-org/synapse/issues/13677

MadLittleMods commented 2 years ago

Progress

We've made some good progress on these slow 180s-type requests. You can see the big drop in these long-running requests around Sept. 26 after we shipped https://github.com/matrix-org/synapse/pull/13635 (made possible by all of the PRs before it). This metric can be influenced by other servers being slow to respond to /backfill, /state_ids, /state, and /event requests over federation, so it's not like we can eliminate all slow requests, but it's clear that they happen a lot less now.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1663806404570&to=1665016004570&viewPanel=230 Grafana graph showing: Of requests that take more than 5s, what percentage take 180s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 120s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 60s or longer

The effects don't really show up for RoomMessageListRestServlet overall. Perhaps that's because most requests have a low ?limit, don't have to backfill, are going forwards instead of backwards, etc. Another influencing factor is that I have only focused on the big requests that took 180s+ before and now take around 30s with the optimizations, and that isn't within range of the Apdex thresholds (2.5s satisfied, 10s tolerable) to show progress. I tried playing with those thresholds but didn't see much. Too many normal fast requests 🤷
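For reference, the standard Apdex score is (satisfied + tolerating / 2) / total; with the thresholds mentioned above, "satisfied" means a request under 2.5s and "tolerating" means under 10s. Here is a quick sketch (with made-up request counts) of why a handful of 30s requests barely moves the score when most requests are already fast:

```python
# Sketch: standard Apdex score with the 2.5s/10s thresholds mentioned above.
# A few 30s outliers barely move the score when most requests are fast.
def apdex(durations_s, satisfied_t=2.5, tolerating_t=10.0):
    satisfied = sum(1 for d in durations_s if d <= satisfied_t)
    tolerating = sum(1 for d in durations_s if satisfied_t < d <= tolerating_t)
    return (satisfied + tolerating / 2) / len(durations_s)


# 1000 fast requests plus 10 slow ones (illustrative numbers only):
print(apdex([0.3] * 1000 + [180.0] * 10))  # ~0.990
print(apdex([0.3] * 1000 + [30.0] * 10))   # ~0.990 -- the 180s -> 30s improvement doesn't show
```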

What's left?

/messages is still nowhere close to fast enough: the benchmark request I have been running still takes 30s when it used to take 180s. A great goal would be to get all of these down to 5s or less.

Here is the latest trace now that we're down in the ~30s range, https://gist.github.com/MadLittleMods/c3ace249ce2aec77cb78d19a4f85a776

~30s /messages trace in Jaeger

As you can see, we need to:

  1. Speed up filter_events_for_client (this takes 7.83s in that request), https://github.com/matrix-org/synapse/issues/14108
  2. Speed up get_bundled_aggregations (this takes 3.84s in that request), https://github.com/matrix-org/synapse/issues/13624
  3. Speed up the maybe_backfill part (this takes 15.29s in that request):
    • Speed up _process_pulled_events in backfill (this takes 5.15s in that request), https://github.com/matrix-org/synapse/issues/13623
    • 7.24s is from the /backfill outgoing federation request which we have less control over unless we can figure out that we don't actually need to backfill in that situation.
    • That still leaves 2.9s of random stuff which should surely be able to go away 🤞

If we can get rid of the time in the bolded numbers, that saves us 20s and gets us down to 10s /messages requests. We will have to look at the traces again once we get /messages down to the 10s mark.

MadLittleMods commented 1 year ago

Progress report

The benchmark /messages?dir=b&limit=500 request is down to ~4s now (we originally started at 185s+).

Full Jaeger trace JSON

Benchmark request: https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D
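To see where a ~4s trace like this spends its time, the exported trace JSON can be aggregated per operation. The following is a sketch that assumes Jaeger's usual export shape (`{"data": [{"spans": [...]}]}` with `operationName` and `duration` in microseconds); nested spans of the same operation are not deduplicated, and the file path is illustrative.

```python
# Sketch: total time per operation in a Jaeger trace JSON export.
# Assumes the standard export shape: {"data": [{"spans": [{"operationName", "duration", ...}]}]}
# where `duration` is in microseconds; adjust if your export differs.
import json
from collections import Counter


def time_per_operation(trace_path: str) -> Counter:
    with open(trace_path) as f:
        trace = json.load(f)
    totals: Counter = Counter()
    for span in trace["data"][0]["spans"]:
        totals[span["operationName"]] += span["duration"] / 1_000_000  # µs -> s
    return totals


if __name__ == "__main__":
    totals = time_per_operation("messages-trace.json")  # path is illustrative
    for operation, seconds in totals.most_common(15):
        print(f"{seconds:7.2f}s  {operation}")
```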

What's left?

The big things:

  1. get_state_for_events/get_state_ids_for_event -> _get_state_for_groups -> _get_state_groups_from_groups took 1.94s in that request
    • This consists of long database queries that take 500ms - 1000ms each.
    • We've optimized the filter_events_for_client -> get_state_for_events -> _get_state_groups_from_groups path before in https://github.com/matrix-org/synapse/pull/14527, where we only need the m.room.history_visibility and m.room.member events.
    • I also tried tackling the more general problem, where we have to grab the full state out, in https://github.com/matrix-org/synapse/pull/15617 but found that our state_group sharing is really sub-par. Maybe it's more interesting to improve our state_group logic to be much smarter first and then re-visit that PR. Or look into the state compressor stuff to optimize our backlog.
    • It would be really nice to get some gains out of this but it's proving hard.
  2. paginate_room_events -> get_events_as_list took 551ms in that request
    • :star: Seems like lots of individual lookups that we could simplify (see the batching sketch after this list)
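Here is a sketch of the batching idea for the get_events_as_list item above: fetch many events in one query instead of one query per event. The table and column names (`event_json(event_id, json)`) and the connection string are stand-ins for illustration, not necessarily Synapse's actual schema.

```python
# Sketch of batching event lookups, assuming an illustrative Postgres table
# `event_json(event_id TEXT PRIMARY KEY, json TEXT)` (not Synapse's actual schema).
import json
from typing import Dict, List

import psycopg2


def get_events_batched(conn, event_ids: List[str]) -> Dict[str, dict]:
    """Fetch many events in one round trip instead of one query per event."""
    with conn.cursor() as cur:
        # One `WHERE event_id = ANY(...)` query replaces N individual lookups,
        # saving N-1 network round trips and per-query overheads.
        cur.execute(
            "SELECT event_id, json FROM event_json WHERE event_id = ANY(%s)",
            (event_ids,),
        )
        return {event_id: json.loads(raw) for event_id, raw in cur.fetchall()}


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=synapse")  # connection string is illustrative
    events = get_events_batched(conn, ["$event1", "$event2", "$event3"])
    print(f"fetched {len(events)} events in a single query")
```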

The smaller things:

But all of that notable stuff only totals up to 3,057ms, so there is still 373ms of random baggage.


In terms of waterfalls:

There is a 160ms mystery gap after encode_json_response

What's achievable?

It feels do-able to reduce the response time by another 1s (I've starred some things above). For example, if we get another /messages response that looks like this trace, that could get us down to the target 1.85s mark where we could say "we've improved /messages by 100x".