matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Investigate why `/messages` is slow #13356

Open MadLittleMods opened 2 years ago

MadLittleMods commented 2 years ago

Background

For the Matrix Public Archive, we're getting a lot of 504 timeouts waiting for Matrix API requests to finish. Specifically, /messages seems to be the slow culprit, and I haven't really heard anything about optimizing it the way /join has been optimized. From a clean Synapse cache, /messages often takes 80s - 120s to respond, and many times Synapse itself times out.

Here is the average data for #matrix:matrix.org from the gitter.ems.host homeserver, sampled every 61 minutes to be outside of the state cache expiry, for `/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter={"lazy_load_members":true}`:

  • ?limit=500: 111.81s (492 events / 129 state events)
  • ?limit=100: 119.504s (92 events / 50 state events)
  • ?limit=20: 112.541s (20 events / 11 state events)
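For reference, here is a minimal sketch of how one of these samples can be timed outside the browser; it assumes a valid access token in a `MATRIX_TOKEN` environment variable, and the room ID is the one mapped to #matrix:matrix.org in the timing script further down in this issue.

```python
# Minimal sketch: time a single /messages request like the samples above.
# Assumes a MATRIX_TOKEN environment variable; URL and params are the same
# ones quoted in this issue.
import os
import time

import requests

MATRIX_TOKEN = os.environ["MATRIX_TOKEN"]
ROOM_ID = "!OGEhHVWSdvArJzumhm:matrix.org"  # #matrix:matrix.org (per the script below)

url = (
    f"https://matrix-client.matrix.org/_matrix/client/r0/rooms/{ROOM_ID}/messages"
    "?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500"
    "&filter=%7B%22lazy_load_members%22:true%7D"
)

start = time.monotonic()
res = requests.get(url, headers={"Authorization": f"Bearer {MATRIX_TOKEN}"}, timeout=180)
elapsed = time.monotonic() - start

body = res.json() if res.ok else {}
print(
    f"status={res.status_code} took {elapsed:.2f}s, "
    f"{len(body.get('chunk', []))} events / {len(body.get('state', []))} state events"
)
```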

After Synapse has a warm cache (I am guessing), these requests take about 5s - 10s.

What can we do to make the /messages requests faster? Has any thought already been put into this? Among existing issues, I see that we might be backfilling when we shouldn't, which aligns with the request being slow regardless of how big the limit is: https://github.com/matrix-org/synapse/issues/10742

Caching the response on our end in the archive doesn't make the initial request any faster.

From the matrix.org Grafana, I looked into the RoomMessageListRestServlet timing and it all appears very fast (5ms - 500ms), which is not what I experience. I am guessing it's not good at catching the outliers since the metric seems to just be total time divided by samples. But I would expect a lot more 1 second to 3 second timings even after a warm cache.

-- @MadLittleMods, https://matrix.to/#/!vcyiEtMVHIhWXcJAfl:sw1v.org/$7ctkqN00-4HiYP5TT7bnfROfGXP5Lyt7yyWjrg8bGu4?via=matrix.org&via=element.io&via=beeper.com

Traces for reference

These include a bunch of the /messages instrumentation added:

Why is it slow?

Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11

This part is WIP. I'm just documenting some of the log diving I've done into why /messages is slow. I still want to do this on some more requests, and hopefully get access to Jaeger to compare and investigate that way too.

1. Backfill linearizer lock takes forever

Notes:

Timing summary from looking at the logs:

  1. 03:42:43.129 (t+0s): Received request
  2. 03:42:43.712 (t+0s) Waiting to acquire linearizer lock 'room_backfill'
  3. 03:43:26.806 (t+43s) Acquired linearizer lock 'room_backfill'
  4. 03:43:26.825 (t+43s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
    • get_oldest_event_ids_with_depth_in_room only took .12s to get the 4.6k events
  5. 03:43:28.716 (t+45s) Asking t2bot.io for backfill
  6. 03:43:28.985 (t+45s) t2bot.io responded with 100 events
  7. 03:43:29.009 (t+46s) Starting to verify content hashes
  8. 03:43:29.105 (t+46s) Done verifying content hashes
  9. 03:43:29.106 (t+46s) Processing the pulled events
  10. 03:43:29.147 (t+46s) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 03:43:29.348 (t+46s) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 03:43:34.026 (t+51s) _get_state_groups_from_groups start
  13. 03:43:38.470 (t+55s) _get_state_groups_from_groups end
  14. 03:43:41.112 (t+58s) Response sent

This isn't a fluke; here is another request (GET-196) I went through in the logs on gitter.ems.host. This time the duration matched the Synapse logs pretty well:

  1. 20:38:38.049 (t+0s) Received request
  2. 20:38:38.062 (t+0s) Waiting to acquire linearizer lock 'room_backfill'
  3. 20:39:23.622 (t+45s) Acquired linearizer lock 'room_backfill'
  4. 20:39:23.640 (t+45s) _maybe_backfill_inner and pulled 4.6k events out of the database as potential backfill points
  5. 20:39:25.625 (t+47s) Asking t2bot.io for backfill
  6. 20:39:35.262 (t+57s) t2bot.io responded with 100 events
  7. 20:39:35.283 (t+...) Starting to verify content hashes
  8. 20:39:35.382 (t+...) Done verifying content hashes
  9. 20:39:35.383 (t+...) Processing the pulled events
  10. 20:39:35.424 (t+...) Done processing 100 backfilled events except for $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E
  11. 20:39:35.651 (t+...) Done trying to de-outlier $0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E (404 /state_ids from t2bot.io)
  12. 20:39:43.668 (t+65s) Response sent
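In both requests above, roughly 45 of the ~60s went to just waiting on the 'room_backfill' linearizer lock: backfill attempts for the same room are serialized, so each /messages request queues behind whichever request is currently backfilling. Below is a minimal asyncio sketch of that per-key queueing pattern; it is a generic illustration of the idea only, not Synapse's actual Linearizer code.

```python
# Generic per-key linearizer sketch (illustrative only, not Synapse's
# synapse.util.async_helpers.Linearizer): work for the same key runs one at a
# time, so one slow backfill holds up every later request for that room.
import asyncio
from collections import defaultdict


class Linearizer:
    def __init__(self) -> None:
        # One lock per key; callers for the same key serialize on it.
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

    def queue(self, key: str) -> asyncio.Lock:
        return self._locks[key]


linearizer = Linearizer()


async def handle_messages_request(room_id: str, request_num: int) -> None:
    print(f"request {request_num}: waiting for 'room_backfill' lock on {room_id}")
    async with linearizer.queue(room_id):
        print(f"request {request_num}: acquired lock, backfilling...")
        await asyncio.sleep(2)  # stand-in for a slow backfill (federation + persistence)
    print(f"request {request_num}: done")


async def main() -> None:
    # Three concurrent requests for the same room: the 2nd and 3rd spend most
    # of their time just waiting on the lock, like the ~45s waits in the logs.
    await asyncio.gather(*(handle_messages_request("!room:example.org", i) for i in range(3)))


asyncio.run(main())
```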

2. Loading tons of events

Notes:

Timing summary from looking at the logs:

  1. 02:09:51.026 (t+0s) Received request
  2. 02:09:51.959 (t+1s) _maybe_backfill_inner pulled 4.6k events out of the database as potential backfill points
  3. 02:09:52.726 (t+2s) synapse.storage.databases.main.events_worker Loading 79277 events (why?)
  4. 02:10:10.462 (t+19s) Also fetching redaction events
  5. 02:10:10.462 (t+19s) Loaded 79277 events
  6. 02:10:23.246 (t+31s) Done redacting 105 events (why?)
  7. 02:10:23.779 (t+31s) Asking t2bot.io for backfill
  8. 02:10:33.290 (t+41s) t2bot.io responded with 100 events
  9. ... (t+...) 2k calls to get_recent_references_for_event and get_aggregation_groups_for_event (why?)
  10. 02:10:54.261 (t+63s) Response sent
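The timing summaries above were pieced together by hand from log timestamps. A small helper like the following computes the `t+Ns` offsets automatically; it is a sketch that assumes each relevant log line begins with an `HH:MM:SS.mmm` timestamp (not Synapse's full log format).

```python
# Sketch: compute "t+Ns" offsets from log lines that begin with an
# HH:MM:SS.mmm timestamp (the exact log prefix format is an assumption here).
import re
from datetime import datetime

TIMESTAMP_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+(.*)$")


def summarize(log_lines: list[str]) -> None:
    start = None
    for line in log_lines:
        match = TIMESTAMP_RE.match(line.strip())
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if start is None:
            start = ts
        offset = (ts - start).total_seconds()
        print(f"(t+{offset:.0f}s) {match.group(2)}")


summarize(
    [
        "02:09:51.026 Received request",
        "02:09:52.726 Loading 79277 events",
        "02:10:54.261 Response sent",
    ]
)
```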

3. Slow /state_ids requests but also really slow processing

  • 524 timeout: 123.26s, 2022-07-22 @4:43
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm`
    https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab
    • 19.24s `/backfill` request
    • 53.45s `/state_ids` request
  • 524 timeout: 117.96s, 2022-07-22 @4:56
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c`
    https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9
    • 921ms `/backfill` request
    • 47.24s `/state_ids` request
  • 524 timeout: 115.64s, 2022-07-22 @5:02:33
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g`
    ```json
    {
      "ok": false,
      "timing": 99185.70000004768,
      "startTime": 1658527353492,
      "endTime": 1658527452680,
      "requestId": "p8c09g"
    }
    ```
    https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae
    • 14.49s `/backfill` request
    • 54.47s `/state_ids` request
  • 200 ok: 83.51s
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw`
    ```json
    {
      "ok": true,
      "status": 200,
      "timing": 83771.90000009537,
      "numEventsReturned": 491,
      "numStateEventsReturned": 129,
      "startTime": 1658527887072,
      "endTime": 1658527970847,
      "requestId": "cjfpw",
      "traceId": "ae7c694e57113282"
    }
    ```
    https://jaeger.proxy.matrix.org/trace/ae7c694e57113282
    • 526ms `/backfill` request
    • 42.77s `/state_ids` request
  • 200 ok: 75.7s, 2022-07-22 @5:27pm
    `https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k`
    ```json
    {
      "ok": true,
      "status": 200,
      "timing": 76131.40000009537,
      "numEventsReturned": 491,
      "numStateEventsReturned": 129,
      "startTime": 1658528812471,
      "endTime": 1658528888604,
      "requestId": "wg6g8k",
      "traceId": "d048d9f59c20555c"
    }
    ```
    https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c
    • 549ms `/backfill` request
    • 32.58s `/state_ids` request

/messages timing script

Every 61 minutes (to be outside of the state cache expiry), this will call and time /messages for each room defined, with ?limit= set to 500, 100, and 20 to see whether the number of messages returned matters for the timing.

Run in the browser. Define let MATRIX_TOKEN = 'xxx'; before running the script.

matrix-messages-timing.js

```js
let resultantTimingData = {};

async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(
      `https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`,
      {
        headers: {
          'Authorization': `Bearer ${MATRIX_TOKEN}`
        }
      }
    );
  } catch (err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
  }

  // `res.headers` is a `Headers` object, so read the header via `.get()`
  const traceId = res?.headers?.get('synapse-trace-id');

  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [500, 100, 20];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has a 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;

  const REPEAT = 5;

  if (!MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];
        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);

        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }
        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);
        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      });
    }
  }
})();
```

```js
// Average the successful timings for a given room and message limit
// (`data1` is the collected timing data, e.g. a pasted copy of `resultantTimingData`)
function getAvg(roomId, numMessages) {
  const timingEntries = data1[roomId][numMessages];
  return timingEntries
    .filter((timingEntry) => timingEntry.ok)
    .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / timingEntries.length;
}

getAvg('matrix:matrix.org', "500")
```
richvdh commented 2 years ago

possibly related: https://github.com/matrix-org/synapse/issues/13677

MadLittleMods commented 2 years ago

Progress

We've made some good progress on these slow 180s-type requests. You can see the big drop in these long-running requests around Sept. 26 after we shipped https://github.com/matrix-org/synapse/pull/13635 (made possible by all of the PRs before it). This metric can be influenced by other servers being slow to respond to /backfill, /state_ids, /state, and /event requests over federation, so it's not like we can eliminate all slow requests, but it's clear that they happen a lot less now.

https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1663806404570&to=1665016004570&viewPanel=230 Grafana graph showing: Of requests that take more than 5s, what percentage take 180s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 120s or longer

Grafana graph showing: Of requests that take more than 5s, what percentage take 60s or longer

The effects don't really show up for RoomMessageListRestServlet overall. Perhaps that's because most requests have a low ?limit, don't have to backfill, are going forwards instead of backwards, etc. Another influencing factor is that I have only focused on the big requests that took 180s+ before and now take around 30s with the optimizations, and that isn't within range of the Apdex thresholds (2.5s satisfied, 10s tolerable) to show progress. I tried playing with those thresholds but didn't see much. Too many normal fast requests 🤷
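For reference, the standard Apdex score is (satisfied + tolerating / 2) / total; with the thresholds mentioned above, "satisfied" means a request under 2.5s and "tolerating" means under 10s. Here is a quick sketch (with made-up request counts) of why a handful of 30s requests barely moves the score when most requests are already fast:

```python
# Sketch: standard Apdex score with the 2.5s/10s thresholds mentioned above.
# A few 30s outliers barely move the score when most requests are fast.
def apdex(durations_s, satisfied_t=2.5, tolerating_t=10.0):
    satisfied = sum(1 for d in durations_s if d <= satisfied_t)
    tolerating = sum(1 for d in durations_s if satisfied_t < d <= tolerating_t)
    return (satisfied + tolerating / 2) / len(durations_s)


# 1000 fast requests plus 10 slow ones (illustrative numbers only):
print(apdex([0.3] * 1000 + [180.0] * 10))  # ~0.990
print(apdex([0.3] * 1000 + [30.0] * 10))   # ~0.990 -- the 180s -> 30s improvement doesn't show
```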

What's left?

/messages is still nowhere close to fast enough: the benchmark request I have been running still takes 30s when it used to take 180s. A great goal would be to get all of these down to 5s or less.

Here is the latest trace now that we're down in the ~30s range, https://gist.github.com/MadLittleMods/c3ace249ce2aec77cb78d19a4f85a776

~30s /messages trace in Jaeger

As you can see, we need to:

  1. Speed up filter_events_for_client (this takes 7.83s in that request), https://github.com/matrix-org/synapse/issues/14108
  2. Speed up get_bundled_aggregations (this takes 3.84s in that request), https://github.com/matrix-org/synapse/issues/13624
  3. Speed up the maybe_backfill part (this takes 15.29s in that request):
    • Speed up _process_pulled_events in backfill (this takes 5.15s in that request), https://github.com/matrix-org/synapse/issues/13623
    • 7.24s is from the /backfill outgoing federation request which we have less control over unless we can figure out that we don't actually need to backfill in that situation.
    • That still leaves 2.9s of random stuff which should surely be able to go away 🤞

If we can get rid of the time in the bolded numbers, that saves us 20s and gets us down to 10s /messages requests. We will have to look at the traces again once we get /messages down to the 10s mark.

MadLittleMods commented 1 year ago

Progress report

The benchmark /messages?dir=b&limit=500 request is down to ~4s now (we originally started at 185s+).

Full Jaeger trace JSON

Benchmark request: https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D
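To see where a ~4s trace like this spends its time, the exported trace JSON can be aggregated per operation. The following is a sketch that assumes Jaeger's usual export shape (`{"data": [{"spans": [...]}]}` with `operationName` and `duration` in microseconds); nested spans of the same operation are not deduplicated, and the file path is illustrative.

```python
# Sketch: total time per operation in a Jaeger trace JSON export.
# Assumes the standard export shape: {"data": [{"spans": [{"operationName", "duration", ...}]}]}
# where `duration` is in microseconds; adjust if your export differs.
import json
from collections import Counter


def time_per_operation(trace_path: str) -> Counter:
    with open(trace_path) as f:
        trace = json.load(f)
    totals: Counter = Counter()
    for span in trace["data"][0]["spans"]:
        totals[span["operationName"]] += span["duration"] / 1_000_000  # µs -> s
    return totals


if __name__ == "__main__":
    totals = time_per_operation("messages-trace.json")  # path is illustrative
    for operation, seconds in totals.most_common(15):
        print(f"{seconds:7.2f}s  {operation}")
```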

What's left?

The big things:

  1. get_state_for_events/get_state_ids_for_event -> _get_state_for_groups -> _get_state_groups_from_groups took 1.94s in that request
    • This consists of long database queries that take 500ms - 1000ms each.
    • We've optimized the filter_events_for_client -> get_state_for_events -> _get_state_groups_from_groups path before in https://github.com/matrix-org/synapse/pull/14527, where we only need the m.room.history_visibility and m.room.member events.
    • I also tried tackling the more general problem, where we have to grab the full state out, in https://github.com/matrix-org/synapse/pull/15617 but found that our state_group sharing is really sub-par. Maybe it's more interesting to improve our state_group logic to be much smarter first and then re-visit that PR. Or look into the state compressor stuff to optimize our backlog.
    • It would be really nice to get some gains out of this but it's proving hard.
  2. paginate_room_events -> get_events_as_list took 551ms in that request
    • :star: Seems like lots of individual lookups that we could simplify (see the batching sketch after this list)
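Here is a sketch of the batching idea for the get_events_as_list item above: fetch many events in one query instead of one query per event. The table and column names (`event_json(event_id, json)`) and the connection string are stand-ins for illustration, not necessarily Synapse's actual schema.

```python
# Sketch of batching event lookups, assuming an illustrative Postgres table
# `event_json(event_id TEXT PRIMARY KEY, json TEXT)` (not Synapse's actual schema).
import json
from typing import Dict, List

import psycopg2


def get_events_batched(conn, event_ids: List[str]) -> Dict[str, dict]:
    """Fetch many events in one round trip instead of one query per event."""
    with conn.cursor() as cur:
        # One `WHERE event_id = ANY(...)` query replaces N individual lookups,
        # saving N-1 network round trips and per-query overheads.
        cur.execute(
            "SELECT event_id, json FROM event_json WHERE event_id = ANY(%s)",
            (event_ids,),
        )
        return {event_id: json.loads(raw) for event_id, raw in cur.fetchall()}


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=synapse")  # connection string is illustrative
    events = get_events_batched(conn, ["$event1", "$event2", "$event3"])
    print(f"fetched {len(events)} events in a single query")
```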

The smaller things:

But all of that notable stuff only totals up to 3,057ms, so there is still 373ms of random baggage.


In terms of waterfalls:

There is a 160ms mystery gap after encode_json_response

What's achievable?

It feels do-able to reduce the response time by another 1s (I've starred some things above). For example, if we get another /messages response that looks like this trace, that could get us down to the target 1.85s mark where we could say "we've improved /messages by 100x".