possibly related: https://github.com/matrix-org/synapse/issues/13677
We've made some good progress on these slow 180s-type requests. You can see the big drop in long-running requests around Sept. 26 after we shipped https://github.com/matrix-org/synapse/pull/13635 (made possible by all of the PRs before it). This metric can be influenced by other servers being slow to respond to `/backfill`, `/state_ids`, `/state`, and `/event` over federation, so we can't eliminate all slow requests, but they clearly happen a lot less now.
The effects don't really show for `RoomMessageListRestServlet` overall. Perhaps most requests have a low `?limit`, don't have to backfill, or paginate forwards instead of backwards, etc. Another influencing factor: I have only focused on the big requests that took 180s+ before, which now take around 30s with the optimizations, and that still isn't within range of the Apdex thresholds (2.5s satisfied, 10s tolerable) to show progress. I tried playing with those thresholds but didn't see much change. Too many normal fast requests 🤷 (see the sketch below for why the score barely moves).
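A quick illustration of why Apdex barely moves here, using the standard Apdex formula, `(satisfied + tolerating / 2) / total`. The 2.5s/10s thresholds are the ones above; the latency distribution is entirely made up for the example:

```python
def apdex(latencies, satisfied=2.5, tolerable=10.0):
    """Standard Apdex score: (satisfied + tolerating / 2) / total."""
    sat = sum(1 for t in latencies if t <= satisfied)
    tol = sum(1 for t in latencies if satisfied < t <= tolerable)
    return (sat + tol / 2) / len(latencies)

# Hypothetical distribution: mostly fast requests with a small slow tail.
before = [0.5] * 950 + [5.0] * 30 + [180.0] * 20
after = [0.5] * 950 + [5.0] * 30 + [30.0] * 20  # slow tail improved 6x

# Both print 0.965: a 30s request is still past the 10s "tolerable"
# threshold, so the 180s -> 30s improvement is invisible to the score.
print(apdex(before), apdex(after))
```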
`/messages` is still nowhere near fast enough: my benchmark request, which used to take 180s, still takes about 30s. A great goal would be to get all of these down to 5s or less.
Here is the latest trace now that we're down in the ~30s range: https://gist.github.com/MadLittleMods/c3ace249ce2aec77cb78d19a4f85a776
As you can see, we still need to:

- Optimize `filter_events_for_client` (this takes **7.83s** in that request), https://github.com/matrix-org/synapse/issues/14108
- Optimize `get_bundled_aggregations` (this takes **3.84s** in that request), https://github.com/matrix-org/synapse/issues/13624
- Optimize the `maybe_backfill` part (this takes 15.29s in that request):
    - `_process_pulled_events` in `backfill` (this takes **5.15s** in that request), https://github.com/matrix-org/synapse/issues/13623
    - 7.24s is from the `/backfill` outgoing federation request, which we have less control over unless we can figure out that we don't actually need to backfill in that situation.

If we can get rid of the time in the bolded numbers, that saves us 20s and gets us down to 10s `/messages` requests. We will have to look at the traces again once we get `/messages` down to the 10s mark.
The benchmark `/messages?dir=b&limit=500` request is down to ~4s now (we originally started at 185s+).
Benchmark request: https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D
The big things:

- `get_state_for_events`/`get_state_ids_for_event` -> `_get_state_for_groups` -> `_get_state_groups_from_groups` took 1.94s in that request.
    - We tried optimizing the `filter_events_for_client` -> `get_state_for_events` -> `_get_state_groups_from_groups` path before in https://github.com/matrix-org/synapse/pull/14527, where we only need the `m.room.history_visibility` and `m.room.member` events. `state_group` sharing is really sub-par, though; maybe it's more interesting to make our `state_group` logic much smarter first and then re-visit that PR. Or look into the state compressor stuff to optimize our backlog. (See the state-group sketch after this list.)
- `paginate_room_events` -> `get_events_as_list` took 551ms in that request.
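As a bit of intuition for why `_get_state_groups_from_groups` is expensive and why `state_group` sharing matters: state groups are stored as deltas on top of a previous group, so resolving the full state at an event means walking the chain of `prev` edges and merging the deltas (Synapse does this with a recursive SQL query). A toy Python model of that idea, not Synapse's actual storage code:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

StateKey = Tuple[str, str]  # (event type, state key)

@dataclass
class StateGroup:
    prev: Optional["StateGroup"]
    delta: Dict[StateKey, str]  # state added/changed relative to `prev`

def resolve_state(group: StateGroup) -> Dict[StateKey, str]:
    # Walk the `prev` chain back to the root...
    chain = []
    node: Optional[StateGroup] = group
    while node is not None:
        chain.append(node)
        node = node.prev
    # ...then merge the deltas oldest-first. Long chains and poor sharing
    # between events mean a lot of walking and merging per event.
    state: Dict[StateKey, str] = {}
    for node in reversed(chain):
        state.update(node.delta)
    return state

root = StateGroup(None, {("m.room.create", ""): "$create"})
g1 = StateGroup(root, {("m.room.member", "@alice:example.org"): "$alice-join"})
g2 = StateGroup(g1, {("m.room.history_visibility", ""): "$visibility"})
print(resolve_state(g2))
```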
The smaller things:

- `get_bundled_aggregations` took 200ms in that request.
- `get_state_for_events`/`get_state_ids_for_event` -> `get_state_group_for_events` took 229ms in that request.
- `get_state_for_events`/`get_state_ids_for_event` -> `_get_state_for_groups_using_cache` took 137ms in that request.

But all of that notable stuff only totals 3057ms, so there is still 373ms of random baggage.
In terms of waterfalls (see the parallelization sketch after this list):

- We could run `get_state_ids_for_event` in parallel with `filter_events_for_client`, since it only needs state at the first event. And we don't need to worry about whether the first event would have been filtered, because the state just builds on top.
- Could we re-use the state we fetch in `filter_events_for_client` when we `get_state_ids_for_event` at the end to return in the `/messages` response? That would eliminate the 490ms lookup.
- We could run `get_bundled_aggregations` in parallel with `filter_events_for_client`. We might grab a few more aggregations than needed, but we can always filter them down.
- There is an `ignored_users` -> `are_users_erased` sequential call waterfall in `filter_events_for_client`.
- There is a 160ms mystery gap after `encode_json_response`.
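Here is a sketch of the parallelization idea with `asyncio.gather`. The coroutines are hypothetical stand-ins for the real Synapse functions (made-up signatures, with sleeps in place of the real work), just to show the waterfall collapsing into one concurrent step:

```python
import asyncio

# Hypothetical stand-ins for the real handlers; the sleeps mimic the
# rough relative costs seen in the trace.
async def filter_events_for_client(events):
    await asyncio.sleep(0.10)
    return [e for e in events if e != "$invisible-event"]

async def get_state_ids_for_event(event):
    await asyncio.sleep(0.05)  # the ~490ms state lookup
    return {"m.room.member": "$member-event"}

async def get_bundled_aggregations(events):
    await asyncio.sleep(0.05)
    return {e: f"aggregations-for-{e}" for e in events}

async def handle_messages(events):
    # Waterfall: awaiting these one after another adds their latencies.
    # Parallel: state at the first event doesn't depend on the filtering
    # (the state just builds on top), and extra aggregations can be
    # filtered down afterwards, so all three can run concurrently.
    filtered, state_ids, aggregations = await asyncio.gather(
        filter_events_for_client(events),
        get_state_ids_for_event(events[0]),
        get_bundled_aggregations(events),
    )
    # Keep only aggregations for events that survived filtering.
    aggregations = {k: v for k, v in aggregations.items() if k in set(filtered)}
    return filtered, state_ids, aggregations

print(asyncio.run(handle_messages(["$event1", "$invisible-event", "$event3"])))
```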
It feels do-able to reduce the response time by another 1s (I've starred some things above). If, for example, we can get another `/messages` response that looks like this trace, it could get us down to that target 1.85s mark where we can say "we've improved `/messages` by 100x".
## Background
### Traces for reference

These include a bunch of the `/messages` instrumentation added:

- `/messages` (`RoomMessageListRestServlet`) on matrix.org Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/dfa0dad8a32255b366461c76b55c8ecb
- `/messages` (`RoomMessageListRestServlet`) on matrix.org Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/5b945a1733a414c61be6e393397c3f9d
- `/messages` (`RoomMessageListRestServlet`) on matrix.org Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/d3f9c621707081755ff629000c60b3f3
- `/messages` (`RoomMessageListRestServlet`) on matrix.org Synapse trace - Jaeger trace, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761

### Why is it slow?
Update: We now have a milestone tracking various known slow pieces to improve: https://github.com/matrix-org/synapse/milestone/11
This part is WIP. I'm just documenting some of the log diving I've done into why `/messages` is slow. I still want to do this on some more requests, and hopefully get access to Jaeger to compare and investigate that way too.

#### 1. Backfill linearizer lock takes forever
- Room: `#matrix:matrix.org`
- Request: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
- Duration: 117.808s
- Request ID: `GET-6890` on gitter.ems.host (set to `DEBUG` logging), see this comment.

Notes:

- The whole request took 117.808s, but the timeline below only accounts for ~58s of it. Where did the rest of the time go?

Timing summary from looking at the logs (a sketch of the linearizer queueing follows the timeline):
- `03:42:43.129` (`t+0s`): Received request
- `03:42:43.712` (`t+0s`): Waiting to acquire linearizer lock `'room_backfill'`
- `03:43:26.806` (`t+43s`): Acquired linearizer lock `'room_backfill'`
- `03:43:26.825` (`t+43s`): `_maybe_backfill_inner` pulled 4.6k events out of the database as potential backfill points
    - `get_oldest_event_ids_with_depth_in_room` only took 0.12s to get the 4.6k events
- `03:43:28.716` (`t+45s`): Asking t2bot.io for backfill
- `03:43:28.985` (`t+45s`): t2bot.io responded with 100 events
- `03:43:29.009` (`t+46s`): Starting to verify content hashes
- `03:43:29.105` (`t+46s`): Done verifying content hashes
- `03:43:29.106` (`t+46s`): Processing the pulled events
- `03:43:29.147` (`t+46s`): Done processing 100 backfilled events except for `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E`
- `03:43:29.348` (`t+46s`): Done trying to de-outlier `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E` (404 `/state_ids` from t2bot.io)
- `03:43:34.026` (`t+51s`): `_get_state_groups_from_groups` start
- `03:43:38.470` (`t+55s`): `_get_state_groups_from_groups` end
- `03:43:41.112` (`t+58s`): Response sent
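The striking part is the ~43s spent just waiting on the `'room_backfill'` linearizer lock: backfill is serialized per room, so concurrent `/messages` requests for the same room queue up behind each other. A minimal sketch of that queueing behaviour (a simplified stand-in, not Synapse's actual `Linearizer`):

```python
import asyncio
from collections import defaultdict

# Simplified per-key lock: work for the same key runs one at a time.
_locks: dict = defaultdict(asyncio.Lock)

async def maybe_backfill(room_id: str, request_id: str) -> None:
    print(f"{request_id}: waiting to acquire linearizer lock 'room_backfill'")
    async with _locks[room_id]:
        print(f"{request_id}: acquired linearizer lock 'room_backfill'")
        # Stand-in for the actual backfill work; with ~45s of work per
        # request, the second request spends ~45s just waiting above.
        await asyncio.sleep(1)
    print(f"{request_id}: done")

async def main() -> None:
    # Two concurrent /messages requests for the same room queue up.
    await asyncio.gather(
        maybe_backfill("!OGEhHVWSdvArJzumhm:matrix.org", "GET-6890"),
        maybe_backfill("!OGEhHVWSdvArJzumhm:matrix.org", "GET-196"),
    )

asyncio.run(main())
```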
This isn't a fluke; here is another request (`GET-196`) I went through the logs for. This time the duration matched the Synapse logs pretty well:

- `20:38:38.049` (`t+0s`): Received request
- `20:38:38.062` (`t+0s`): Waiting to acquire linearizer lock `'room_backfill'`
- `20:39:23.622` (`t+45s`): Acquired linearizer lock `'room_backfill'`
- `20:39:23.640` (`t+45s`): `_maybe_backfill_inner` pulled 4.6k events out of the database as potential backfill points
- `20:39:25.625` (`t+47s`): Asking t2bot.io for backfill
- `20:39:35.262` (`t+57s`): t2bot.io responded with 100 events
- `20:39:35.283` (`t+...`): Starting to verify content hashes
- `20:39:35.382` (`t+...`): Done verifying content hashes
- `20:39:35.383` (`t+...`): Processing the pulled events
- `20:39:35.424` (`t+...`): Done processing 100 backfilled events except for `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E`
- `20:39:35.651` (`t+...`): Done trying to de-outlier `$0hu5zprqu6nLC24ISIZL2tL7rpfALh7eOI9MI6CV_1E` (404 `/state_ids` from t2bot.io)
- `20:39:43.668` (`t+65s`): Response sent

#### 2. Loading tons of events
- Room: `#matrix:matrix.org`
- Request: https://gitter.ems.host/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=12345678
- Request ID: `GET-5541`

Notes:

- There are tons of `get_aggregation_groups_for_event` and `get_recent_references_for_event` calls (2k). Why?
- The 79k events loaded below presumably correspond to the `state_events` in the room.

Timing summary from looking at the logs:
- `02:09:51.026` (`t+0s`): Received request
- `02:09:51.959` (`t+1s`): `_maybe_backfill_inner` pulled 4.6k events out of the database as potential backfill points
- `02:09:52.726` (`t+2s`): `synapse.storage.databases.main.events_worker` loading 79277 events (why?)
- `02:10:10.462` (`t+19s`): Also fetching redaction events
- `02:10:10.462` (`t+19s`): Loaded 79277 events
- `02:10:23.246` (`t+31s`): Done redacting 105 events (why?)
- `02:10:23.779` (`t+31s`): Asking t2bot.io for backfill
- `02:10:33.290` (`t+41s`): t2bot.io responded with 100 events
- `...` (`t+...`): 2k calls to `get_recent_references_for_event` and `get_aggregation_groups_for_event` (why?)
- `02:10:54.261` (`t+63s`): Response sent

#### 3. Slow `/state_ids` requests but also really slow processing
**524 timeout: 123.26s**

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=ss4cm`

524 timeout, 2022-07-22 @4:43, 123.26s, https://jaeger.proxy.matrix.org/trace/7c4b7fe54ba6f5ab

- 19.24s `/backfill` request
- 53.45s `/state_ids` request

**524 timeout: 117.96s**

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=0p14c`

524 timeout, 2022-07-22 @4:56, 117.96s, https://jaeger.proxy.matrix.org/trace/e67f019385c47fd9

- 921ms `/backfill` request
- 47.24s `/state_ids` request

**524 timeout: 115.64s**

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=p8c09g`

524 timeout, 2022-07-22 @5:02:33

```json
{
  "ok": false,
  "timing": 99185.70000004768,
  "startTime": 1658527353492,
  "endTime": 1658527452680,
  "requestId": "p8c09g"
}
```

115.64s, https://jaeger.proxy.matrix.org/trace/ef47a44ea445b3ae

- 14.49s `/backfill` request
- 54.47s `/state_ids` request

**200 ok: 83.51s**

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=cjfpw`

200 ok

```json
{
  "ok": true,
  "status": 200,
  "timing": 83771.90000009537,
  "numEventsReturned": 491,
  "numStateEventsReturned": 129,
  "startTime": 1658527887072,
  "endTime": 1658527970847,
  "requestId": "cjfpw",
  "traceId": "ae7c694e57113282"
}
```

83.51s, https://jaeger.proxy.matrix.org/trace/ae7c694e57113282

- 526ms `/backfill` request
- 42.77s `/state_ids` request

**200 ok: 75.7s**

`https://matrix-client.matrix.org/_matrix/client/r0/rooms/!OGEhHVWSdvArJzumhm:matrix.org/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=500&filter=%7B%22lazy_load_members%22:true%7D&foo=wg6g8k`

2022-07-22 @5:27pm

```json
{
  "ok": true,
  "status": 200,
  "timing": 76131.40000009537,
  "numEventsReturned": 491,
  "numStateEventsReturned": 129,
  "startTime": 1658528812471,
  "endTime": 1658528888604,
  "requestId": "wg6g8k",
  "traceId": "d048d9f59c20555c"
}
```

75.7s, https://jaeger.proxy.matrix.org/trace/d048d9f59c20555c

- 549ms `/backfill` request
- 32.58s `/state_ids` request
### `/messages` timing script

Every 61 minutes (to be outside of the state cache expiry), this will call and time `/messages` for each room defined. It does this with `?limit=` values of `500`, `100`, and `20` to see whether the number of messages matters for the timing.

Run it in the browser. Define `let MATRIX_TOKEN = 'xxx';` before running the script.
`matrix-messages-timing.js`:

```js
let resultantTimingData = {};

async function fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch) {
  // via https://stackoverflow.com/a/8084248/796832
  const requestId = (Math.random() + 1).toString(36).substring(7);

  let res;
  let timeBeforeRes;
  let timeAfterRes;
  let startTime;
  let endTime;
  try {
    startTime = Date.now();
    timeBeforeRes = performance.now();
    res = await fetch(`https://matrix-client.matrix.org/_matrix/client/r0/rooms/${roomId}/messages?dir=b&from=t466554-18118302_0_0_0_0_0_0_0_0&limit=${numMessagesToFetch}&filter=%7B%22lazy_load_members%22:true%7D&foo=${requestId}`, {
      headers: {
        'Authorization': `Bearer ${MATRIX_TOKEN}`
      }
    });
  } catch (err) {
    console.error(`Error getting result for ${roomId} at ${numMessagesToFetch} messages`, err);
  } finally {
    endTime = Date.now();
    timeAfterRes = performance.now();
  }

  // `Headers` objects need `.get(...)`; bracket indexing returns undefined
  const traceId = res?.headers?.get('synapse-trace-id');

  let data = {};
  if (res?.ok) {
    data = await res.json();
  }

  const timingEntry = {
    ok: res?.ok || false,
    status: res?.status,
    timing: timeAfterRes - timeBeforeRes,
    numEventsReturned: data?.chunk?.length,
    numStateEventsReturned: data?.state?.length,
    startTime,
    endTime,
    requestId,
    traceId
  };

  return timingEntry;
}

(async () => {
  const ROOMS_TO_TEST = {
    'matrix:matrix.org': '!OGEhHVWSdvArJzumhm:matrix.org',
    'openwisp/general': '!RBzfoBeqYcCwLAAenz:gitter.im',
    'ethereum/solidity-dev': '!poXqlbVpQfXKWGseLY:gitter.im',
    'matrix-org/gitter': '!OvgDmYaPOFRRWnIdQe:half-shot.uk',
    'MadLittleMods/new-room1': '!aMzLHLvScQCGKDNqCB:gitter.im'
  };

  const NUM_MESSAGES_TO_FETCH_MATRIX = [500, 100, 20];

  const ONE_MINUTE_MS = 60 * 1000;
  // The `_state_cache` in Synapse has a 60 minute expiry so we chose something just bigger
  const DELAY_BETWEEN_TRIALS = 61 * ONE_MINUTE_MS;
  const REPEAT = 5;

  if (!MATRIX_TOKEN) {
    console.error('MATRIX_TOKEN is not defined. Please define this before running this script');
  }

  for (let i = 0; i < REPEAT; i++) {
    for (let numMessagesToFetch of NUM_MESSAGES_TO_FETCH_MATRIX) {
      for (let roomKey of Object.keys(ROOMS_TO_TEST)) {
        const roomId = ROOMS_TO_TEST[roomKey];
        const timingEntry = await fetchAndTimeMessagesFromRoom(roomId, numMessagesToFetch);

        // Default the result map so we don't run into null-pointer exception (NPE) issues
        if (!resultantTimingData[roomKey]) {
          resultantTimingData[roomKey] = {};
        }
        if (!resultantTimingData[roomKey][numMessagesToFetch]) {
          resultantTimingData[roomKey][numMessagesToFetch] = [];
        }

        resultantTimingData[roomKey][numMessagesToFetch].push(timingEntry);
        console.log(`New result for ${roomKey} at ${numMessagesToFetch} messages:`, timingEntry);
      }

      // Wait for the caches to clear out before running another trial against the same room
      await new Promise((resolve) => {
        setTimeout(resolve, DELAY_BETWEEN_TRIALS);
      });
    }
  }
})()
```

```js
function getAvg(roomId, numMessages) {
  // `data1` is a pasted copy of `resultantTimingData` from a previous run.
  // Average only over the successful requests (the original divided by the
  // total entry count, which skews the average when some requests failed).
  const okEntries = data1[roomId][numMessages]
    .filter((timingEntry) => timingEntry.ok);
  return okEntries
    .reduce((total, timingEntry) => (total + timingEntry.timing), 0) / okEntries.length;
}

getAvg('matrix:matrix.org', "500")
```