matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Optimize /keys/query endpoint #14706

Open · clokep opened this issue 1 year ago

clokep commented 1 year ago

This endpoint seems to do some funny things (and the code is hard to follow, but that doesn't seem to be a performance issue). Note that this endpoint is supported by workers (and goes to the client_reader on matrix.org/Complement). See E2eKeysHandler.query_devices for the entry point.

Note that Synapse will call the same resync logic for a remote user's keys/devices due to a few different operations (receiving a to-device message from an unknown device, a /keys/query request in some situations, and possibly others).
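
For context, a minimal sketch of the client-side call that ends up in E2eKeysHandler.query_devices (the URL and body shape follow the Matrix client-server spec; the server name, user ID, and access token below are placeholders):

```python
# Hedged sketch of a /keys/query request, per the Matrix client-server spec.
# The homeserver, user ID, and token are placeholders, not from this issue.
import requests

resp = requests.post(
    "https://matrix.example.com/_matrix/client/v3/keys/query",
    headers={"Authorization": "Bearer <access_token>"},
    json={
        # An empty device list means "all devices for this user".
        "device_keys": {"@alice:example.com": []},
        # Milliseconds to wait for remote homeservers before giving up.
        "timeout": 10000,
    },
)
device_keys = resp.json().get("device_keys", {})
```

Queries for remote users are what trigger the federation/resync work discussed below.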


reivilibre commented 1 year ago

> Can we avoid bouncing this to the main process? (Can we have a different worker handle these queries?)

It doesn't seem like this would help us much at the moment: during the moments that we're having trouble, there isn't actually much CPU usage by other requests at all — it seems to me that moving it to another worker would just move the problem rather than improve performance. (not that getting it off the master is a bad thing, but it won't solve requests taking forever for M).

We seem to have a gap in CPU measurement, to be honest: at 100%+ CPU, the ReplicationUserDeviceResyncRestServlet is only listed at ~32%. Per-block metrics only add up to ~2%. (I think we must be under-reporting how much time is spent following replication streams, given that empirically it seems to account for at least 30% of CPU usage: an otherwise-quiet worker sits at roughly 30% from memory.)

clokep commented 1 year ago

> Can we avoid bouncing this to the main process? (Can we have a different worker handle these queries?)
>
> It doesn't seem like this would help us much at the moment: during the moments that we're having trouble, there isn't actually much CPU usage by other requests at all — it seems to me that moving it to another worker would just move the problem rather than improve performance. (not that getting it off the master is a bad thing, but it won't solve requests taking forever for M).

Oh, good catch! If nothing else is happening on the worker then I agree this wouldn't help much (unless we also let it be shard-able?)
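
If it were made shard-able, one possible shape (purely illustrative, not how Synapse currently routes these resyncs, and the worker names are hypothetical) would be to pick the worker deterministically from the remote user ID, similar in spirit to how sharded federation senders are assigned destinations:

```python
# Purely illustrative: deterministically pick one of N resync workers by
# hashing the remote user ID, so requests for the same user always land
# on the same worker. The worker names are hypothetical.
import binascii

RESYNC_WORKERS = ["device_resync1", "device_resync2", "device_resync3"]

def worker_for_user(user_id: str) -> str:
    # crc32 keeps the mapping stable across processes and restarts,
    # unlike Python's per-process randomized hash().
    index = binascii.crc32(user_id.encode("utf-8")) % len(RESYNC_WORKERS)
    return RESYNC_WORKERS[index]
```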

reivilibre commented 1 year ago

> unless we also let it be shard-able

Yep, we're in alignment there. But sharding it could take more time than we have before Christmas, so it's good to bear in mind, but we should also keep an eye out for alternatives.

reivilibre commented 1 year ago

> Can we parallelize the per-user queries (or add a new endpoint to make them more efficient)?

This sounds like it'd be good. Currently we make one /user/devices/%s federation request per user and I'm not sure there exist alternative endpoints which batch this up. Batching up the replication requests into one is quite easy though.
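
A minimal sketch of what the parallelism could look like, assuming a coroutine wrapping the existing per-user /user/devices/%s federation call (the names here are hypothetical; Synapse's own async helpers would likely be used instead of raw asyncio):

```python
# Hedged sketch: fan out the per-user device resyncs with a bounded level
# of concurrency, rather than issuing them strictly one at a time.
# `resync_user_devices` stands in for the existing per-user
# /user/devices/%s federation call; it is a hypothetical name.
import asyncio

MAX_CONCURRENT_RESYNCS = 10

async def resync_users(user_ids, resync_user_devices):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_RESYNCS)

    async def _one(user_id):
        async with semaphore:
            return user_id, await resync_user_devices(user_id)

    # return_exceptions=True so one failing remote user does not abort
    # the whole batch.
    return await asyncio.gather(
        *(_one(u) for u in user_ids), return_exceptions=True
    )
```

The concurrency cap is there so a single /keys/query for many remote users doesn't open an unbounded number of outbound federation requests at once.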

richvdh commented 1 year ago

> Can we avoid bouncing this to the main process? (Can we have a different worker handle these queries?)
>
> It doesn't seem like this would help us much at the moment: during the moments that we're having trouble, there isn't actually much CPU usage by other requests at all — it seems to me that moving it to another worker would just move the problem rather than improve performance. (not that getting it off the master is a bad thing, but it won't solve requests taking forever for M).

I wouldn't discount it so quickly. The master tends to be a bottleneck for everything, which means that (a) its CPU being pinned is particularly noticeable, and (b) doing lots of odd jobs rather than a few core ones makes it more prone to under-reporting of CPU usage.

> We seem to have a gap in CPU measurement, to be honest

Indeed. Generally anything low-level isn't well tracked; for example reading and parsing an HTTP request (we can't attribute that to a particular servlet, because we don't yet know which it will be).

> Per-block metrics only add up to ~2%

Per-block metrics aren't meant to add up to 100%. Most code in synapse isn't covered by one of the measurement blocks (so won't be counted by per-block metrics), and some measurement blocks are nested in others (so will be double-counted if you try to add them up).
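
As a toy illustration of the double-counting (this is not Synapse's actual Measure implementation, just a sketch of the nesting problem):

```python
# Toy illustration only: two nested timing blocks both record the time
# spent in the inner block, so summing per-block totals over-counts the
# real wall-clock time.
import time
from contextlib import contextmanager

block_totals = {}

@contextmanager
def measure(name):
    start = time.monotonic()
    try:
        yield
    finally:
        block_totals[name] = block_totals.get(name, 0.0) + (
            time.monotonic() - start
        )

with measure("outer"):
    with measure("inner"):
        time.sleep(0.1)

# block_totals["outer"] and block_totals["inner"] are each ~0.1s,
# so adding them up reports ~0.2s for ~0.1s of actual work.
print(block_totals)
```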

> (I think we must be under-reporting how much time is spent following replication streams, given that empirically it seems to account for at least 30% of CPU usage: an otherwise-quiet worker sits at roughly 30% from memory.)

Yes, if I remember rightly, a lot of the low-level Redis stuff isn't tracked in the CPU metrics.