matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Optimize /keys/query endpoint #14706

Open · clokep opened this issue 1 year ago

clokep commented 1 year ago

This endpoint seems to do some funny things (and the code is hard to follow, but that doesn't seem to be a performance issue). Note that this endpoint is supported by workers (and goes to the client_reader on matrix.org/Complement). See E2eKeysHandler.query_devices for the entry point.

Note that Synapse will call the same resync logic for a remote user's keys/devices due to a few different operations (receiving a to-device message from an unknown device, a /keys/query request in some situations, and possibly others).
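
For context, a minimal sketch of the client-side call that ends up in E2eKeysHandler.query_devices (the URL and body shape follow the Matrix client-server spec; the server name, user ID, and access token below are placeholders):

```python
# Hedged sketch of a /keys/query request, per the Matrix client-server spec.
# The homeserver, user ID, and token are placeholders, not from this issue.
import requests

resp = requests.post(
    "https://matrix.example.com/_matrix/client/v3/keys/query",
    headers={"Authorization": "Bearer <access_token>"},
    json={
        # An empty device list means "all devices for this user".
        "device_keys": {"@alice:example.com": []},
        # Milliseconds to wait for remote homeservers before giving up.
        "timeout": 10000,
    },
)
device_keys = resp.json().get("device_keys", {})
```

Queries for remote users are what trigger the federation/resync work discussed below.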


reivilibre commented 1 year ago

> Can we avoid bouncing this to the main process? (Can we have a different worker handle these queries?)

It doesn't seem like this would help us much at the moment: during the moments that we're having trouble, there isn't actually much CPU usage by other requests at all — it seems to me that moving it to another worker would just move the problem rather than improve performance. (not that getting it off the master is a bad thing, but it won't solve requests taking forever for M).

We seem to have a gap in CPU measurement, to be honest: at 100%+ CPU, the ReplicationUserDeviceResyncRestServlet is only listed at ~32%. Per-block metrics only add up to ~2%. (I think we must be under-reporting how much time is spent following replication streams, given that empirically it seems to account for at least 30% of CPU usage: an otherwise-quiet worker sits at roughly 30% from memory.)

clokep commented 1 year ago

> Can we avoid bouncing this to the main process? (Can we have a different worker handle these queries?)
>
> It doesn't seem like this would help us much at the moment: during the moments that we're having trouble, there isn't actually much CPU usage by other requests at all — it seems to me that moving it to another worker would just move the problem rather than improve performance. (not that getting it off the master is a bad thing, but it won't solve requests taking forever for M).

Oh, good catch! If nothing else is happening on the worker then I agree this wouldn't help much (unless we also let it be shard-able?)
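
If it were made shard-able, one possible shape (purely illustrative, not how Synapse currently routes these resyncs, and the worker names are hypothetical) would be to pick the worker deterministically from the remote user ID, similar in spirit to how sharded federation senders are assigned destinations:

```python
# Purely illustrative: deterministically pick one of N resync workers by
# hashing the remote user ID, so requests for the same user always land
# on the same worker. The worker names are hypothetical.
import binascii

RESYNC_WORKERS = ["device_resync1", "device_resync2", "device_resync3"]

def worker_for_user(user_id: str) -> str:
    # crc32 keeps the mapping stable across processes and restarts,
    # unlike Python's per-process randomized hash().
    index = binascii.crc32(user_id.encode("utf-8")) % len(RESYNC_WORKERS)
    return RESYNC_WORKERS[index]
```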

reivilibre commented 1 year ago

> unless we also let it be shard-able

Yep, we're in alignment there. But sharding it could take more time than we have before Christmas, so it's good to bear in mind, but we should also keep an eye out for alternatives.

reivilibre commented 1 year ago

> Can we parallelize the per-user queries (or add a new endpoint to make them more efficient)?

This sounds like it'd be good. Currently we make one /user/devices/%s federation request per user and I'm not sure there exist alternative endpoints which batch this up. Batching up the replication requests into one is quite easy though.
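
A minimal sketch of what the parallelism could look like, assuming a coroutine wrapping the existing per-user /user/devices/%s federation call (the names here are hypothetical; Synapse's own async helpers would likely be used instead of raw asyncio):

```python
# Hedged sketch: fan out the per-user device resyncs with a bounded level
# of concurrency, rather than issuing them strictly one at a time.
# `resync_user_devices` stands in for the existing per-user
# /user/devices/%s federation call; it is a hypothetical name.
import asyncio

MAX_CONCURRENT_RESYNCS = 10

async def resync_users(user_ids, resync_user_devices):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_RESYNCS)

    async def _one(user_id):
        async with semaphore:
            return user_id, await resync_user_devices(user_id)

    # return_exceptions=True so one failing remote user does not abort
    # the whole batch.
    return await asyncio.gather(
        *(_one(u) for u in user_ids), return_exceptions=True
    )
```

The concurrency cap is there so a single /keys/query for many remote users doesn't open an unbounded number of outbound federation requests at once.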

richvdh commented 1 year ago

> Can we avoid bouncing this to the main process? (Can we have a different worker handle these queries?)
>
> It doesn't seem like this would help us much at the moment: during the moments that we're having trouble, there isn't actually much CPU usage by other requests at all — it seems to me that moving it to another worker would just move the problem rather than improve performance. (not that getting it off the master is a bad thing, but it won't solve requests taking forever for M).

I wouldn't discount it so quickly. The master tends to be a bottleneck for everything, which means that (a) its CPU being pinned is particularly noticeable, and (b) doing lots of odd jobs rather than a few core ones makes it more prone to under-reporting of CPU usage.

> We seem to have a gap in CPU measurement, to be honest

Indeed. Generally anything low-level isn't well tracked; for example reading and parsing an HTTP request (we can't attribute that to a particular servlet, because we don't yet know which it will be).

> Per-block metrics only add up to ~2%

Per-block metrics aren't meant to add up to 100%. Most code in synapse isn't covered by one of the measurement blocks (so won't be counted by per-block metrics), and some measurement blocks are nested in others (so will be double-counted if you try to add them up).
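
As a toy illustration of the double-counting (this is not Synapse's actual Measure implementation, just a sketch of the nesting problem):

```python
# Toy illustration only: two nested timing blocks both record the time
# spent in the inner block, so summing per-block totals over-counts the
# real wall-clock time.
import time
from contextlib import contextmanager

block_totals = {}

@contextmanager
def measure(name):
    start = time.monotonic()
    try:
        yield
    finally:
        block_totals[name] = block_totals.get(name, 0.0) + (
            time.monotonic() - start
        )

with measure("outer"):
    with measure("inner"):
        time.sleep(0.1)

# block_totals["outer"] and block_totals["inner"] are each ~0.1s,
# so adding them up reports ~0.2s for ~0.1s of actual work.
print(block_totals)
```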

> (I think we must be under-reporting how much time is spent following replication streams, given that empirically it seems to account for at least 30% of CPU usage: an otherwise-quiet worker sits at roughly 30% from memory.)

Yes, if I remember rightly, a lot of the low-level Redis stuff isn't tracked in the CPU metrics.