Open erulabs opened 4 years ago
I am working on increasing the scalability of Synapse and have been investigating its runtime characteristics. I plan on writing a few PRs, but for now am just gathering information.
Awesome to hear this! If you have any questions please give a shout in #synapse-dev:matrix.org (and you might want to take a look at the contributing doc if you haven't already).
The main thing I'll point out is that smaller changes are likely to get reviewed and merged faster! 🚀
I wonder if calls to get_thread_resource_usage() shouldn't be disabled by default (or at least, only enabled when prometheus metrics are enabled).
This sounds like a reasonable thing to do (assuming those metrics aren't used in any other place).
I'll let others comment on the Docker bit since that's not something I know much about.
@erulabs Any chance you've made progress with this?
I started on a PR but things got quite messy. Still looking forward to getting to this - should be within the next two weeks.
I'm not entirely convinced this is a great plan. I can believe getrusage is a significant overhead, but otoh those metrics are absolutely essential for figuring out what Synapse is doing when it does start spinning CPU. (Also: note that they are not just used for prometheus metrics: they also end up in 'Processed' log lines.)
Arguably, if we are calling getrusage thousands of times, the real problem is simply that we are context-switching too much.
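To put a rough number on the per-call cost being debated, here is a quick micro-benchmark (my own illustration, not from the issue) that times repeated getrusage calls. RUSAGE_THREAD is Linux-only, so it falls back to RUSAGE_SELF on other platforms:

```python
import resource
import time

# RUSAGE_THREAD (the flag Synapse uses) is Linux-only; fall back elsewhere.
THREAD = getattr(resource, "RUSAGE_THREAD", resource.RUSAGE_SELF)

# Roughly the call count the issue reports for ~1k HTTP requests.
N = 40_000

start = time.perf_counter()
for _ in range(N):
    resource.getrusage(THREAD)
elapsed = time.perf_counter() - start

print(f"{N} getrusage calls in {elapsed:.4f}s ({elapsed / N * 1e6:.2f} us/call)")
```

On its own this says nothing about context switching, but it gives a baseline for how much of the observed 20% could be the syscall itself versus the surrounding bookkeeping.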
Description
Hello!
I am working on increasing the scalability of Synapse and have been investigating its runtime characteristics. I plan on writing a few PRs, but for now am just gathering information.
One interesting note is that Synapse 1.18 spends about 20% of its time calling getrusage. After about 1000 "sendto" calls (ie: about 1k HTTP requests), Synapse has called getrusage 40k times! Using strace -c -p PID_OF_SYNAPSE and passing a few basic calls (list users in room, incremental_sync, etc) recreated this time after time.

Here is some sample output from strace -c -p ...:

Note that getrusage is called ~10x more than accept4 (so on average, we do about 8 or 9 getrusage calls per http request).
I believe most of the calls come from the Measure() class here https://github.com/matrix-org/synapse/blob/9d1e4942ab728ebfe09ff9a63c66708ceaaf7591/synapse/util/metrics.py#L127
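A simplified sketch (my own illustration, not Synapse's actual Measure() code) of the pattern involved: a context manager that snapshots thread CPU usage on entry and exit so the delta can be attributed to the measured block. Every such entry/exit pair costs at least two getrusage syscalls, which is how the count multiplies per request:

```python
import resource

# RUSAGE_THREAD is Linux-only; fall back to RUSAGE_SELF elsewhere.
THREAD = getattr(resource, "RUSAGE_THREAD", resource.RUSAGE_SELF)

class MeasureSketch:
    """Toy version of a Measure()-style block: rusage delta across a with-block."""

    def __enter__(self):
        self._start = resource.getrusage(THREAD)  # syscall #1
        return self

    def __exit__(self, *exc):
        end = resource.getrusage(THREAD)          # syscall #2
        self.user_cpu = end.ru_utime - self._start.ru_utime
        self.sys_cpu = end.ru_stime - self._start.ru_stime
        return False

with MeasureSketch() as m:
    sum(i * i for i in range(100_000))  # stand-in for a measured request

print(f"user={m.user_cpu:.6f}s sys={m.sys_cpu:.6f}s")
```

If such blocks are nested several deep per request (servlet, database, caches), 8-9 getrusage calls per HTTP request falls out naturally.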
The stats generated by get_resource_usage() reveal that by far the vast majority of CPU time is spent in

{method="GET",servlet="SyncRestServlet",tag="incremental_sync"}

which is possibly reasonable - and not entirely surprising. The information from get_resource_usage() is probably not being used terribly often, as it appears to be reporting the vast majority of load in one known place.

I wonder if calls to get_thread_resource_usage() shouldn't be disabled by default (or at least, only enabled when prometheus metrics are enabled). This one line of code is responsible for the vast majority of syscalls Synapse makes: https://github.com/matrix-org/synapse/blob/931b02684481fb6b5daefd9218baf6a4b0b941f6/synapse/logging/context.py#L452

I'll try to update this issue if I can get some stats with getrusage disabled. I suspect there is also work to do in SyncRestServlet/incremental_sync, since it seems to be responsible for the vast majority of load. Note that the futex calls and errors are probably due to the FUSE filesystem mounting the media storage, which is a default of https://github.com/spantaleev/matrix-docker-ansible-deploy - I plan on removing that (in favor of the s3 python driver) and ideally that resolves the futex errors as well.
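The disable-by-default proposal could look something like the following hypothetical sketch (the config flag name is made up for illustration; the real change would read homeserver config):

```python
import resource

# RUSAGE_THREAD is Linux-only; fall back to RUSAGE_SELF elsewhere.
THREAD = getattr(resource, "RUSAGE_THREAD", resource.RUSAGE_SELF)

# Hypothetical flag -- imagine this being driven by whether metrics are enabled.
TRACK_THREAD_RESOURCE_USAGE = False

def get_thread_resource_usage():
    """Return a thread rusage snapshot, or None when tracking is disabled.

    When the flag is off, no syscall is made at all.
    """
    if not TRACK_THREAD_RESOURCE_USAGE:
        return None
    return resource.getrusage(THREAD)

print(get_thread_resource_usage())  # -> None while the flag is off
```

Every call site consuming the snapshot (including the 'Processed' log lines mentioned above) would need to tolerate the None case, which hints at why the patch touches more than one line.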
Steps to reproduce
Launch Synapse - use strace -c -p PID_OF_SYNAPSE_HOMESERVER to generate a table of syscalls.

Version information
matrix.nova.chat
server_version: "1.18.0"
python_version: "3.7.8"
Install method: https://github.com/spantaleev/matrix-docker-ansible-deploy
Platform: Linux / Ubuntu 18.04 / 4.15