IngwiePhoenix opened 1 year ago
Fortunately, Dendrite has a profiler built in: https://github.com/matrix-org/dendrite/blob/main/docs/development/PROFILING.md and it is pretty decent at helping you find bottlenecks.
P.S. If you use a systemd service to start Dendrite, enable the profiler by adding the environment variable to the systemd service file:

```ini
[Service]
Environment="PPROFLISTEN=localhost:65432"
...
```
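Once that is set and the service restarted, profiles can be pulled from the usual net/http/pprof endpoints. Below is a minimal sketch of how such an env-var-gated pprof endpoint is typically wired up in Go; it is illustrative only, not Dendrite's exact code:

```go
// A minimal sketch of an env-var-gated pprof endpoint; illustrative,
// not Dendrite's actual implementation.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
	"os"
)

func main() {
	if addr := os.Getenv("PPROFLISTEN"); addr != "" {
		// With PPROFLISTEN=localhost:65432, a 30-second CPU profile can
		// then be pulled with:
		//   go tool pprof http://localhost:65432/debug/pprof/profile?seconds=30
		go func() {
			log.Println(http.ListenAndServe(addr, nil))
		}()
	}
	select {} // stand-in for the actual server's main loop
}
```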
Thanks, will do! Maybe this will help me find out what the issue is. I've never worked with pprof; no better time than now to try it out :)
Gottem :)
The first pprof I downloaded was only 12kb; this one is almost 10x as large, so something definitely happened. This profiling output is from having turned on `presence.enable_outbound: true`. And once I had set it, just as expected, the process immediately pegged the CPU.
However, looking at the profile in the web UI, I am not sure what the root cause is. That is why I am posting it here; maybe someone more versed in Go can spot the issue. :)
Thanks and kind regards, Ingwie
3.2 sec out of 32 sec are spent in calls to RemoveServerFromBlacklist, which seems like quite a lot.
But the majority of time is spent in:
pq (*conn).auth: 8.58 secs, of which nearly 7 secs are spent calculating sha256 while salting passwords. So I would posit that opening database connections to your postgres is weirdly expensive in your case. These calls are part of (Database).CleanEDUs. Is there anything special about your postgres connection?
That said, these are just ideas I get from eyeballing the flame graph, not based on any knowledge of the code... Also, pq write seems similarly expensive...
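If the sha256 cost really comes from re-authenticating on every fresh connection, one generic mitigation is to make sure the connection pool actually reuses connections. A minimal sketch using database/sql's pool knobs follows; the numbers are illustrative, and this is plain lib/pq usage, not Dendrite's code:

```go
// A minimal sketch, assuming lib/pq and database/sql: every brand-new
// connection pays the (*conn).auth handshake seen in the profile, so
// keeping idle connections alive avoids re-hashing the password.
package main

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq"
)

func openPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(10)                  // cap concurrent connections
	db.SetMaxIdleConns(10)                  // keep them around between queries
	db.SetConnMaxLifetime(30 * time.Minute) // recycle occasionally, not per query
	return db, nil
}
```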
Thanks! I'll go probe my postgres config and see if there is anything I can optimize, just in case. That said, I never even created a server blacklist?...
No, servers are blacklisted when they go offline, present invalid keys, etc. So that is pretty much standard operation, but it is taking quite a bit of time. Is the database on the same vserver, and are you connecting to it using a user/password combination?
We should never spend nearly 30% of all time trying to authenticate ourselves to the db.
Yep, it's all on the same server.
```
root@birb:~# ps -ef | grep dendrite
dendrite 1196414       1  7 06:17 ?        00:19:05 /opt/matrix-dendrite/bin/dendrite-monolith-server -config /srv/matrix-dendrite/dendrite.yaml
postgres 1346792   54236  0 10:29 ?        00:00:09 postgres: 15/main: dendrite dendrite ::1(55504) idle
postgres 1352561   54236  0 10:44 ?        00:00:01 postgres: 15/main: dendrite dendrite ::1(51124) idle
postgres 1353244   54236  0 10:45 ?        00:00:01 postgres: 15/main: dendrite dendrite ::1(53560) idle
postgres 1353245   54236  0 10:45 ?        00:00:00 postgres: 15/main: dendrite dendrite ::1(53562) idle
postgres 1354267   54236  3 10:48 ?        00:00:00 postgres: 15/main: dendrite dendrite ::1(57206) idle
root     1354660 1354274  0 10:48 pts/0    00:00:00 grep dendrite
```
Good find @spaetz! Currently, when cleaning EDUs, we are querying every EDU NID separately, which is X queries for X sent EDUs, and that's not really that great.
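For illustration, those X queries could collapse into a single round trip using Postgres' ANY(). A minimal sketch; the table and column names (queue_edus, edu_nid) are hypothetical, not Dendrite's actual schema:

```go
// A minimal sketch of batching the EDU cleanup into one query instead
// of one DELETE per NID. Table/column names are hypothetical.
package main

import (
	"context"
	"database/sql"

	"github.com/lib/pq"
)

// deleteEDUsBatched removes all given EDU NIDs in a single round trip.
func deleteEDUsBatched(ctx context.Context, db *sql.DB, nids []int64) error {
	_, err := db.ExecContext(ctx,
		"DELETE FROM queue_edus WHERE edu_nid = ANY($1)",
		pq.Array(nids),
	)
	return err
}
```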
To possibly avoid salting passwords that much, you can try using `postgres://userName@/dbName?host=/run/postgresql` as the connection string, if your Dendrite has access to the unix socket Postgres is listening on.
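A quick way to sanity-check that DSN outside Dendrite is a tiny Go program. The `dendrite` user/database names and the `/run/postgresql` path below are assumptions matching this thread and the Debian/Ubuntu default socket directory:

```go
// A tiny check that the unix-socket DSN works; user, database, and
// socket path are assumptions and may differ on other systems.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Peer auth over the socket skips the password handshake (and its
	// sha256 salting) entirely.
	db, err := sql.Open("postgres", "postgres://dendrite@/dendrite?host=/run/postgresql")
	if err != nil {
		log.Fatal(err)
	}
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected over the unix socket")
}
```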
I completely forgot Postgres has UNIX domain socket support. I'll check whether that helps and let you know once I have. :)
I tried it.
It took somewhat longer, but eventually the same result happened. I haven't produced a profiler dump yet though; I will do so later. But hey, a slower ramp-up is still something! :) Although it leads to the same result, unfortunately.
Background information

go version: 1.19.3

Description

When `presence.enable_outbound` is enabled, my VPS goes under complete load, showing an "Average load" of 40. In `htop`, this means all cores are nothing but red bars. That is, if my SSH connection even survives... Running with `presence.enable_outbound: true` shows a lot of `m.presence` events being sent. My guess: there is no rate limiting in place, meaning it will just infinitely barf them out, and depending on how many servers/rooms are involved, this can easily become a lot. I am not in `#matrix:matrix.org`, but only in a few small channels: the Dendrite channels amongst them, IPFS, a few old Gitter ones, and Telegram 1:1s. I updated to `main` in hopes of this being fixed... Well, it's not.

Steps to reproduce

Enable `presence.enable_outbound: true`, restart Dendrite, and watch the load climb.
Ultimately, I'd like to be able to see some performance stats logged to tell what exactly is making my server go absolutely crazy. For now, I have this disabled and it seems fine. But for better or worse, I would like to enable it eventually.
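For illustration of the rate-limiting guess above, a token bucket in front of the outbound presence sends would look roughly like the sketch below; the limits and the send hook are hypothetical, not anything Dendrite actually implements:

```go
// A minimal sketch of the rate limiting the report guesses is missing:
// a token bucket throttling outbound m.presence sends. Limits and the
// send hook are hypothetical.
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// Allow a burst of 10 presence sends, refilling at 5 per second.
var presenceLimiter = rate.NewLimiter(rate.Limit(5), 10)

func sendPresenceLimited(ctx context.Context, send func() error) error {
	// Wait blocks until a token is free, spreading sends out instead of
	// emitting them as fast as the CPU allows.
	if err := presenceLimiter.Wait(ctx); err != nil {
		return err
	}
	return send()
}
```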
Thanks and kind regards, Ingwie