"enable_outbound" goes out of control and pegs the CPU

IngwiePhoenix commented 1 year ago

Background information

Dendrite version or git SHA: 0.10.8+76db8e90
Monolith or Polylith?: Monolith
SQLite3 or Postgres?: Postgres
Running in Docker?: No
go version: 1.19.3
Client used (if applicable): Element, Cinny

Description

What is the problem: When presence.enable_outbound is enabled, my VPS goes under complete load, showing an "Average load" of 40. In htop, this means all cores are nothing but red bars. That is, if my SSH connection even survives...
Who is affected: My server (and my VPS' hypervisor, probably)
How is this bug manifesting: presence.enable_outbound: true shows a lot of m.presence events being sent. My guess: There is no rate limiting in place, meaning it will just infinitly barf them out - and, depending on how many servers/rooms are involved, this can easily become a lot. I am not in #matrix:matrix.org, but only in a few small channels - the Dendrite channels amongst them, IPFS, a few old Gitter ones and Telegram 1:1s.
When did this first appear: Today. I was moving between verrsions and upgraded from 0.10.6 to .7 and then .8. Once the bug manifested, I changed to building from main in hopes of this being fixed...Well, it's not.

Steps to reproduce

git clone ...
git checkout tags/v0.10.8
mv .git .gitx # Works around a broken build
./build.sh
mv .gitx .git
nano .../dendrite.yaml # set presence.enable_outbound: true
systemctl start matrix-dendrite # I use a SystemD unit

Ultimatively, I'd like to be able to see some performance stats logged to tell what exactly is making my server go absolutely crazy. For now, I have this disabled and it seems fine. But for better or for worse, I would like to enable it eventually.

Thanks and kind regards, Ingwie

spaetz commented 1 year ago

Fortunately dendrite has a profiler built in: https://github.com/matrix-org/dendrite/blob/main/docs/development/PROFILING.md and it is pretty decent in helping you to find bottle necks.

P.S. If you use a systemd service to start dendrite, enable the profile by adding the Environment var to the systemd service file:

[Service]
Environment="PPROFLISTEN=localhost:65432"
...

IngwiePhoenix commented 1 year ago

Thanks, will do! Maybe this will help me find out what the issue is. Never worked with pprof, best time as ever to try it out :)

IngwiePhoenix commented 1 year ago

pprof.pb.gz

Gottem :)

The first pprof I downloaded was only 12kb, this one is almost 10x as large, so something definitively happened. This profiling output is from having turned presence.outboundenable_outbound: true. And once I had set it, just as expected, the process went to peg the CPU immediately.

However, looking at the profile from the webUI, I am not sure what the root cause of this is. That is why I am posting it here, maybe someone more versed in Go can spot the issue. :)

Thanks and kind regards, Ingwie

spaetz commented 1 year ago

3.2sec out of 32sec are spent in calls RemoveServerFromBlacklist which seems quite a lot.

But the majority of time is spent in:

pq (*conn) auth 8.58 secs whereof nearly 7 secs are spent calculating sha256 when salting passwords. So I would posit that opening database connections to your postgres seems weirdly expensive in your case. These calls are part of (Database).CleanEDUs. Is there anything special about your postgres connection?

That having said, this are just ideas I get when eyeballing the flame graph, and not based on any knowledge of the code.... Also pq write seems similarly expensiv...

IngwiePhoenix commented 1 year ago

Thanks! I'll go probe my postgres config and see if there is anything I can optimize just in case. That said, I never even created a server blacklist?...

spaetz commented 1 year ago

No, servers are blacklisted when they go offline, present invlid keys etc... So that is pretty much standard operation, but taking quite a bit of time. Is the database in the same vserver and are you connecting to it using a user/password combination?

We should never spend nearly 30% of all time trying to authenticate ourselves to the db.

IngwiePhoenix commented 1 year ago

Yep - its all on the same server.

root@birb:~# ps -ef | grep dendrite
dendrite 1196414       1  7 06:17 ?        00:19:05 /opt/matrix-dendrite/bin/dendrite-monolith-server -config /srv/matrix-dendrite/dendrite.yaml
postgres 1346792   54236  0 10:29 ?        00:00:09 postgres: 15/main: dendrite dendrite ::1(55504) idle
postgres 1352561   54236  0 10:44 ?        00:00:01 postgres: 15/main: dendrite dendrite ::1(51124) idle
postgres 1353244   54236  0 10:45 ?        00:00:01 postgres: 15/main: dendrite dendrite ::1(53560) idle
postgres 1353245   54236  0 10:45 ?        00:00:00 postgres: 15/main: dendrite dendrite ::1(53562) idle
postgres 1354267   54236  3 10:48 ?        00:00:00 postgres: 15/main: dendrite dendrite ::1(57206) idle
root     1354660 1354274  0 10:48 pts/0    00:00:00 grep dendrite

S7evinK commented 1 year ago

Good find @spaetz Currently when cleaning EDUs we are querying every EDU NID separately, which is X queries for X send EDUs, and that's not really that great.

To possibly avoid salting passwords that much, you can try using postgres://userName@/dbName?host=/run/postgresql as the connection string, if your Dendrite has access to the unix socket Postgres is listening on (if it does)

IngwiePhoenix commented 1 year ago

I completely forgot Postgres has UNIX domain socket support. I'll check and see if that helps and let you know once I did. :)

IngwiePhoenix commented 1 year ago

I tried it.

It took somewhat longer but eventually the same result would happen. I haven't produced a profiler dump yet though, will do so later. But hey, a slower ramp is still something! :) Although it leads to the same result unfortunately.

matrix-org / dendrite