Thanks for the report and the Prometheus screenshots.
For the suspicious change in CPU usage, it'd be worth looking at the broken-down CPU graphs to see if there's anything specific. What do your "per-block metrics" and "requests" sections look like? Is there any particular part of the application there that looks to be suddenly churning up CPU?
It might also be instructive to see the "databases" section too.
From the graphs and the dates you give, it looks like there might be a performance regression in 1.57. To confirm this, can you roll back to 1.56 to see if performance improves?
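Assuming the matrix.org apt packages (the reporter mentions an Ubuntu install), a rollback sketch could look like the following; the exact version suffix is illustrative, so list the available versions first:

```sh
# List the versions of the matrix.org package available from the repo.
apt list -a matrix-synapse-py3
# Pin back to 1.56; copy the exact version string from the list above.
sudo apt install matrix-synapse-py3=1.56.0+focal1
```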
Per-block metrics (it looks like state resolution):
Requests:
Databases:
Possibly the same underlying cause as #12547?
I tried to roll back and now it won't start: `Cannot use this database as it is too new for the server to understand`.
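That error comes from Synapse's schema compatibility check: the newer version bumped the schema past what 1.56 understands. A quick way to see what the database reports, assuming it is named `synapse` (a sketch):

```sh
# The schema version the database is at, and the oldest schema version
# a server must support in order to use it.
sudo -u postgres psql synapse -c 'select * from schema_version;'
sudo -u postgres psql synapse -c 'select * from schema_compat_version;'
```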
Was the server in use between the 28th of April and the 1st of May? The graphs show surprisingly little activity.
Yes, as far as I know. Even if not actively in use, it should have been present in some federated rooms.
What exactly is this GC doing?
I see Synapse in the `D` state (waiting for I/O), but it doesn't seem to be touching the disk.
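A minimal sketch of one way to see what a `D`-state process is blocked on in the kernel; the process-matching pattern is an assumption:

```sh
# Show which kernel function the process is blocked in (WCHAN), then
# dump its kernel stack; the latter needs root.
pid=$(pgrep -f synapse.app.homeserver | head -n1)
ps -o pid,stat,wchan:32,cmd -p "$pid"
sudo cat "/proc/$pid/stack"
```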
Yeah, this does look very similar to #12547.
Hey, in case you don't catch my response in the other issue, did you want to try disabling federation temporarily and see if the issues go away? If they do, you're most likely looking at the same problem as me.
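For reference, one common way to do this is to drop `federation` from the HTTP listener's resources in `homeserver.yaml`; a minimal sketch, assuming the default single-listener layout (your listener config may differ):

```yaml
# homeserver.yaml sketch: removing "federation" from the listener's
# resource names stops serving federation traffic on this port.
listeners:
  - port: 8008
    type: http
    tls: false
    x_forwarded: true
    resources:
      - names: [client]   # was: [client, federation]
```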
I can't really disable federation, but I'm pretty sure it would fix it.
If it's the same as the issue I'm having, you'll get to a point where your server is basically unusable unless you do. (It took about a month to get there.)
Yeah, it does this every couple of minutes, so it's already unusable.
@prurigro it seems you were on to something.
Well, this is embarrassing. I had registration open on my server for about two weeks, and I ended up with 3810 spam accounts. They all follow one of these two naming conventions:

- `spambot`, followed by 10 random letters and digits
- 10 random letters and digits alone

I'm not sure what they were doing (besides annoying people). I ran the following query to check the rooms they've joined:
```sql
select room_id, name, canonical_alias, count(*) as count
from state_groups_state
inner join room_stats_state using (room_id)
group by room_id, name, canonical_alias
order by count desc;
```
I think these are the more interesting ones:
```
                     room_id                     |          name         |          canonical_alias          |  count
-------------------------------------------------+-----------------------+-----------------------------------+----------
 !twohoPqivntpjGWCJZ:matrix.org                  | My Milf Waifu . com   |                                   | 39882983
 !naXxpfvVqaRoeCbiug:matrix.org                  | [redacted, offensive] | #[redacted]:matrix.org            | 15630320
 !AZiuodkxUdoGQVoeUX:matrix.org                  | dfg                   | #bogaggooa:matrix.org             |  3774643
 !vRGLvqJYlFvzpThbxI:matrix.org                  | Furry Tech            | #furrytech:matrix.org             |  2011859
 !ehXvUhWNASUkSLvAGP:matrix.org                  |                       |                                   |   151291
 !rWyGejKmuZJaExbUOf:matrix.org                  | 111                   | #sneed1:matrix.org                |   145619
 !rjNeouFqUBGzexdVMc:g33k.se                     | Otaku [OLD]           |                                   |    22250
 !iCrqcrLOzJUNcyHTjR:matrix.kharkiv.dcomm.net.ua | Новини                | #news:matrix.kharkiv.dcomm.net.ua |    20175
 !WlOOWOsGpBBJNXqwSY:stopdronebl.org             |                       |                                   |    14037
```
Note the huge number of events in the first one.
I tried to join most of them, but either it didn't work or they don't exist any more.
I deactivated the spam accounts (`curl -XPOST -H 'Authorization: Bearer REDACTED' -H 'Content-Type: application/json' -d '{ "erase": false }' http://localhost:8008/_synapse/admin/v1/deactivate/USERNAME`), using `xargs` to run 100 of these in parallel. It was pretty slow, but finished in an hour or two.
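Roughly how that fan-out looks as a sketch; `spam_users.txt` (one fully-qualified user ID per line) and the token are assumptions:

```sh
# Deactivate every account listed in spam_users.txt, 100 curls at a time.
xargs -a spam_users.txt -P 100 -I{} \
  curl -s -o /dev/null -XPOST \
    -H 'Authorization: Bearer REDACTED' \
    -H 'Content-Type: application/json' \
    -d '{ "erase": false }' \
    "http://localhost:8008/_synapse/admin/v1/deactivate/{}"
```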
I then tried to also erase their data (same API call, but with `erase` set to `true`), but even with only 10 calls in parallel, these never finish. The server is still unresponsive. Looking at the SQL that's running, the queries I've seen were fine (they had indexes and ran quickly). It's just spending time in Synapse for some reason, maybe some kind of "N + 1 selects" problem, or just swapping (I just added an extra 4 GB of swap).
My suspicion is that it will work fine if I purge the two larger rooms. I'm not doing it yet because I'd rather try to completely deactivate the accounts first, and maybe one of the developers would like to investigate the slow deletion.
I'm now trying to purge the largest room, but nothing happens.
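For reference, the kind of call involved, assuming the v2 (asynchronous) delete room admin API and a placeholder token:

```sh
# Kick off an async purge of the largest room from the table above.
curl -XDELETE \
  -H 'Authorization: Bearer REDACTED' \
  -H 'Content-Type: application/json' \
  -d '{ "purge": true }' \
  'http://localhost:8008/_synapse/admin/v2/rooms/!twohoPqivntpjGWCJZ:matrix.org'

# Poll the deletion's status while it runs.
curl -H 'Authorization: Bearer REDACTED' \
  'http://localhost:8008/_synapse/admin/v2/rooms/!twohoPqivntpjGWCJZ:matrix.org/delete_status'
```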
Deleting the `!vRGLvqJYlFvzpThbxI:matrix.org` room took 734 minutes and kicked 124 users. I thought `deactivate` would have removed them from the room. I'm still trying to delete the spam users, but it has only managed to get through 18 of them.
> I tried to roll back and now it won't start: `Cannot use this database as it is too new for the server to understand`.
I'm also seeing this and can't seem to get my Synapse instance back even if I roll back my DB and my Synapse version. Any ideas on getting this working again?
@lnicola OK yeah, I'm thinking this pretty well confirms malicious bots. I'd had open registration for years without really being noticed, but it seems like I ended up on some list of servers at some point and new registrations started to happen. I only hit around 300 accounts total and know for sure that about 30 were real, the majority of the others weren't in any channels so I'm not sure if they were parked for later use or were there to make finding the active ones more difficult.
I noticed my domain had actually been banned from the archlinux.org Matrix server, so I assume whatever they're doing includes interacting with other servers. I ended up wiping the database and starting fresh with my top-level domain (which hasn't been banned anywhere), and this time registrations are closed. Same configs, same server, same real people, and everything is lightning fast now. It's kind of a shame we lost all that history, but what can you do.
One other thing I noticed when deploying a fresh install on the old domain was that I immediately started getting federation activity from dozens of servers, most of which were getting rejected due to not having a proper certificate.
Update: trying to delete `!AZiuodkxUdoGQVoeUX:matrix.org` returned an `Internal server error` after 1814 minutes. I'm still trying to delete the account data (with `erase`), but it only gets through ~~10 or 20~~ 80 a day or so.
I tried `py-spy`, but it doesn't say much:
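A sketch of typical py-spy usage against a running homeserver; the process-matching pattern is an assumption:

```sh
# Grab a one-off thread dump, then record a 60-second flame graph.
pid=$(pgrep -f synapse.app.homeserver | head -n1)
sudo py-spy dump --pid "$pid"
sudo py-spy record --pid "$pid" --duration 60 -o synapse-profile.svg
```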
By default, py-spy collects 100 samples per second, i.e. each sample is 10 ms apart. There are 6,543 samples in the database thread there and 6,643 samples in `_maybe_gc`, which is an unusual amount of time to be spending in those two places.
That's right, I'm seeing gen 2 GC times on the order of minutes:
That's also visible in one of the screenshots in the issue description.
@babolivier sorry for the ping, but I saw your comments in #12778 and #12788.
I'm also having size-related eviction rates that seem quite high for `get_current_state_ids` and `stateGroupsMemberCache`:
(TL;DR of the issue is that some 2000 spam accounts joined two large rooms. I've deactivated the accounts, but I'm having trouble deleting the rooms.)
I did increase `caches.global_factor` to `10`, but it doesn't seem to help. Can Synapse reload that parameter at run-time (e.g. on `systemctl reload matrix-synapse`)? Should I bump it more?
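For reference, a sketch of the relevant `homeserver.yaml` stanza; judging by the edit below, it only takes effect on a full restart, not a reload:

```yaml
# homeserver.yaml sketch: multiply every cache's size by 10.
caches:
  global_factor: 10
```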
EDIT: yeah, restarting Synapse appears to have increased the cache size:
Another update: I added more memory and Synapse managed to delete the users from the largest room. It was quite slow, even with `force_purge`. Right now, the room is empty, and Synapse is removing its users from `users_in_public_rooms`, including the ones from other servers. This happens one user at a time and is quite slow (1-2 per second). With around 141,192 users there, this might take one more day or two.
It finally finished, but left a lot of junk in `current_state_delta_stream`, `state_groups` and `state_groups_state`. Performance is fine again.
Since the issue was determined to be spambots and not a bug, I am going to go ahead and close this.
@H-Shay I still think the slow account deactivation is a real issue.
No problem, would you be willing to open a new issue specific to the slow account deactivation?
Description
After upgrading my HS, everything got very slow. Loading or uploading images and opening the room list take a couple of minutes. Every so often, Element complains it can't reach the server.
Logs don't tell much:
I do have Prometheus set up, but I'm not sure what to look for and the historical data is mostly gone.
I upgraded from 1.56 to 1.57 on the 2nd of May, and to 1.58 on the 4th of May.
Version information
- Version: `{"server_version":"1.58.1","python_version":"3.8.10"}` (this took a minute to respond the first time)
- Install method: Ubuntu package
- Platform: Ubuntu 20.04, 3 vCPUs (Hetzner CPX21), 4 GB RAM, 4 GB swap, PostgreSQL 14.3