matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

My HS has accumulated thousands of unreferenced state groups #3364

Open richvdh opened 6 years ago

richvdh commented 6 years ago

... which are filling up my disk :(


To check if you are also affected, run this query:

select count(*) from state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;

if you see numbers in the thousands, then it is this issue. Otherwise, you're not affected.

richvdh commented 6 years ago

worse, when I go to purge history, the unreferenced state groups start turning into non-delta state groups, which compounds the problem.

richvdh commented 6 years ago

query to collect the redundant state groups:

create temporary table unreferenced_state_groups as 
select sg.id, sg.room_id from
    state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;

(empirically most of them seem to be coming from HQ)

krombel commented 6 years ago

I seem to have them as well. I created that table on my system and got the following result:

synapse=# SELECT COUNT(*) FROM state_groups_to_drop;
 count
-------
  2272
(1 row)

Just to note: I have not run any purge commands yet

richvdh commented 6 years ago

#3625 might be related

richvdh commented 6 years ago

@neil I don't think this can be a P2; it's a real blocker on cleaning up disk space

richvdh commented 6 years ago

I've done a little bit of digging into why this happens. Other than #3625, another cause (which probably bites matrix.org heavily, but others less so) is #3791.

richvdh commented 6 years ago

Another occasion when (I think) this happens is when we have a fork in the DAG, with different state on the two sides of the fork, and the next event (which heals the fork) is itself another state event. We create a new state group when we state-resolve the two sides of the fork (which is important for caching state res), but that SG is never actually (directly) used, because we then create another SG to include the updated state.

sargon commented 6 years ago

We have a rather big, disk-filling database too (~45G) and ~40 users. We started to purge the history some time ago, monthly, so the db should contain only the data of the last 365 days with a slack of 30 days. I was curious how many tuples in state_groups_state would be affected in our database, so I extended your temporary table query a little bit:

create temporary table unreferenced_state_groups as 
select sg.id, sg.room_id, count(sgs.*) as cgs_cnt from
    state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
    left join state_groups_state sgs on sgs.state_group = sg.id
where esg.state_group is null and e.prev_state_group is null
group by sg.id;
select sum(cgs_cnt) from unreferenced_state_groups;

Which resulted in 1,388,475 affected tuples, which is kind of nothing compared to the 84,141,600 tuples in the table. So this is definitely a thing, but my guess is that we have other waste in that database, or is this a "normal/to be expected" size?

ara4n commented 6 years ago

we’re investigating this with https://github.com/erikjohnston/rust-synapse-compress-state (not yet ready for use) which looks to also provide decent size savings in general.

sargon commented 6 years ago

I see ...

Number of entries: 345193
Number of rows: 70955217
Number of rows compressed: 2375250

Okay. If we assume we had those optimizations, we would have roughly 16,000,000 tuples in that table; at that point the affected tuples above carry real weight.

ara4n commented 6 years ago

right, so that sounds like it can be shrunk by 30x, if i'm reading it correctly and if the tool is correct. hopefully we will have something that can apply the change in the near future.

richvdh commented 6 years ago

@ara4n: please can we keep chat about how we might better compress state groups (once they exist) out of this bug, which is about state groups which shouldn't really exist in the first place? Indeed, if we apply the optimisations suggested in erik's script without addressing this bug, the problem will get even worse, because we'll have even more orphaned state groups which need de-deltaing.

richvdh commented 6 years ago

@sargon:

Which resulted in 1,388,475 affected tuples, which is kind of nothing compared to the 84,141,600 tuples in the table.

Those 1.3M tuples will just be the deltas from the previous state groups - probably only one or two rows per state group. The problem comes when a state group is removed, which means that any other state group which references it will have to be converted from delta storage to absolutes - ie, we will have to store every single state event for the room for each of those state groups.

Suppose we have three state groups in a room, 1, 2, and 3. 1 is the first state group, and 2 and 3 are both stored as deltas from 1:

  1
 /  \
2    3

SG1 and SG3 are both used for a number of events in the room, but, as per this bug, SG2 is unused. Now we purge some events from this room: the events referencing SG1 and SG3 go away, so those two state groups are detected as unused and deleted. However, SG2 loses its parent, so it needs "de-deltaing".

Multiply this effect by 1.3M, and you have a real problem.
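
A rough sketch (not from this thread) of how delta-stored state gets resolved on PostgreSQL may help: the full state for a group is assembled by walking state_group_edges up through its ancestors and, for each (type, state_key), keeping the row from the nearest group in the chain. "De-deltaing" an orphaned group means materialising all of those rows directly against it. The id 2 below stands for the hypothetical SG2 above.

with recursive chain(state_group) as (
    values(2::bigint)   -- hypothetical id of SG2 from the example above
  union all
    select e.prev_state_group
    from state_group_edges e
    join chain c on e.state_group = c.state_group
)
select distinct on (type, state_key) type, state_key, event_id
from state_groups_state
where state_group in (select state_group from chain)
order by type, state_key, state_group desc;   -- nearest delta wins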

ghost commented 5 years ago

hi,

i believe i'm facing the same problem described a year earlier in this issue: the whole database weighs 14 GB (only 7 users registered, no huge rooms joined...)

# SELECT pg_size_pretty( pg_database_size('matrix_prod') );
 pg_size_pretty 
----------------
 14 GB
(1 row)

here are the biggest tables:

matrix_prod=# select schemaname as table_schema,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as data_size,
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
      as external_size
from pg_catalog.pg_statio_user_tables
order by pg_total_relation_size(relid) desc,
         pg_relation_size(relid) desc
limit 10;
 table_schema |        table_name         | total_size | data_size | external_size 
--------------+---------------------------+------------+-----------+---------------
 public       | state_groups_state        | 4724 MB    | 3134 MB   | 1590 MB
 public       | event_json                | 2857 MB    | 2502 MB   | 354 MB
 public       | received_transactions     | 1221 MB    | 697 MB    | 524 MB
 public       | stream_ordering_to_exterm | 1193 MB    | 672 MB    | 520 MB
 public       | event_auth                | 907 MB     | 633 MB    | 274 MB
 public       | events                    | 811 MB     | 366 MB    | 445 MB
 public       | event_edges               | 746 MB     | 261 MB    | 485 MB
 public       | room_memberships          | 527 MB     | 284 MB    | 243 MB
 public       | event_reference_hashes    | 429 MB     | 200 MB    | 229 MB
 public       | state_events              | 312 MB     | 221 MB    | 91 MB
(10 rows)

isn't there something that can be done? it's labeled P1 and i think it's truly critical.

-- edit 10 days later: the DB size is now 16 GB 😩

sargon commented 5 years ago

Coming back to this topic. We hit the magical ~100GB table size last week. I got it under control with compress-state applied to every room, which took us only 4 days. After a VACUUM FULL the database size shrank down to something around 22GB (just the state_groups_state table). To my knowledge synapse has been patched to remove the loose ends during history purging, so that is contained. But ...

I just ran the queries from above and they still find unreferenced state groups (~10k). Since my knowledge of the database schema is nearly non-existent, can you please provide us with a query to safely get rid of those rows?
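
A minimal, unvetted sketch of such a cleanup (assuming Synapse is stopped and the unreferenced_state_groups temporary table from earlier in this thread has been created in the same database session): delete the rows belonging to those groups, then re-run the check query and repeat, since removing one layer of groups can leave their parents newly unreferenced.

begin;
-- delta rows stored against the unreferenced groups
delete from state_groups_state
 where state_group in (select id from unreferenced_state_groups);
-- edges pointing from the unreferenced groups to their parents
delete from state_group_edges
 where state_group in (select id from unreferenced_state_groups);
-- the groups themselves
delete from state_groups
 where id in (select id from unreferenced_state_groups);
commit;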

mytbk commented 4 years ago

There are rooms that are hard to compress; the following room still has 35M rows after compression:

Fetching state from DB for room '!QtykxKocfZaZOUrTwp:matrix.org'...
  [12m] 49667078 rows retrieved
Got initial state from database. Checking for any missing state groups...
No missing state groups
Number of state groups: 10136
Number of rows in current table: 49665768
Compressing state...
[00:02:29] ____________________ 10136/10136 state groups
Number of rows after compression: 35295076 (71.07%)
Compression Statistics:
  Number of forced resets due to lacking prev: 51
  Number of compressed rows caused by the above: 319150
  Number of state groups changed: 6546

grinapo commented 4 years ago

(Sidenote: irc bridged rooms are far beyond MatrixHQ now, with m.room.member events all over the place.)

richvdh commented 4 years ago

Another factor in this is that, as of #6320, we now create a new state group for any new state event which is submitted via the C-S API, even if that event is not accepted.

richvdh commented 4 years ago

The long and the short of this is that I think we need a script which will gradually walk the state_groups table, looking for redundant state groups and removing them.

(it would also be nice to stop some of the state groups being created in the first place, but that's a bit harder.)
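
A rough sketch (an assumption, not the script described above; only to be run with Synapse stopped) of what such an incremental cleanup could look like in PL/pgSQL: delete unreferenced state groups in small batches and loop until none remain, since removing one layer can expose its parents as newly unreferenced.

do $$
declare
  deleted integer;
begin
  loop
    -- find a batch of state groups that no event and no edge references
    with doomed as (
      select sg.id
      from state_groups sg
        left join event_to_state_groups esg on esg.state_group = sg.id
        left join state_group_edges e on e.prev_state_group = sg.id
      where esg.state_group is null and e.prev_state_group is null
      limit 1000
    ),
    del_state as (
      delete from state_groups_state where state_group in (select id from doomed)
    ),
    del_edges as (
      delete from state_group_edges where state_group in (select id from doomed)
    )
    delete from state_groups where id in (select id from doomed);
    get diagnostics deleted = row_count;
    exit when deleted = 0;  -- nothing left to remove
  end loop;
end $$;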

richvdh commented 4 years ago

we will hope to revisit this in the future as part of a larger piece of work on the database

MurzNN commented 4 years ago

I have the same problem on our public homeserver ru-matrix.org with about 20 local users - after 2 years of operation the total size of the pgsql database is 123GB! Executing rust-synapse-compress-state on the largest rooms (Matrix HQ, Riot-*, KDE, etc) doesn't dramatically reduce the size of the state_groups_state table; it is still large.

Here are the current per-table statistics:

 relation                    | total_size | rows
-----------------------------+------------+-----------
 state_groups_state          | 51 GB      | 274929152
 event_json                  | 22 GB      | 14572037
 events                      | 8853 MB    | 14609418
 event_edges                 | 8477 MB    | 15573624
 event_reference_hashes      | 4519 MB    | 14520730
 stream_ordering_to_exterm   | 4340 MB    | 2154655
 event_auth                  | 3719 MB    | 17286570
 event_search                | 3637 MB    | 8290099
 received_transactions       | 2815 MB    | 9915562
 event_to_state_groups       | 2555 MB    | 11454427
 room_memberships            | 2102 MB    | 5461632
 current_state_delta_stream  | 1306 MB    | 6627053
 state_events                | 1232 MB    | 5625349
 current_state_events        | 958 MB     | 1272631
 cache_invalidation_stream   | 850 MB     | 4414804
 receipts_linearized         | 794 MB     | 249685
 presence_stream             | 771 MB     | 190234
 state_groups                | 604 MB     | 2960779
 event_forward_extremities   | 347 MB     | 2129
 state_group_edges           | 337 MB     | 3225766

intelfx commented 4 years ago

Same thing here.

intelfx.name is a single-user HS, but I run my own IRC bridge connecting to several high-volume channels. My state_groups_state has upwards of 100M rows, and running the compression tool on all rooms produces ~2G of sql output. Woah.

MurzNN commented 4 years ago

running the compression tool on all rooms produces ~2G of sql output.

How did you run the script on all rooms? I see that it accepts only one room per argument, and has no bulk mode.

Also, how many rows stay in the table after the compression finished?

MurzNN commented 4 years ago

@intelfx also, can you share Prometheus performance stats of your homeserver - total CPU usage and per background job? I have a problem where Synapse has been eating 100% of CPU all the time for the last month, and I can't tell whether it is related to the current issue or a separate one.

intelfx commented 4 years ago

@MurzNN

How did you run the script on all rooms? I see that it accepts only one room per argument, and has no bulk mode.

You're correct. I just hacked it up:

-- get_candidate_rooms.sql
SELECT room_id
FROM (
        SELECT room_id, count(*) AS count
        FROM state_groups_state
        GROUP BY room_id
        ORDER BY count DESC
) AS groups
WHERE count > 100;
$ psql synapse -Aqt -f get_candidate_rooms.sql >rooms

$ parallel --bar 'id={}; target/release/synapse-compress-state -p "host=/run/postgresql user=synapse dbname=synapse" -r "$id" -t -o "out.$id.sql"' <rooms
(lots of output which I did not keep)

$ parallel --bar 'f={}; psql synapse -f "$f" -o /dev/null && rm -f "$f"' ::: out.*.sql

Note that the tool is massively multi-threaded and CPU heavy. On my 16-core box it took almost an hour to process ~150 rooms.

Also I used GNU parallel because the initial data loading is single-threaded and I had RAM to spare, but you might be better off doing it sequentially.

Also, how many rows stay in the table after the compression finished?

19M rows in state_groups_state as of now.

intelfx commented 4 years ago

@MurzNN

also, can you share Prometheus performance stats of your homeserver - total CPU usage and per background job? I have a problem where Synapse has been eating 100% of CPU all the time for the last month, and I can't tell whether it is related to the current issue or a separate one.

Well, here it is: https://intelfx.name/grafana/d/000000012/synapse

I've been observing synapse processes regularly spiking to 100% CPU over the last month or so. Thing is, I've just switched to PyPy, so you won't see how much impact the state group cleanup has actually made. Another thing is that I still haven't learned to properly visualize counters and rates with Prometheus/Grafana, so you won't find those 100% peaks on my dashboard.

So, I'm not sure if that data is of any use.

richvdh commented 4 years ago

this issue is unrelated to rust-synapse-compress-state; I've hidden a lot of off-topic comments about it.

erikjohnston commented 4 years ago

https://github.com/erikjohnston/synapse-find-unreferenced-state-groups is an old tool I wrote for finding unreferenced state groups, though you have to be very careful if running against a live synapse

MurzNN commented 4 years ago

https://github.com/erikjohnston/synapse-find-unreferenced-state-groups is an old tool I wrote for finding unreferenced state groups, though you have to be very careful if running against a live synapse

Thanks for the tool! Can we safely use it (without fear of breaking something) on a recent Synapse version that is in a stopped state?

erikjohnston commented 4 years ago

I believe it should be safe if Synapse is not running.

darkphoenix commented 4 years ago

Something needs to be done about this, preferably automatically - I run a homeserver with a single user and the database is getting close to 15GB (after just a few months) and causing me actual disk space issues. This is clearly insane; it can't possibly be working as intended.

pixelyo commented 4 years ago

I seem to have the same problem on my fresh homeserver instance with 7 users. The instance has only been running for about 10 days and has accumulated almost 1gb of data (incremental backup, ~100mb per day). No big rooms are on the instance or connected to it. Only to test federation did I join the synapse admin room on the matrix.org instance for about a day (I think that might be the reason for the big initial size).

Below is a little overview of how the backup size of my hs grew. It includes the postgres data (currently 436.7mb) as well as the media files (currently 237mb). (Live data is smaller than the backup, since it is an incremental backup.)

 Day | Backup Size
-----+-------------
   2 | 253 mb
   3 | 339 mb
   4 | 431 mb
   5 | 510 mb
   6 | 555 mb
   8 | 632 mb
   9 | 775 mb
  10 | 853 mb
  11 | 919 mb

richvdh commented 4 years ago

to check if you are affected by this issue, run this query:

select count(*) from state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;

if you see numbers in the thousands, then it is this issue. Otherwise, you're not (and it's probably expected behaviour).

pixelyo commented 4 years ago

looks like I am not affected by this issue:

synapse=# select count(*) from state_groups sg
synapse-#     left join event_to_state_groups esg on esg.state_group=sg.id
synapse-#     left join state_group_edges e on e.prev_state_group=sg.id
synapse-# where esg.state_group is null and e.prev_state_group is null;
 count 
-------
    73
(1 row)

what made me wonder is why synapse is generating about 100mb of data each day with almost no activity. Or is 100mb per day with almost no activity normal?

aaronraimist commented 4 years ago

My server is at 8681

MurzNN commented 4 years ago

what made me wonder is why synapse is generating about 100mb of data each day with almost no activity. Or is 100mb per day with almost no activity normal?

@pixelyo Try running the SQL queries from https://github.com/matrix-org/synapse/wiki/SQL-for-analyzing-Synapse-PostgreSQL-database-stats to understand which rooms eat the most space.

pixelyo commented 4 years ago

@pixelyo Try running the SQL queries from https://github.com/matrix-org/synapse/wiki/SQL-for-analyzing-Synapse-PostgreSQL-database-stats to understand which rooms eat the most space.

those queries are handy. the synapse-admin:matrix.org room is by far the biggest, but no local users are connected to it.

MurzNN commented 4 years ago

@pixelyo Try running the SQL queries from https://github.com/matrix-org/synapse/wiki/SQL-for-analyzing-Synapse-PostgreSQL-database-stats to understand which rooms eat the most space.

those queries are handy. the synapse-admin:matrix.org room is by far the biggest, but no local users are connected to it.

So you are not affected by this issue; let's discuss this problem somewhere else. You can purge this room from the database; a useful article about this is https://levans.fr/shrink-synapse-database.html

darkphoenix commented 4 years ago

I get some 48,000 results, so I'd say it's quite clear I'm affected - I'm not sure which rooms are causing it, but I assume the fact that I'm running an IRC bridge and am in several big rooms is a factor (#freenode in particular generates a lot of member events, which I understand exacerbates this issue).

verymilan commented 4 years ago

synapse=> select count(*) from state_groups sg
synapse->     left join event_to_state_groups esg on esg.state_group=sg.id
synapse->     left join state_group_edges e on e.prev_state_group=sg.id
synapse-> where esg.state_group is null and e.prev_state_group is null;
  count   
----------
 99157248
(1 row)

...did i win?

richvdh commented 4 years ago

...did i win?

you ... really did. wtf is going on there?

verymilan commented 4 years ago

i would like to know that - not only because of the speed at which my database grows, but also because my cpu load made me find this issue... i am trying to set up workers, but after restarts it does not take long before my poor little ryzen gets pushed to a load of 8-14 and everything becomes slow again (that's with only one generic worker running; only one of them wants to start up right now for some reason, but that's a different topic)

i am happy to help and learn, if you have something i could provide info of for you...

apart from the load i am quite happy that i have migrated to zfs with lz4... it's just that compress-state with analyze/reindex doesn't seem to do that much for me, and i don't really want to interrupt the service for a full vacuum...

rubo77 commented 4 years ago

One of our databases grew to 380gb. Unfortunately the hard disk failed and the server is gone now.

MurzNN commented 4 years ago

I remember that my state groups seemingly started to grow significantly after I generated a new key for Synapse on 2019-05-18, and the growth seemingly slowed significantly when, some months later on 2019-09-01, I placed the old key in the "old_signing_keys" section of the Synapse YML file (here is my issue about this: https://github.com/matrix-org/synapse/issues/5958).

Maybe this info helps with investigating the issue. Did other affected Synapses do the key regeneration procedure?

TheDiscordian commented 4 years ago

You are now connected to database "synapse" as user "postgres".
synapse=# select count(*) from state_groups sg
synapse-#     left join event_to_state_groups esg on esg.state_group=sg.id
synapse-#     left join state_group_edges e on e.prev_state_group=sg.id
synapse-# where esg.state_group is null and e.prev_state_group is null;
  count   
----------
 47884451
(1 row)

Is there a bandaid I can apply? I tried compressing my state, and was alerted to this issue. See matrix-org/rust-synapse-compress-state#21.

Edit:

Is deleting everything returned by https://github.com/erikjohnston/synapse-find-unreferenced-state-groups for problem rooms a temporary fix?

MurzNN commented 4 years ago

@TheDiscordian, are those rooms federated? Did you regenerate the Synapse key after creating the rooms that now have millions of state groups?

TheDiscordian commented 4 years ago

@TheDiscordian, are those rooms federated? Did you regenerate the Synapse key after creating the rooms that now have millions of state groups?

Rooms are most likely federated, I don't think I've regenerated a key.

Edit: Also, I run the Discord bridge mx-puppet-discord, but so do others, so I'm not sure...

MurzNN commented 4 years ago

The interesting thing is that I have rooms that have accumulated millions of state groups, but current_state_events shows much smaller numbers. For example:

# SELECT room_id, count(*) cnt FROM state_groups_state GROUP BY room_id ORDER BY cnt DESC LIMIT 1;
                room_id                 |   cnt    
----------------------------------------+----------
 !GibBpYxFGNraRsZOyl:matrix.org         | 64182319

# select count(*) from state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;

 count  
--------
 167228

# SELECT r.name, s.room_id, s.current_state_events
   FROM room_stats_current s
   LEFT JOIN room_stats_state r USING (room_id)
 WHERE s.room_id = '!GibBpYxFGNraRsZOyl:matrix.org';
      name       |            room_id             | current_state_events 
-----------------+--------------------------------+----------------------
 Techlore - Main | !GibBpYxFGNraRsZOyl:matrix.org |                19103
(1 row)

Why do the numbers differ so much?

richvdh commented 4 years ago

Why do the numbers differ so much?

because current_state_events contains only the current state of the room, whereas state_groups_state is used to track the state at all historical points in the room.

MurzNN commented 4 years ago

I have run the erikjohnston/synapse-find-unreferenced-state-groups script on my database (with Synapse stopped); it seems all finished well. Here are the results, in total row counts:

 state_groups_state: 467977351 »» 418967321
 state_group_edges: 4947416 »» 4770055
 state_groups: 4765907 »» 4585992

And the output of the tool:

# time ./rust-synapse-find-unreferenced-state-groups -p postgresql://synapse:SECRET@localhost/synapse -o /tmp/sgs.txt
Fetched 4764093 state groups from DB
Fetching 1105 missing state groups from DB
...
Total state groups: 4778508
Found 194330 unreferenced groups

real    15m11.746s
user    13m15.233s
sys     0m14.012s

Table sizes:

state_groups_state | 87 GB »» 67 GB
state_group_edges | 479 MB »» 406 MB
state_groups | 863 MB »» 793 MB