matrix-org / rust-synapse-compress-state

A tool to compress some state in a Synapse instance's database
https://pypi.org/project/synapse-auto-compressor/
Apache License 2.0

Some Rooms Require Over 20GB Storage #21

Open TheDiscordian opened 3 years ago

TheDiscordian commented 3 years ago

Describe the bug

During the "Fetching state from DB for room" step, storage is consumed continuously until the disk is full; the tool then errors out and the space is freed. This might be related to #6 (maybe even a duplicate?), but I'm talking about storage space, not memory. I only have 14GB of storage right now, and the tool can't complete on two of my rooms, which have only 110141 and 167096 state groups, because I run out of storage.

To Reproduce

Honestly not sure how you'd reproduce it. I can't see anyone else with the issue. It happens when I run the tool on !zTAqnOWiFuKTlnGOhq:matrix.thedisco.zone and !tmgqjKkMXUbqUHECPV:matrix.thedisco.zone; I don't know what those rooms are.

Expected behavior

Same as when I run it on any other room: it consumes a bit of storage, then finishes normally.

VPS:

I have ~14GB storage free, and 3GB RAM available while running this tool.

Additional context

Command run: ./synapse-compress-state -t -o state-compressor.sql -p "host=localhost user=<redacted> password=<redacted> dbname=<redacted>" -r "!zTAqnOWiFuKTlnGOhq:matrix.thedisco.zone"

Error received:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Db, cause: Some(DbError { severity: "ERROR", parsed_severity: Some(Error), code: SqlState("53100"), message: "could not write to file \"base/pgsql_tmp/pgsql_tmp810.0.sharedfileset/i2924of8192.p0.0\": No space left on device", detail: None, hint: None, position: None, where_: None, schema: None, table: None, column: None, datatype: None, constraint: None, file: Some("buffile.c"), line: Some(526), routine: Some("BufFileDumpBuffer") }) }', src/libcore/result.rs:1188:5

It seems to be a PostgreSQL error, so if this is an upstream issue I guess this can be closed. However, it would be nice to understand why this happens on these rooms and not others. I'm currently running the tool on larger rooms without even noticing storage being consumed, but that room wants to use over 14GB.
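For reference, SQLSTATE 53100 in the panic above is PostgreSQL's disk_full condition, and the failing file sits under base/pgsql_tmp, where the server spills temporary query data. That suggests the space is being consumed by PostgreSQL while it executes the query, not by the tool's output file. The following is a minimal sketch for pulling the code out of such a panic message; the helper names are hypothetical, the sample string is shortened from the error above, and the table covers only the "Class 53" codes.

```python
import re

# Subset of PostgreSQL's "Class 53 - Insufficient Resources" SQLSTATE codes.
SQLSTATE_CLASS_53 = {
    "53000": "insufficient_resources",
    "53100": "disk_full",
    "53200": "out_of_memory",
    "53300": "too_many_connections",
    "53400": "configuration_limit_exceeded",
}


def extract_sqlstate(panic_text):
    """Return the five-character SQLSTATE code from a panic message, or None."""
    match = re.search(r'SqlState\("([0-9A-Z]{5})"\)', panic_text)
    return match.group(1) if match else None


def describe_sqlstate(code):
    """Map a SQLSTATE code to its condition name, if it is in the table above."""
    return SQLSTATE_CLASS_53.get(code, "unknown (not in this table)")


# Shortened copy of the panic message quoted in this issue.
panic = (
    'Error { kind: Db, cause: Some(DbError { severity: "ERROR", '
    'code: SqlState("53100"), message: "... No space left on device" }) }'
)

code = extract_sqlstate(panic)
print(code, describe_sqlstate(code))
```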

I'm still running this on several rooms, I'll update if I notice anything else related.

TheDiscordian commented 3 years ago

So I doubled my plan to investigate this, and got strange results. It used ~21GB, and after finally getting it to finish, it actually wants to add rows (!tmgqjKkMXUbqUHECPV:matrix.thedisco.zone seemed to work more-or-less normally, other than its desire to consume my storage):

Fetching state from DB for room '!zTAqnOWiFuKTlnGOhq:matrix.thedisco.zone'...
  [30s] 111315 rows retrieved
Got initial state from database. Checking for any missing state groups...
No missing state groups
Number of state groups: 95897
Number of rows in current table: 111314
Compressing state...
[00:00:05] ████████████████████ 95897/95897 state groups
Number of rows after compression: 112925 (101.45%)
Compression Statistics:
  Number of forced resets due to lacking prev: 0
  Number of compressed rows caused by the above: 0
  Number of state groups changed: 71314
Writing changes...
[00:00:05] ████████████████████ 95897/95897 state groups
Checking that state maps match...
[00:00:03] ████████████████████ 95897/95897 state groups
New state map matches old one

That's definitely related to #7.

I ran into a roadblock because most operations on Postgres, and running this tool on rooms with over 3 million states, want to use well over 10GB. So now I'm running it on my remaining rooms with 80GB free, to see what happens. I'm still not sure whether the issues I run into while running the state compressor come from me being overly optimistic about how Postgres works, or from a quirk of the tool itself.

TheDiscordian commented 3 years ago

Another run that looks a lot like #7 to me:

Fetching state from DB for room '!crIjBxWRQKDgFslSWe:matrix.thedisco.zone'...
  [5m] 3528863 rows retrieved
Got initial state from database. Checking for any missing state groups...
No missing state groups
Number of state groups: 3528484
Number of rows in current table: 3528862
Compressing state...
[00:07:42] ████████████████████ 3528484/3528484 state groups
Number of rows after compression: 4646547 (131.67%)
Compression Statistics:
  Number of forced resets due to lacking prev: 1429
  Number of compressed rows caused by the above: 72453
  Number of state groups changed: 3528162
Writing changes...
[00:04:16] ████████████████████ 3528484/3528484 state groups
Checking that state maps match...
[00:02:04] ████████████████████ 3528484/3528484 state groups
New state map matches old one

Note: First run consumed over 12GB and crashed, second run (with larger disk) consumed ~11GB, then steadily decreased usage as it ran for several more minutes.

Additionally the process took a couple minutes to exit after the final output. I think it's correct to assume I shouldn't add that to my DB?
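As a rough check on that assumption, the percentage the tool prints is the row count after compression divided by the row count before, so anything above 100% means the generated SQL would add rows. A minimal sketch using the numbers from the runs in this issue (the function names are hypothetical, not part of the tool):

```python
def compression_ratio(rows_before, rows_after):
    """Rows after compression as a percentage of rows before."""
    return 100.0 * rows_after / rows_before


def worth_applying(rows_before, rows_after):
    """Only apply the generated SQL if it actually removes rows."""
    return rows_after < rows_before


# (rows in current table, rows after compression), copied from the output above.
runs = {
    "!zTAqnOWiFuKTlnGOhq:matrix.thedisco.zone": (111314, 112925),
    "!crIjBxWRQKDgFslSWe:matrix.thedisco.zone": (3528862, 4646547),
    "!KFngiVrDuaSdqzHTyb:matrix.thedisco.zone": (4319178, 6128284),
}

for room, (before, after) in runs.items():
    ratio = compression_ratio(before, after)
    verdict = "apply" if worth_applying(before, after) else "skip"
    print(f"{room}: {ratio:.2f}% -> {verdict}")
```

All three runs come out above 100%, so by this check none of them would be worth applying.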

Edit:

Fetching state from DB for room '!KFngiVrDuaSdqzHTyb:matrix.thedisco.zone'...
  [8m] 4319179 rows retrieved
Got initial state from database. Checking for any missing state groups...
No missing state groups
Number of state groups: 3982078
Number of rows in current table: 4319178
Compressing state...
[00:11:20] ████████████████████ 3982078/3982078 state groups
Number of rows after compression: 6128284 (141.89%)
Compression Statistics:
  Number of forced resets due to lacking prev: 12326
  Number of compressed rows caused by the above: 1056363
  Number of state groups changed: 3981588

Edit 2:

For rooms over 9 million states, over 6.5GB RAM is consumed, and I hit #6.

TheDiscordian commented 3 years ago

I'm wondering if the issues described here are symptoms of matrix-org/synapse#3364.

MurzNN commented 3 years ago

@TheDiscordian can you count the number of state groups for those problematic rooms manually, via an SQL query:

SELECT room_id, count(*) cnt FROM state_groups_state WHERE room_id = '!KFngiVrDuaSdqzHTyb:matrix.thedisco.zone' GROUP BY room_id;

This would show whether the count reported by the script matches the real count, or whether something is going wrong.
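Note that count(*) on state_groups_state counts rows, while the tool prints "Number of rows in current table" and "Number of state groups" as two separate figures. The sketch below shows the difference on a toy in-memory SQLite table; the column list is a simplified assumption about the Synapse schema, and the room IDs and event IDs are made up for illustration.

```python
import sqlite3

# Toy stand-in for Synapse's state_groups_state table (simplified, assumed columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_groups_state ("
    "state_group INTEGER, room_id TEXT, type TEXT, state_key TEXT, event_id TEXT)"
)
rows = [
    (1, "!a:example.org", "m.room.create", "", "$e1"),
    (1, "!a:example.org", "m.room.member", "@u:example.org", "$e2"),
    (2, "!a:example.org", "m.room.member", "@v:example.org", "$e3"),
    (3, "!b:example.org", "m.room.create", "", "$e4"),
]
conn.executemany("INSERT INTO state_groups_state VALUES (?, ?, ?, ?, ?)", rows)


def counts_for_room(conn, room_id):
    """Return (row count, distinct state group count) for one room."""
    return conn.execute(
        "SELECT count(*), count(DISTINCT state_group) "
        "FROM state_groups_state WHERE room_id = ?",
        (room_id,),
    ).fetchone()


row_count, group_count = counts_for_room(conn, "!a:example.org")
print(f"rows={row_count} state_groups={group_count}")
```

Against a real PostgreSQL database the same SELECT should work with the room ID substituted in, though count(DISTINCT ...) over tens of millions of rows may itself be slow.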

For example, I have some rooms with 10+ million state groups:

# SELECT room_id, count(*) cnt FROM state_groups_state GROUP BY room_id ORDER BY cnt DESC LIMIT 20;
                room_id                 |   cnt    
----------------------------------------+----------
 !GibBpYxFGNraRsZOyl:matrix.org         | 64182319
 !YynUnYHpqlHuoTAjsp:matrix.org         | 45332229
 !QtykxKocfZaZOUrTwp:matrix.org         | 25002578
 !YYtOqtdMtFNanKzfuQ:matrix.org         | 20940299
 !iEiJZbwrOzEkZNjsYf:matrix.org         | 19405073
 !yhqiEdqNjyPbxtUjzm:matrix.org         | 18111151
 !TdAwENXmXuMrCrFEFX:maunium.net        | 15345266
 !AinLFXQRxTuqNpXyXk:matrix.org         | 14380484
 !BvarTFnpDHTUVRxQwu:matrix.org         | 13982609
 !BsnSQSkXzTpoPmrTZt:matrix.org         | 12967142
 !LuJjThBOzlYIbhxrnb:matrix.org         | 10909695
 !XZRkbGFMRqNOBdwevA:irc.snt.utwente.nl | 10119807
 !OGEhHVWSdvArJzumhm:matrix.org         |  9941964
 !BfPzMJuQMaOmBzFOdD:matrix.org         |  9714868
 !HwocBmCtBcHQhILtYQ:matrix.org         |  9229235
 !uDQoIebqsjEEtmWLrO:disroot.org        |  7311186
 !jDLOUpjYLfQYvmOufZ:matrix.org         |  7067452
 !mjbDjyNsRXndKLkHIe:matrix.org         |  6882739
 !SnjanvpKioPhkfzzPu:matrix.org         |  6120224
 !UUrMiRkIyPVcevkjdl:matrix.org         |  5958264
(20 rows)

And this is not an effect of https://github.com/matrix-org/synapse/issues/3364, because the count is less than a million:

select count(*) from state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;
 count  
--------
 167228

MurzNN commented 3 years ago

I have successfully run the script to optimize the one room with 64 182 319 state events; the count after compression is 8 655 766 rows (13.4%). The script ran for about 3 hours on a VPS with 32 GB of RAM, used about 6 GB of RAM (10 GB VIRT) while working, and produced an SQL file of 850 MB. Executing the insert/delete SQL queries took about 30 minutes. The state_groups_state table went from 551 085 432 rows (94GB) to 496 093 360 rows (VACUUM state_groups_state didn't reduce the real storage size; after VACUUM FULL it was 80GB).

TheDiscordian commented 2 years ago

FWIW with the new auto-compressor, I'm currently seeking settings that may mitigate this? I'm at over 50GB on some rooms, and it's wild paying for VPS space that's only used for maintenance sometimes. Running experiments rn, but it's slow even with NVMe drives for each test, so I doubt I'll have much to show for it. My DB is over 250GB last I looked : /. I think this is all used for a cache? If I could pick the drive, I could probably mitigate this (though I can't find an upper limit to these sizes...).

BTW MurzNN, I believe I did run that query of yours in another issue and we talked there. I didn't mean to ignore you for nearly 2 years :').

sdomi commented 1 year ago

Ran into the same problem today. One room took over 24GB of space (the whole db is ~60GB), and my PostgreSQL partition ran out of space. Afterwards, the server never really worked again (Synapse starts out with a few SELECT operations that never finish, and then everything related to the database just times out into oblivion).

I would not recommend using this tool unless you have a virtually unlimited amount of disk space :c
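One way to avoid filling the partition might be to check free space before starting a run. This is a minimal sketch, assuming the check is pointed at the filesystem that holds the PostgreSQL data directory; the path and the 30GB threshold below are hypothetical, picked only because runs in this thread used between ~11GB and over 50GB.

```python
import shutil


def has_headroom(path, required_bytes):
    """Return True if the filesystem containing `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# Hypothetical values: replace "." with the PostgreSQL data directory's mount
# point, and choose a threshold that suits the size of the room being compressed.
REQUIRED_BYTES = 30 * 1024**3

if has_headroom(".", REQUIRED_BYTES):
    print("enough free space, starting the compressor")
else:
    print("not enough free space, skipping this room")
```

This only guards against starting with too little space; it does not bound how much temporary space PostgreSQL uses once the query is running.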