Closed by jminn 2 years ago
Doing an initial /sync (i.e. one without a since query param) is known to be expensive, but I'm really surprised that it's causing such problems on such a small DB.
Do you have the same problem if you downgrade to v1.52.0? Otherwise, if you could enable debug logs and send a copy over to @erikj:jki.re on Matrix, that would be fab (debug logs can contain some sensitive data). If you can reproduce it, it's worth running the following query in postgres:
SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY xact_start;
which will list the currently ongoing transactions/queries, ordered oldest first. My best guess at this point is that there is an SQL query doing a large table scan, perhaps because the query planner has outdated statistics? You could try running VACUUM ANALYZE to force postgres to update its statistics for all tables, but it may still come up with the wrong conclusions.
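For reference, VACUUM ANALYZE can be run either database-wide or per table; a minimal sketch from psql (the per-table lines just use Synapse's state tables as examples, not a prescribed list):
-- Refresh planner statistics for every table in the current database:
VACUUM ANALYZE;
-- Or limit it to individual tables, e.g. the state tables:
VACUUM ANALYZE state_group_edges;
VACUUM ANALYZE state_groups_state;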
Downgrading to v1.52.0 doesn't seem to help.
It's easy to reproduce the issue - I just have to try to log in - but I don't want to let it run for too long, as it eventually kills my server after some indeterminate amount of time.
Running that SQL query while the issue was occurring gives the following:
synapse=# SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY xact_start;
datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin | query | backend_type
-------+---------+-----+------------+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+--------+-------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
16384 | synapse | 656 | | 10 | synapse | | 127.0.0.1 | | 53296 | 2022-02-24 14:01:33.742588+00 | 2022-02-24 14:03:23.976357+00 | 2022-02-24 14:03:43.221059+00 | 2022-02-24 14:03:43.22106+00 | | | active | | 1166943 | WITH RECURSIVE state(state_group) AS ( VALUES(330::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state ) AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@username:hostname')) ORDER BY type, state_key, state_group DESC | client backend
16384 | synapse | 667 | | 10 | synapse | psql | | | -1 | 2022-02-24 14:03:42.500161+00 | 2022-02-24 14:03:52.724841+00 | 2022-02-24 14:03:52.724841+00 | 2022-02-24 14:03:52.724843+00 | | | active | | 1166946 | SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY xact_start; | client backend
(2 rows)
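For readability, the long-running query from pid 656 above (which appears to be Synapse's recursive state-group lookup) is the following, reformatted:
WITH RECURSIVE state(state_group) AS (
    VALUES(330::bigint)
    UNION ALL
    SELECT prev_state_group
    FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT ON (type, state_key) type, state_key, event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
  AND ((type = 'm.room.history_visibility' AND state_key = '')
       OR (type = 'm.room.member' AND state_key = '@username:hostname'))
ORDER BY type, state_key, state_group DESC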
I checked a couple of times over a period of a few minutes, and that query seems to keep running for at least that long.
I'll look into getting the debug logs. Also, I ran VACUUM ANALYZE, but that didn't seem to resolve the issue either.
@jminn Can you run
SELECT COUNT(1) from ( SELECT state_group, prev_state_group FROM state_group_edges GROUP BY state_group, prev_state_group HAVING COUNT(ctid) > 1 ) AS dupes;
on your Postgres and respond with the answer? (Wondering if it's related to: #11779)
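If that comes back non-zero, a small variation of the same query can also list which edges are duplicated and how many copies each has (just an illustrative sketch along the same lines):
SELECT state_group, prev_state_group, COUNT(*) AS copies
FROM state_group_edges
GROUP BY state_group, prev_state_group
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 10;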
synapse=# SELECT COUNT(1) from ( SELECT state_group, prev_state_group FROM state_group_edges GROUP BY state_group, prev_state_group HAVING COUNT(ctid) > 1 ) AS dupes;
count
-------
475
(1 row)
Right, #11779 sounds like it's biting you. Did you recently restore a database backup and potentially restore it multiple times?
About a year ago I changed server hardware and moved the database from the old hardware using pg_dump and pg_restore. I have no idea whether I did it multiple times.
After looking through some of the linked issues, it sounds like I just needed to de-duplicate the state_group_edges table. I tried this and it seemed to fix the issue:
SELECT DISTINCT * INTO state_group_edges_2 FROM state_group_edges;
DELETE FROM state_group_edges WHERE (state_group, prev_state_group) IN (SELECT * FROM state_group_edges_2);
INSERT INTO state_group_edges SELECT * FROM state_group_edges_2;
DROP TABLE state_group_edges_2;
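Re-running the duplicate-count query from earlier should confirm the cleanup (it ought to report 0 once the duplicates are gone):
SELECT COUNT(1) FROM ( SELECT state_group, prev_state_group FROM state_group_edges GROUP BY state_group, prev_state_group HAVING COUNT(ctid) > 1 ) AS dupes;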
Description
Logging a new device into my homeserver is hanging indefinitely and eventually running the server out of memory.
Homeserver info: 5 users, 250 MB total DB size, and almost no federation (just with two other small homeservers).
Upon login of my new device (tried with both Element for Android and Element Web; both hit the same issue), the device hangs on the "initial sync" while a postgresql process on the server starts running at 100% of one CPU core and keeps going until the system runs out of memory (postgres was using 13 GB the last time it crashed). I've also seen the database grow from 250 MB to ~60 GB while this is occurring, but it returns to the previous size after a restart. I have 3 other already-logged-in devices which are all working fine; I just can't seem to add any new ones.
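In case it's useful, a simple way to watch the on-disk size from psql while this happens is something like the following (assuming the database is named synapse, as in the query output above):
-- Report the current on-disk size of the synapse database in human-readable form:
SELECT pg_size_pretty(pg_database_size('synapse'));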
Steps to reproduce
Log in with a new device (both Element for Android and Element Web hit this) and wait for the initial sync. Normally login completes almost immediately and there's no noticeable wait for an "initial sync", and of course the server doesn't usually crash.
Logs from while the issue was occurring (I killed the DB/synapse at the end of these logs to avoid causing server issues):
Version information
If not matrix.org:
Version: 1.53.0
Install method: Running in container
Platform: Running in container (https://hub.docker.com/layers/matrixdotorg/synapse/latest/images/sha256-fc026ff4b2600d06382d53a1a5e411174342459ebf44529cb1562be065308bd9)