matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Logging in a new device hangs the client and eventually crashes the server #12071

Closed: jminn closed this issue 2 years ago

jminn commented 2 years ago

Description

Logging a new device into my homeserver hangs indefinitely and eventually runs the server out of memory.

Homeserver info: 5 users, 250 MB total DB size, and almost no federation (just with two other small homeservers).

Upon logging in my new device (tried with both Element for Android and Element Web; both hit the same issue), the client hangs on the "initial sync" while a postgresql process on the server runs at 100% of one CPU core until the system runs out of memory (postgres was using 13 GB the last time it crashed). I've also seen the database grow from 250 MB to ~60 GB while this is happening, though it returns to its previous size after a restart. My 3 other already-logged-in devices all work fine; I just can't seem to add any new ones.

Steps to reproduce

Log in with a new device (Element for Android or Element Web). Normally login completes almost immediately with no noticeable wait for an "initial sync", and of course the server doesn't usually crash.

Logs from while the issue was occurring (I killed the DB/synapse at the end of these logs to avoid causing server issues):

2022-02-24 03:25:51,398 - synapse.access.http.8018 - 427 - INFO - GET-117 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 198B 200 "GET /_matrix/client/r0/voip/turnServer HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:51,399 - synapse.handlers.e2e_keys - 545 - INFO - POST-118 - Updating device_keys for device 'KEDPNYDEWP' for user @username:hostname at 1645673151399
2022-02-24 03:25:51,400 - synapse.access.http.8018 - 427 - INFO - GET-119 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 2B 200 "GET /_matrix/client/r0/thirdparty/protocols HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:51,413 - synapse.access.http.8018 - 427 - INFO - GET-120 - 99.176.15.169 - 8018 - {None} Processed request: 0.000sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/0) 484B 200 "GET /_matrix/client/versions HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:51,429 - synapse.access.http.8018 - 427 - INFO - GET-122 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.007sec/0.001sec (0.001sec, 0.001sec) (0.001sec/0.002sec/2) 3523B 200 "GET /_matrix/client/r0/pushrules/ HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:51,431 - synapse.access.http.8018 - 427 - INFO - POST-118 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.032sec/0.001sec (0.002sec, 0.000sec) (0.001sec/0.027sec/4) 47B 200 "POST /_matrix/client/r0/keys/upload HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:51,596 - synapse.access.http.8018 - 427 - INFO - GET-124 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.004sec/0.001sec (0.001sec, 0.000sec) (0.001sec/0.001sec/1) 69B 200 "GET /_matrix/client/r0/user/%40username%3Ahostname/filter/5 HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:51,607 - synapse.access.http.8018 - 427 - INFO - GET-115 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.212sec/0.001sec (0.002sec, 0.000sec) (0.001sec/0.208sec/3) 772B 200 "GET /_matrix/client/unstable/room_keys/version HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:53,026 - synapse.access.http.8018 - 427 - INFO - GET-107 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 30.002sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.001sec/1) 245B 200 "GET /_matrix/client/r0/sync?filter=5&timeout=30000&since=s10906_1685067_0_10158_11664_1_9691_347_1 HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:53,060 - synapse.access.http.8018 - 427 - INFO - GET-128 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.002sec/0.000sec (0.000sec, 0.000sec) (0.000sec/0.000sec/1) 402B 200 "GET /_matrix/client/r0/sync?filter=5&timeout=30000&since=s10906_1685067_0_10158_11664_1_9691_347_1 HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:53,085 - synapse.access.http.8018 - 427 - INFO - GET-109 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 30.002sec/0.000sec (0.002sec, 0.000sec) (0.000sec/0.001sec/1) 245B 200 "GET /_matrix/client/r0/sync?filter=5&timeout=30000&since=s10906_1685067_0_10158_11664_1_9691_347_1 HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:25:53,123 - synapse.access.http.8018 - 427 - INFO - GET-132 - 99.176.15.169 - 8018 - {@username:hostname} Processed request: 0.002sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.001sec/1) 402B 200 "GET /_matrix/client/r0/sync?filter=5&timeout=30000&since=s10906_1685067_0_10158_11664_1_9691_347_1 HTTP/1.0" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0" [0 dbevts]
2022-02-24 03:26:16,272 - synapse.util.caches.lrucache - 154 - INFO - LruCache._expire_old_entries-18 - Dropped 0 items from caches

Version information


erikjohnston commented 2 years ago

Doing an initial /sync (i.e. one without a since query param) is known to be expensive, but I'm really surprised that it's causing such problems on such a small DB.

Do you have the same problem if you downgrade to v1.52.0? Otherwise, if you could enable debug logs and send a copy over to @erikj:jki.re on Matrix, that would be fab (note that debug logs can contain some sensitive data). If you can reproduce it, it is worth running the following query in postgres:

SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY xact_start;

which will list the currently ongoing transactions/queries, ordered oldest first. My best guess at this point is that there is an SQL query doing a large table scan, perhaps because the query planner has outdated statistics. You could try running VACUUM ANALYZE to force postgres to update its statistics for all tables, but it may still come to the wrong conclusions.
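
For reference, VACUUM ANALYZE is a standard Postgres command rather than anything Synapse-specific; a minimal sketch of both the database-wide and per-table forms (the table name below is just an illustration):

-- update planner statistics for every table in the current database
VACUUM ANALYZE;
-- or target a single table, e.g. (illustrative table name):
VACUUM ANALYZE state_groups_state;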

jminn commented 2 years ago

Downgrading to v1.52.0 doesn't seem to help.

It's easy to reproduce the issue (I just have to try to log in), but I don't want to let it run for too long, as it more or less kills my server after some indeterminate point.

Running that SQL query while the issue was occurring gives the following:

synapse=# SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY xact_start;
 datid | datname | pid | leader_pid | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | wait_event_type | wait_event | state  | backend_xid | backend_xmin |                                                                                                                                                                                                                                               query                                                                                                                                                                                                                                                |  backend_type
-------+---------+-----+------------+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+-----------------+------------+--------+-------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------
 16384 | synapse | 656 |            |       10 | synapse |                  | 127.0.0.1   |                 |       53296 | 2022-02-24 14:01:33.742588+00 | 2022-02-24 14:03:23.976357+00 | 2022-02-24 14:03:43.221059+00 | 2022-02-24 14:03:43.22106+00  |                 |            | active |             |      1166943 | WITH RECURSIVE state(state_group) AS ( VALUES(330::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@username:hostname')) ORDER BY type, state_key, state_group DESC | client backend
 16384 | synapse | 667 |            |       10 | synapse | psql             |             |                 |          -1 | 2022-02-24 14:03:42.500161+00 | 2022-02-24 14:03:52.724841+00 | 2022-02-24 14:03:52.724841+00 | 2022-02-24 14:03:52.724843+00 |                 |            | active |             |      1166946 | SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY xact_start;                                                                                                                                                                                                                                                                                                                                                                                                                          | client backend
(2 rows)

I checked a couple of times over a period of a few minutes, and that query kept running the whole time.
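
If the runaway query needs to be stopped without restarting Postgres, it can be cancelled by pid using the pid column from the pg_stat_activity output above (656 in this capture); pg_cancel_backend and pg_terminate_backend are standard Postgres functions, so this is a generic sketch rather than a Synapse-specific procedure:

-- politely ask the backend running the stuck query to cancel it
SELECT pg_cancel_backend(656);
-- if it ignores the cancel request, forcibly terminate the backend
SELECT pg_terminate_backend(656);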

I'll look into getting the debug logs. Also, I ran VACUUM ANALYZE but that didn't seem to resolve the issue either.

reivilibre commented 2 years ago

@jminn Can you run

SELECT COUNT(1) from ( SELECT state_group, prev_state_group FROM state_group_edges GROUP BY state_group, prev_state_group HAVING COUNT(ctid) > 1 ) AS dupes;

on your Postgres and respond with the answer? (Wondering if it's related to: #11779)
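
If that count comes back non-zero, the duplicated (state_group, prev_state_group) pairs themselves can be listed with something along these lines (the LIMIT is only to keep the output short):

SELECT state_group, prev_state_group, COUNT(*) AS copies
FROM state_group_edges
GROUP BY state_group, prev_state_group
HAVING COUNT(*) > 1
ORDER BY copies DESC
LIMIT 20;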

jminn commented 2 years ago

synapse=# SELECT COUNT(1) from ( SELECT state_group, prev_state_group FROM state_group_edges GROUP BY state_group, prev_state_group HAVING COUNT(ctid) > 1 ) AS dupes;
 count 
-------
   475
(1 row)

reivilibre commented 2 years ago

Right, #11779 sounds like it's biting you. Did you recently restore a database backup and potentially restore it multiple times?
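
For background: the recursive CTE captured in pg_stat_activity above uses UNION ALL, which does not de-duplicate, so every duplicated row in state_group_edges makes the recursion emit the same prev_state_group again and the working set multiplies instead of walking a single chain back to the root. A way to look at the plan without actually executing the runaway query (the state_group value 330 and the member state_key are copied from the capture above) is plain EXPLAIN:

-- EXPLAIN only plans the query; EXPLAIN ANALYZE would execute it and hang again
EXPLAIN
WITH RECURSIVE state(state_group) AS (
    VALUES(330::bigint)
    UNION ALL
    SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group
)
SELECT DISTINCT ON (type, state_key) type, state_key, event_id
FROM state_groups_state
WHERE state_group IN ( SELECT state_group FROM state )
  AND ((type = 'm.room.history_visibility' AND state_key = '')
    OR (type = 'm.room.member' AND state_key = '@username:hostname'))
ORDER BY type, state_key, state_group DESC;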

jminn commented 2 years ago

About a year ago I changed server hardware and moved the database over from the old machine using pg_dump and pg_restore. I have no idea whether I ran the restore more than once.

jminn commented 2 years ago

After looking through some of the linked issues, it sounds like I just needed to de-duplicate the state_group_edges table. I tried this and it seemed to fix the issue:

-- copy one row per distinct (state_group, prev_state_group) pair into a working table
SELECT DISTINCT * INTO state_group_edges_2 FROM state_group_edges;
-- remove the original rows (every pair is present in the working table, so this clears them all, duplicates included)
DELETE FROM state_group_edges WHERE (state_group, prev_state_group) IN (SELECT * FROM state_group_edges_2);
-- re-insert the de-duplicated rows and drop the working table
INSERT INTO state_group_edges SELECT * FROM state_group_edges_2;
DROP TABLE state_group_edges_2;
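
Re-running the duplicate check from earlier in the thread afterwards is a quick way to confirm the table is clean:

SELECT COUNT(1) FROM ( SELECT state_group, prev_state_group FROM state_group_edges GROUP BY state_group, prev_state_group HAVING COUNT(ctid) > 1 ) AS dupes;
-- expected: count = 0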