matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0

Initial sync is taking a long time to finish #12158

Closed jffrancob closed 1 year ago

jffrancob commented 2 years ago

Description

Currently we're facing a big performance issue with Synapse when the user logs in to Element and performs the initial sync. The request is taking up to 20 minutes to finish. All we can see is a recurring query in the DB that looks like this:

WITH RECURSIVE state(state_group) AS ( VALUES(41595::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@supervisor:domain.com')) ORDER BY type, state_key, state_group DESC
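
For readability, here is the same query laid out (the literal 41595 is just the state group id from this particular log line). It walks back through state_group_edges from the given state group to its ancestors, then, for each (type, state_key) pair, keeps the row from the newest state group in that chain:

WITH RECURSIVE state(state_group) AS (
    VALUES(41595::bigint)
    UNION ALL
    SELECT prev_state_group
    FROM state_group_edges e, state s
    WHERE s.state_group = e.state_group
)
SELECT DISTINCT ON (type, state_key) type, state_key, event_id
FROM state_groups_state
WHERE state_group IN (SELECT state_group FROM state)
  AND ((type = 'm.room.history_visibility' AND state_key = '')
    OR (type = 'm.room.member' AND state_key = '@supervisor:domain.com'))
ORDER BY type, state_key, state_group DESC;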

Steps to reproduce

It's important to add that our deployment is not federated, but we use the Mautrix-WhatsApp bridge and it was set up to perform backfilling. We tried increasing the machine resources, but it didn't improve the response time.

Version information

jffrancob commented 2 years ago

After some research we found some known bugs and issues with state groups [1][2].
[1] https://matrix-org.github.io/synapse/latest/usage/administration/state_groups.html
[2] https://github.com/matrix-org/synapse/issues/3364

In our case, the number of unreferenced state groups is 115173:

select count(*) from state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;
 count  
--------
 115173
(1 row)

However, our database is fairly small:

SELECT pg_size_pretty( pg_database_size('synapse') );
 pg_size_pretty 
----------------
 1884 MB
(1 row)

The tables are small too:

select schemaname as table_schema,
    relname as table_name,
    pg_size_pretty(pg_total_relation_size(relid)) as total_size,
    pg_size_pretty(pg_relation_size(relid)) as data_size,
    pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid))
      as external_size
from pg_catalog.pg_statio_user_tables
order by pg_total_relation_size(relid) desc,
         pg_relation_size(relid) desc
limit 10;
 table_schema |       table_name       | total_size | data_size | external_size 
--------------+------------------------+------------+-----------+---------------
 public       | event_json             | 311 MB     | 294 MB    | 17 MB
 public       | event_edges            | 139 MB     | 49 MB     | 90 MB
 public       | events                 | 128 MB     | 57 MB     | 71 MB
 public       | event_auth             | 114 MB     | 81 MB     | 33 MB
 public       | state_groups_state     | 67 MB      | 44 MB     | 23 MB
 public       | event_reference_hashes | 65 MB      | 29 MB     | 36 MB
 public       | state_groups           | 55 MB      | 32 MB     | 23 MB
 public       | event_search           | 55 MB      | 35 MB     | 20 MB
 public       | event_to_state_groups  | 43 MB      | 21 MB     | 22 MB
 public       | event_auth_chain_links | 37 MB      | 25 MB     | 13 MB
(10 rows)

And we don't have a large number of rooms:

select count(*) from rooms ;
 count 
-------
  3383
(1 row)

These are the rooms with the most rows in state_groups_state:

SELECT room_id, count(*) cnt FROM state_groups_state GROUP BY room_id ORDER BY cnt DESC LIMIT 10;
                 room_id                  | cnt 
------------------------------------------+-----
 !BgEfFfYMhyOcttvNfO:domain.com | 634
 !yEqKcqXpBFSlmaaUrK:domain.com | 604
 !uSIfnORIJBDvblUerc:domain.com | 461
 !XfQBTuKiJUHBDccOJo:domain.com | 456
 !BGgFHXSbKHvIxtiMBn:domain.com | 447
 !iECYZMWICqTXgiCPfr:domain.com | 436
 !BBZfJtPWDLJQltvdpU:domain.com | 432
 !emAdrLpvJFfSxRVUJu:domain.com | 422
 !xSPVtuCmdALkdlWYwb:domain.com | 405
 !VvZUZDTnaWEWMMhVbM:domain.com | 403
(10 rows)

We have some custom state events that are sent to the rooms, but the number of current state events is small:

SELECT s.room_id, s.current_state_events
   FROM room_stats_current s
   LEFT JOIN room_stats_state r USING (room_id)
 WHERE s.room_id ='!BgEfFfYMhyOcttvNfO:domain.com';
                 room_id                  | current_state_events 
------------------------------------------+----------------------
 !BgEfFfYMhyOcttvNfO:domain.com |                   18
(1 row)
jffrancob commented 2 years ago

This is what we've tried to improve performance:

CREATE INDEX state_groups_state_state_key_type ON state_groups_state (state_key, type);
ALTER TABLE state_groups_state ALTER COLUMN state_group SET (n_distinct = -0.02);

ANALYZE state_groups_state (state_group);

ALTER TABLE state_groups_state ALTER state_key SET STATISTICS 5000;
ALTER TABLE state_groups_state ALTER type SET STATISTICS 5000;
ANALYZE state_groups_state;

These changes were based on https://github.com/matrix-org/synapse/issues/5064, https://github.com/matrix-org/synapse/issues/7618 and https://github.com/matrix-org/synapse/pull/10359.
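
A quick way to confirm those per-column overrides actually took effect is to inspect the standard Postgres catalogs (just a sketch; pg_stats is only refreshed after ANALYZE has run):

-- per-column storage options (e.g. the n_distinct override) and statistics targets
SELECT attname, attoptions, attstattarget
FROM pg_attribute
WHERE attrelid = 'state_groups_state'::regclass
  AND attnum > 0 AND NOT attisdropped;

-- distinct-value estimates the planner will now use
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'state_groups_state';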

We have checked these issues too: https://github.com/matrix-org/synapse/issues/7772 https://github.com/matrix-org/synapse/issues/11779

but we don't have any duplicate records:

SELECT COUNT(1) from ( SELECT state_group, prev_state_group FROM state_group_edges GROUP BY state_group, prev_state_group HAVING COUNT(ctid) > 1 ) AS dupes;
 count 
-------
     0
(1 row)

We added an index to state_groups_state according to https://github.com/matrix-org/synapse/issues/9182:

CREATE INDEX state_group_idx ON state_groups_state USING BTREE(state_group);
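
As a quick sanity check (the id 41595 is just the example value from the query quoted at the top of this issue), EXPLAIN can be used to confirm the planner picks this index up for a single state group lookup:

EXPLAIN (ANALYZE, BUFFERS)
SELECT type, state_key, event_id
FROM state_groups_state
WHERE state_group = 41595;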

We also did maintenance operations according to: https://github.com/matrix-org/synapse/issues/11184

VACUUM FULL;
REINDEX DATABASE synapse;
ANALYZE;
reivilibre commented 2 years ago

Hi! Thanks for the extensive write-up.

Since it's not especially clear: Is the query that you're seeing the same every time or does the number change? Initial sync is quite a slow operation, especially if your users have a large number of rooms, so perhaps it's not all that unexpected (though 20 minutes still sounds high).

The main thing I would suggest is tuning your caches: to do so, the rough idea is that you want to set up monitoring (Prometheus + Grafana is what we use) and look at the graphs for caches to figure out which caches should be increased in size. We seem to have some documentation about that, see here: https://matrix-org.github.io/synapse/latest/usage/administration/understanding_synapse_through_grafana_graphs.html#caches. Alternatively you can just increase the cache factor and see what happens.

Nowadays you should be able to turn your caches up quite a bit without too much worry about wasteful memory usage, since entries also expire after 30 minutes.

You could also consider offloading some of these expensive operations to different workers, but I don't know enough about your homeserver's usage to advise whether that's suitable.

You might also consider tuning your Postgres database if you haven't done so; some people recommend 'PGTune' to help you get started: https://pgtune.leopard.in.ua/. Using SSDs rather than spinning hard drives seems to help as well.

jffrancob commented 2 years ago

Hello.

We have done some tests according to the suggestions and we would like to include some information:

The number in the query changes a lot. The database log looks like this:

WITH RECURSIVE state(state_group) AS ( VALUES(30051::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@supervisor:domain.com')) ORDER BY type, state_key, state_group DESC
SELECT tag, content FROM room_tags WHERE user_id = '@supervisor:domain.com' AND room_id = '!FzoifeqrdPBmHrqnPR:domain.com'
SET default_transaction_isolation TO DEFAULT
WITH RECURSIVE state(state_group) AS ( VALUES(15780::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@supervisor:domain.com')) ORDER BY type, state_key, state_group DESC
BEGIN ISOLATION LEVEL REPEATABLE READ
SELECT event_id, topological_ordering, stream_ordering FROM event_relations INNER JOIN events USING (event_id) WHERE relates_to_id = '$Sxcjm83FA3XhaaWSicYHm0tVE2psVBwinRPMpfF9vnE' AND room_id = '!FzyWxEeqZyZcIYVuwY:domain.com' AND relation_type = 'm.reference' ORDER BY topological_ordering ASC, stream_ordering ASC LIMIT 6
COMMIT
COMMIT
SELECT tag, content FROM room_tags WHERE user_id = '@supervisor:domain.com' AND room_id = '!FzoFEvRzLCJvkvtdVy:domain.com'
BEGIN ISOLATION LEVEL REPEATABLE READ
BEGIN ISOLATION LEVEL REPEATABLE READ
SELECT e.event_id, e.stream_ordering, ej.internal_metadata, ej.json, ej.format_version, r.room_version, rej.reason, e.outlier FROM events AS e JOIN event_json AS ej USING (event_id) LEFT JOIN rooms r ON r.room_id = e.room_id LEFT JOIN rejections as rej USING (event_id) WHERE e.event_id = ANY(ARRAY['$MuFlEJj_sHwBmVJ6nXsDxFK-LgfBbcCdID2C6cpBtuw'])
SET default_transaction_isolation TO DEFAULT
BEGIN ISOLATION LEVEL REPEATABLE READ
WITH RECURSIVE state(state_group) AS ( VALUES(238978::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@supervisor:domain.com')) ORDER BY type, state_key, state_group DESC
WITH RECURSIVE state(state_group) AS ( VALUES(238948::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@supervisor:domain.com')) ORDER BY type, state_key, state_group DESC
BEGIN ISOLATION LEVEL REPEATABLE READ
SET LOCAL enable_seqscan=off
WITH RECURSIVE state(state_group) AS ( VALUES(238949::bigint) UNION ALL SELECT prev_state_group FROM state_group_edges e, state s WHERE s.state_group = e.state_group ) SELECT DISTINCT ON (type, state_key) type, state_key, event_id FROM state_groups_state WHERE state_group IN ( SELECT state_group FROM state )  AND ((type = 'm.room.history_visibility' AND state_key = '') OR (type = 'm.room.member' AND state_key = '@supervisor:domain.com')) ORDER BY type, state_key, state_group DESC
BEGIN ISOLATION LEVEL REPEATABLE READ

Tuning the database according to https://pgtune.leopard.in.ua/ doesn't seem to improve the /sync time.

jffrancob commented 2 years ago

Also, we ran synapse_auto_compressor and the number of unreferenced state groups decreased to 92852. A lot of rooms were skipped with the message: The compressor tried to increase the number of rows in ..

Again, maybe the backfilling is related, as stated on the synapse_auto_compressor project page: Note: if the compressor fails to find space savings on the chunk as a whole (which may well happen in rooms with lots of backfill in) then the entire chunk is skipped.

select count(*) from state_groups sg
    left join event_to_state_groups esg on esg.state_group=sg.id
    left join state_group_edges e on e.prev_state_group=sg.id
where esg.state_group is null and e.prev_state_group is null;
 count 
-------
 92852
(1 row)

After this, the /sync time is no better.

Regarding storage, this deployment runs in the cloud on SSD disks.

dbedoya commented 2 years ago

We also set up Prometheus and Grafana, and got these top 10 cache misses:

[Grafana screenshot: top 10 caches by cache misses]

Any thoughts on how we can improve cache performance in the Synapse settings?

H-Shay commented 2 years ago

You can tune the caches through two methods: setting a global factor, which is applied to all caches, or setting cache factors for individual caches. Increasing the cache factor will reduce cache misses, but will increase memory usage. You can find out more about how to set the cache factors in the sample config, under the heading Caches. From this graph, it seems like increasing the cache factor for getEvent and DeviceFederationOutboxStreamChangeCache might be a good place to start. Let us know if doing so helps resolve the issue at all.

dbedoya commented 2 years ago

Hi.

We've been working really hard on reducing the initial sync time, but the results haven't been very encouraging.

We set up the cache environment variables in our Docker Synapse service like this:

SYNAPSE_CACHE_FACTOR_GETEVENT=2.0
SYNAPSE_CACHE_FACTOR_STATEGROUPCACHE=2.0
SYNAPSE_CACHE_FACTOR_STATEGROUPMEMBERSCACHE=2.0
SYNAPSE_CACHE_FACTOR_DEVICEFEDERATIONOUTBOXSTREAMCHANGECACHE=2.0

but login time didn't change much (only about 30 seconds of difference).

We've tried a lot of combinations of the state groups compressor script, vacuum, reindex, and an index on (state_group, key) in the state_groups_state table, but again we haven't been able to reduce the initial sync time after logging in to Element.

DMRobertson commented 2 years ago

Hi @dbedoya and @jffrancob. Here are a few questions which might shed some light on the situation:

jffrancob commented 2 years ago

Hi @DMRobertson.

At this point we think it's important to add that our deployments use the Mautrix-WhatsApp bridge, and a bot joined to each room is used to invite a member into the room. Once the conversation ends, the member leaves the room and the process starts again later.

We use a stack in Docker Swarm with 5 services:

We manage an average of 5000 rooms in each deployment. Apart from the owner of the room (the user created by the mautrix bridge for each room), we have about 10-20 users who, as we said before, join and leave the room constantly.

We think this deployment is not complex enough to need workers. Since we don't have a lot of simultaneous /sync requests and even a single /sync request takes a long time to finish, we think using workers is not going to improve the time.

jffrancob commented 2 years ago

Each deployment uses a dedicated VPS, which shows low CPU and memory usage during typical operation. When a single user logs in to Element, the CPU goes to almost 100% for the whole login process:

[CPU usage screenshot]

jffrancob commented 2 years ago

To answer your third question, we have noticed that users who have joined and left rooms many times appear to be more affected, while recently created users are not affected at all. Also, we took a user and performed a leave in all the rooms they had joined, and the /sync time for that user didn't improve.

reivilibre commented 2 years ago

Just a small thing that came to mind:

we have noticed that the users who have joined and left rooms many times appears to be more affected

Are they joining/leaving different rooms each time? Perhaps you would do well by having the users 'forget' the rooms they leave so that they don't come down /sync in the future as historical rooms: https://spec.matrix.org/v1.2/client-server-api/#post_matrixclientv3roomsroomidforget

H-Shay commented 2 years ago

Is this still an issue/has this been resolved?

jffrancob commented 2 years ago

It's still an issue in our environments.

We've tried a lot of approaches, but none of them have improved the /sync time for the users having the issue, not even purging the history (or forgetting the room). In the meantime, we have deactivated the users with long /sync times, created new users to act as replacements, and force-joined the new users into the same rooms the deactivated users were in. The new users can perform /sync in a short time.

Taking this into account, we think the problem with the /sync request is that the state_groups queries go back to the first time the user joined the room, regardless of whether a forget has been executed.

squahtx commented 2 years ago

When users leave the rooms, is the leave initiated by the user themselves, or does the bot kick them?

In the former case, initial sync shouldn't do any processing for those rooms, unless include_leave is specified in the room filter. In the latter case, initial sync will continue to process those rooms, which I can believe would take a long time.


For Synapse developers: _get_all_rooms is called during initial sync and filters out leaves, but not kicks.
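
A rough way to check this from the database (just a sketch; it assumes the local_current_membership and events tables from the Synapse schema, and @supervisor:domain.com is merely the example user from the logs above) is to split the user's current leave memberships by whether the leave event was sent by the user themselves or by someone else:

-- self_leave = true: the user left on their own
-- self_leave = false: someone else sent the leave, i.e. a kick
SELECT (e.sender = m.user_id) AS self_leave, count(*)
FROM local_current_membership m
JOIN events e USING (event_id)
WHERE m.user_id = '@supervisor:domain.com'
  AND m.membership = 'leave'
GROUP BY (e.sender = m.user_id);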

richvdh commented 2 years ago

@jffrancob are you still experiencing this? Are you able to answer squahtx's question?

thommierother commented 2 years ago

Has there been any progress on this issue?

I see the same effect on two different setups (a. dockerized matrix-synapse, b. matrix-synapse in an LXC container; both with HAProxy as an SSL accelerator in front and Postgres in a separate container).

reivilibre commented 1 year ago

@thommierother It might be worthwhile opening your own issue with a fuller description, linking to this one if you think it's related. This issue isn't very clear and the original reporter seems to have gone inactive, so it will probably be cleaner that way and avoid derailing this existing thread in case they turn out to be different things. If you can, try and be quite specific about what you're doing, noting the sorts of questions that have been asked in this thread. Thanks!

richvdh commented 1 year ago

Closing as there has been no update from the original reporter.

It's worth noting that "Initial sync is taking a long time to finish" is pretty much business as usual, at least on a largish account.

thommierother commented 1 year ago

I will open a new issue for this, as I see the problem on a fresh Synapse installation with a fresh account.

bramenn commented 1 year ago

Hi, an update on the test requested by @squahtx:

We took a Synapse deployment where a user was joined to 4000 rooms.

Note: this user was not leaving the rooms where he was interacting; he was kicked by a bot, so he had a lot of kick-related events across thousands of rooms.

For the first test, we kicked this user out of the 4000 rooms he was in, then logged in with this user and had a wait time of over 8 minutes.

For the second test, we removed this user from the 4000 rooms using a leave event (leaving by himself); then we logged in and it took 1 minute and 45 seconds.

For the third test, we invited this user to all 13000 rooms that exist on the entire test server (obviously he was not invited to the rooms he had already joined), and the user automatically accepted the invitations. The user then left all the rooms he had joined. Then we logged in and it took 5 seconds.

thommierother commented 1 year ago

OK, thanks for the info. In that case I guess my problem has another root cause. The server is virtualized (Proxmox LXC container). Will try with a Dendrite (Golang) setup again...