richvdh opened this issue 4 years ago
keywords: restart
A lot of the problem here is this query:
SELECT instance_name, MAX(stream_id) FROM cache_invalidation_stream_by_instance GROUP BY instance_name
... which takes six seconds on a relatively idle database.
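A quick way to see where those six seconds go is to look at the timed plan; the sketch below simply wraps the query in EXPLAIN ANALYZE to show whether the time is spent in the sequential scan or the aggregate (nothing synapse-specific is assumed):
-- If the scan node dominates and reports a very large row count, the table
-- itself is the problem rather than the aggregation.
EXPLAIN ANALYZE
SELECT instance_name, MAX(stream_id)
  FROM cache_invalidation_stream_by_instance
 GROUP BY instance_name;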
this one takes 11 seconds:
SELECT COUNT(*) FROM users WHERE name NOT LIKE '%:matrix.org'
hopefully the latter will be fixed by #8271.
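As an illustration of one way such a count could be made cheap (not necessarily what #8271 does), a partial index whose predicate matches the query exactly lets Postgres answer the COUNT from a much smaller index; the index name below is made up:
-- Hypothetical partial index: only rows matching the predicate are indexed, so
-- SELECT COUNT(*) FROM users WHERE name NOT LIKE '%:matrix.org' can be served
-- from the index instead of scanning the whole users table. CONCURRENTLY avoids
-- holding a long lock on a live database.
CREATE INDEX CONCURRENTLY users_non_local_name_idx
    ON users (name)
    WHERE name NOT LIKE '%:matrix.org';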
I've also manually cleared out cache_invalidation_stream_by_instance, which seems to have mostly resolved this problem, so I'm going to close this and leave #8269 to track a more permanent fix.
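As a sketch of what a more permanent fix might look like (an assumption on my part, not what #8269 has settled on), old invalidation rows could be pruned on a schedule instead of clearing the table by hand; this assumes the table's invalidation_ts column is a millisecond timestamp and uses an arbitrary one-week retention window:
-- Sketch only: the retention window and the use of invalidation_ts are assumptions.
-- Pruning old rows keeps the table small enough that the per-instance
-- MAX(stream_id) query at startup stays cheap.
DELETE FROM cache_invalidation_stream_by_instance
 WHERE invalidation_ts < (EXTRACT(EPOCH FROM NOW()) * 1000)::bigint - (7 * 24 * 3600 * 1000);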
it's still pretty slow, so I'm not sure it's fixed.
shutting down also takes ages: that seems to be mostly because python runs a final GC at shutdown, which can take many seconds. Maybe we can skip that GC. (or maybe it improves if we fix https://github.com/matrix-org/synapse/issues/7176)
(or maybe it improves if we fix #7176)
given we're running zope.interface-4.7.1 on matrix.org, that's sadly unlikely.
Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early atexit.register(os._exit, 0).
I think I want to make this a release blocker; it feels like we've had a big regression here in the last couple of weeks.
I just tried restarting an idle worker:
Oct 01 15:21:52 hippogriff.matrix.org systemd[1882]: Stopping Synapse test_worker1 worker...
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Succeeded.
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: Stopped Synapse test_worker1 worker.
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: Starting Synapse test_worker1 worker...
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Main process exited, code=killed, status=15/TERM
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Succeeded.
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: Stopped Synapse test_worker1 worker.
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: Starting Synapse test_worker1 worker...
Oct 01 15:22:35 hippogriff.matrix.org synctl[8957]: started synapse.app.generic_worker('/home/synapse/config/workers/test_worker1.yaml')
Oct 01 15:22:35 hippogriff.matrix.org systemd[1882]: Started Synapse test_worker1 worker.
42 seconds to restart an idle process is no good.
I've also seen the startup sequence take more than 60 seconds (in addition to schema migrations).
This is now much better thanks to #8447. I still think there's work to be done here (on matrix.org, shutdown takes 12 seconds; startup takes 11) so will leave this open.
I just had another restart which took 83 seconds:
A few thoughts:
2021-01-05 11:36:40,797 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for presence_stream(stream_id)
can we make that faster?
2021-01-05 11:36:14,787 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for push_rules_stream(stream_id)
2021-01-05 11:36:21,874 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for access_tokens(id)
2021-01-05 11:36:29,288 - synapse.storage.databases - 83 - INFO - None - [database config 'master']: Starting 'state' database
apparently the stream generator for push_rules_stream now takes 30 seconds to initialise:
2021-05-26 09:57:14,498 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for pushers(id)
2021-05-26 09:57:14,502 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-05-26 09:57:14,503 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-05-26 09:57:44,476 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for access_tokens(id)
2021-05-26 09:57:45,615 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database
Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early atexit.register(os._exit, 0).
We need to be careful that we still run the functions we registered to run when the reactor stops.
I've made a PR to skip the final GC on shutdown, which ought to shave off a few seconds: #10712. Sadly, while testing on a homeserver in Matrix HQ, I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.
We saw the same 30 second delay after initialising the stream generator for push_rules_stream again:
2021-08-31 10:36:32,594 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for pushers(id)
2021-08-31 10:36:32,606 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-08-31 10:36:32,613 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-08-31 10:37:07,958 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for access_tokens(id)
2021-08-31 10:37:08,106 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for refresh_tokens(id)
2021-08-31 10:37:09,514 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database
After looking at postgres logs, two slow queries on startup were identified, accounting for ~30 seconds.
2021-08-31 10:36:54.115 UTC [matrix main] LOG: duration: 17652.320 ms statement: SELECT user_id, state, last_active_ts, last_federation_update_ts, last_user_sync_ts, status_msg, currently_active FROM presence_stream WHERE state != 'offline'
[...]
2021-08-31 10:37:07.905 UTC [matrix main] LOG: duration: 13210.316 ms statement: SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 2273929705 - 100000 GROUP BY room_id
From discussions with @richvdh and @erikjohnston:
The first query picks out 0 rows from an 800,000-row table with a sequential scan (there is one row for each user synapse has seen). For now we plan to add an index to the table; see the sketch after these points.
The second query uses an index and ought to be fast. When run manually, the query does complete reasonably quickly. It's suspected that the concurrent restart of all the synapse workers was giving postgres a bad time and that rolling restarts would improve things.
Our next step is to check the impact on restart times once these two changes have made it to matrix.org.
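For concreteness, a sketch of the kind of index planned for the first query (the name and exact shape are guesses, not necessarily what will land in synapse): a partial index whose predicate matches the startup query means Postgres only has to look at the handful of non-offline rows.
-- Hypothetical partial index: only non-offline rows are indexed, so the startup
-- SELECT ... FROM presence_stream WHERE state != 'offline' no longer needs a
-- sequential scan over ~800,000 mostly-offline rows.
CREATE INDEX CONCURRENTLY presence_stream_not_offline_idx
    ON presence_stream (state)
    WHERE state != 'offline';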
It's suspected that the concurrent restart of all the synapse workers was giving postgres a bad time and that rolling restarts would improve things.
Related to rolling restarts: https://github.com/matrix-org/synapse/issues/10793
In today's restart, we saw that the 30 second delay (actually 35 seconds) has been halved down to 18 seconds:
2021-09-14 10:20:51,840 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for pushers(id)
2021-09-14 10:20:51,842 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-09-14 10:20:51,844 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-09-14 10:21:09,785 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for access_tokens(id)
2021-09-14 10:21:09,786 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for refresh_tokens(id)
2021-09-14 10:21:11,279 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database
2021-09-14 10:20:55.415 UTC [matrix main] LOG: duration: 3556.734 ms statement: SELECT user_id, MAX(stream_id) FROM push_rules_stream WHERE stream_id > 2616199 - 100000 GROUP BY user_id
2021-09-14 10:20:56.270 UTC [matrix main] LOG: duration: 639.503 ms statement: SELECT user_id, MAX(stream_id) FROM presence_stream WHERE stream_id > 757284961 - 100000 GROUP BY user_id
2021-09-14 10:21:09.736 UTC [matrix main] LOG: duration: 13282.137 ms statement: SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 2308446758 - 100000 GROUP BY room_id
These 3 previously seen queries were responsible for most of those 18 seconds, as expected.
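One way to confirm that the remaining ~13 seconds on the events query really is contention from everything restarting at once, rather than a bad plan, is to capture the plan with buffer statistics during a restart window and compare it with a quiet run; the stream_ordering value below is just the one from the log above:
-- Heavy shared-buffer reads (misses) alongside an otherwise sensible index plan
-- would point at I/O contention during the mass restart rather than the query itself.
EXPLAIN (ANALYZE, BUFFERS)
SELECT room_id, MAX(stream_ordering)
  FROM events
 WHERE stream_ordering > 2308446758 - 100000
 GROUP BY room_id;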
Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7min but I didn't pay that much attention after the 3 minutes mark) during which the dialog displayed that connection was lost. Maybe there are also problems somewhere else in the chain.
I've made a PR to skip the final GC on shutdown, which ought to shave off a few seconds: #10712. Sadly, while testing on a homeserver in Matrix HQ, I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.
The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate.
On a recent restart of matrix.org, almost all of the workers shut down immediately (within a second of the shutdown request). The exceptions were background_worker1.service and the main process, both of whose shutdowns timed out after 10 seconds and were SIGKILLed. I'd love to know what the bg worker and the main process were doing for those 10 seconds. I still also think we should remove the dependency between the main process and the workers: there is no need to wait for all the workers to shut down before beginning to stop the main process.
Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7min but I didn't pay that much attention after the 3 minutes mark) during which the dialog displayed that connection was lost. Maybe there are also problems somewhere else in the chain.
so yes, it looks like the first /sync request after restart took over 3 minutes to generate a response. Breaking this down:
background_worker1 shutdown complete
2021-10-04 13:39:48,726 - synapse.access.http.8083 - 421 - INFO - GET-288 - ... - 8083 - {@dotyoric:matrix.org} Processed request: 182.045sec/-1.976sec (0.280sec, 0.030sec) (208.099sec/5.580sec/166) 0B 200! "GET /_matrix/client/r0/sync?filter=0&timeout=0&since=....
I think the very slow response - and a lot of the slow worker startup time - is just because we have tens of synapse processes hammering the database after a restart. Rolling restarts should help considerably here.
I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.
The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate.
At the time I made that observation, #10703 was still an issue, so it may very well have been the cause.
Data point: @erikjohnston and @reivilibre were surprised that the restart to upgrade to 1.47.1 seemed noticeably quicker than expected.
Deploying 1.59rc2 to matrix.org today took 42 minutes for 141 workers to shut down, restart and notify systemd sequentially. I make that roughly 18 seconds per worker.
Why does it take 70 seconds to initialise the database? It might be some sort of database connection contention but it would be nice if there were some log messages or something.
what's going on there?