richvdh opened this issue 4 years ago
keywords: restart
A lot of the problem here is this query:
SELECT instance_name, MAX(stream_id) FROM cache_invalidation_stream_by_instance GROUP BY instance_name
... which takes six seconds on a relatively idle database.
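A quick way to see where those six seconds go is to look at the timed plan; the sketch below simply wraps the query in EXPLAIN ANALYZE to show whether the time is spent in the sequential scan or the aggregate (nothing synapse-specific is assumed):
-- If the scan node dominates and reports a very large row count, the table
-- itself is the problem rather than the aggregation.
EXPLAIN ANALYZE
SELECT instance_name, MAX(stream_id)
  FROM cache_invalidation_stream_by_instance
 GROUP BY instance_name;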
this one takes 11 seconds:
SELECT COUNT(*) FROM users WHERE name NOT LIKE '%:matrix.org'
hopefully the latter will be fixed by #8271.
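As an illustration of one way such a count could be made cheap (not necessarily what #8271 does), a partial index whose predicate matches the query exactly lets Postgres answer the COUNT from a much smaller index; the index name below is made up:
-- Hypothetical partial index: only rows matching the predicate are indexed, so
-- SELECT COUNT(*) FROM users WHERE name NOT LIKE '%:matrix.org' can be served
-- from the index instead of scanning the whole users table. CONCURRENTLY avoids
-- holding a long lock on a live database.
CREATE INDEX CONCURRENTLY users_non_local_name_idx
    ON users (name)
    WHERE name NOT LIKE '%:matrix.org';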
I've also manually cleared out cache_invalidation_stream_by_instance, which seems to have mostly resolved this problem, so I'm going to close this and leave #8269 to track a more permanent fix.
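As a sketch of what a more permanent fix might look like (an assumption on my part, not what #8269 has settled on), old invalidation rows could be pruned on a schedule instead of clearing the table by hand; this assumes the table's invalidation_ts column is a millisecond timestamp and uses an arbitrary one-week retention window:
-- Sketch only: the retention window and the use of invalidation_ts are assumptions.
-- Pruning old rows keeps the table small enough that the per-instance
-- MAX(stream_id) query at startup stays cheap.
DELETE FROM cache_invalidation_stream_by_instance
 WHERE invalidation_ts < (EXTRACT(EPOCH FROM NOW()) * 1000)::bigint - (7 * 24 * 3600 * 1000);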
it's still pretty slow, so I'm not sure it's fixed.
shutting down also takes ages: that seems to be mostly because python runs a final GC at shutdown, which can take many seconds. Maybe we can skip that GC. (or maybe it improves if we fix https://github.com/matrix-org/synapse/issues/7176)
(or maybe it improves if we fix #7176)
given we're running zope.interface-4.7.1 on matrix.org, that's sadly unlikely.
Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early atexit.register(os._exit, 0).
I think I want to make this a release blocker; it feels like we've had a big regression here in the last couple of weeks.
I just tried restarting an idle worker:
Oct 01 15:21:52 hippogriff.matrix.org systemd[1882]: Stopping Synapse test_worker1 worker...
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Succeeded.
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: Stopped Synapse test_worker1 worker.
Oct 01 15:21:53 hippogriff.matrix.org systemd[1882]: Starting Synapse test_worker1 worker...
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Main process exited, code=killed, status=15/TERM
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: synapse-worker@test_worker1.service: Succeeded.
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: Stopped Synapse test_worker1 worker.
Oct 01 15:22:09 hippogriff.matrix.org systemd[1882]: Starting Synapse test_worker1 worker...
Oct 01 15:22:35 hippogriff.matrix.org synctl[8957]: started synapse.app.generic_worker('/home/synapse/config/workers/test_worker1.yaml')
Oct 01 15:22:35 hippogriff.matrix.org systemd[1882]: Started Synapse test_worker1 worker.
42 seconds to restart an idle process is no good.
I've also seen the startup sequence take more than 60 seconds (in addition to schema migrations).
This is now much better thanks to #8447. I still think there's work to be done here (on matrix.org, shutdown takes 12 seconds; startup takes 11) so will leave this open.
I just had another restart which took 83 seconds:
A few thoughts:
2021-01-05 11:36:40,797 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for presence_stream(stream_id)
can we make that faster?
2021-01-05 11:36:14,787 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for push_rules_stream(stream_id)
2021-01-05 11:36:21,874 - synapse.storage.util.id_generators - 57 - INFO - None - initialising stream generator for access_tokens(id)
2021-01-05 11:36:29,288 - synapse.storage.databases - 83 - INFO - None - [database config 'master']: Starting 'state' database
apparently the stream generator for push_rules_stream now takes 30 seconds to initialise:
2021-05-26 09:57:14,498 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for pushers(id)
2021-05-26 09:57:14,502 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-05-26 09:57:14,503 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-05-26 09:57:44,476 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for access_tokens(id)
2021-05-26 09:57:45,615 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database
Per https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172, we could skip the final GC with an early atexit.register(os._exit, 0).
We need to be careful that we still run the functions we registered to run when the reactor stops.
I've made a PR to skip the final GC on shutdown, which ought to shave off a few seconds: #10712. Sadly, while testing on a homeserver in Matrix HQ, I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.
We saw the same 30 second delay after initialising the stream generator for push_rules_stream again:
2021-08-31 10:36:32,594 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for pushers(id)
2021-08-31 10:36:32,606 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-08-31 10:36:32,613 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-08-31 10:37:07,958 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for access_tokens(id)
2021-08-31 10:37:08,106 - synapse.storage.util.id_generators - 55 - INFO - main - initialising stream generator for refresh_tokens(id)
2021-08-31 10:37:09,514 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database
After looking at postgres logs, two slow queries on startup were identified, accounting for ~30 seconds.
2021-08-31 10:36:54.115 UTC [matrix main] LOG: duration: 17652.320 ms statement: SELECT user_id, state, last_active_ts, last_federation_update_ts, last_user_sync_ts, status_msg, currently_active FROM presence_stream WHERE state != 'offline'
[...]
2021-08-31 10:37:07.905 UTC [matrix main] LOG: duration: 13210.316 ms statement: SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 2273929705 - 100000 GROUP BY room_id
From discussions with @richvdh and @erikjohnston:
The first query picks out 0 rows from an 800,000-row table with a sequential scan (there is one row for each user synapse has seen). For now we plan to add an index to the table; see the sketch after these points.
The second query uses an index and ought to be fast. When run manually, the query does complete reasonably quickly. It's suspected that the concurrent restart of all the synapse workers was giving postgres a bad time and that rolling restarts would improve things.
Our next step is to check the impact on restart times once these two changes have made it to matrix.org.
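For concreteness, a sketch of the kind of index planned for the first query (the name and exact shape are guesses, not necessarily what will land in synapse): a partial index whose predicate matches the startup query means Postgres only has to look at the handful of non-offline rows.
-- Hypothetical partial index: only non-offline rows are indexed, so the startup
-- SELECT ... FROM presence_stream WHERE state != 'offline' no longer needs a
-- sequential scan over ~800,000 mostly-offline rows.
CREATE INDEX CONCURRENTLY presence_stream_not_offline_idx
    ON presence_stream (state)
    WHERE state != 'offline';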
It's suspected that the concurrent restart of all the synapse workers was giving postgres a bad time and that rolling restarts would improve things.
Related to rolling restarts: https://github.com/matrix-org/synapse/issues/10793
In today's restart, we saw that the 30 second delay (actually 35 seconds) has been halved down to 18 seconds:
2021-09-14 10:20:51,840 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for pushers(id)
2021-09-14 10:20:51,842 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for deleted_pushers(stream_id)
2021-09-14 10:20:51,844 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for push_rules_stream(stream_id)
2021-09-14 10:21:09,785 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for access_tokens(id)
2021-09-14 10:21:09,786 - synapse.storage.util.id_generators - 56 - INFO - main - initialising stream generator for refresh_tokens(id)
2021-09-14 10:21:11,279 - synapse.storage.databases - 85 - INFO - main - [database config 'master']: Starting 'state' database
2021-09-14 10:20:55.415 UTC [matrix main] LOG: duration: 3556.734 ms statement: SELECT user_id, MAX(stream_id) FROM push_rules_stream WHERE stream_id > 2616199 - 100000 GROUP BY user_id
2021-09-14 10:20:56.270 UTC [matrix main] LOG: duration: 639.503 ms statement: SELECT user_id, MAX(stream_id) FROM presence_stream WHERE stream_id > 757284961 - 100000 GROUP BY user_id
2021-09-14 10:21:09.736 UTC [matrix main] LOG: duration: 13282.137 ms statement: SELECT room_id, MAX(stream_ordering) FROM events WHERE stream_ordering > 2308446758 - 100000 GROUP BY room_id
These 3 previously seen queries were responsible for most of those 18 seconds, as expected.
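One way to confirm that the remaining ~13 seconds on the events query really is contention from everything restarting at once, rather than a bad plan, is to capture the plan with buffer statistics during a restart window and compare it with a quiet run; the stream_ordering value below is just the one from the log above:
-- Heavy shared-buffer reads (misses) alongside an otherwise sensible index plan
-- would point at I/O contention during the mass restart rather than the query itself.
EXPLAIN (ANALYZE, BUFFERS)
SELECT room_id, MAX(stream_ordering)
  FROM events
 WHERE stream_ordering > 2308446758 - 100000
 GROUP BY room_id;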
Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7min but I didn't pay that much attention after the 3 minutes mark) during which the dialog displayed that connection was lost. Maybe there are also problems somewhere else in the chain.
I've made a PR to skip the final GC on shutdown, which ought to shave off a few seconds: #10712. Sadly, while testing on a homeserver in Matrix HQ, I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.
The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate.
On a recent restart of matrix.org, almost all of the workers shut down immediately (within a second of the shutdown request). The exceptions were background_worker1.service and the main process, both of whose shutdowns timed out after 10 seconds and were SIGKILLed. I'd love to know what the bg worker and the main process were doing for those 10 seconds. I still also think we should remove the dependency between the main process and the workers: there is no need to wait for all the workers to shut down before beginning to stop the main process.
Anecdata: as an end-user, what I witnessed today was 3+ minutes (I think closer to 6-7min but I didn't pay that much attention after the 3 minutes mark) during which the dialog displayed that connection was lost. Maybe there are also problems somewhere else in the chain.
so yes, it looks like the first /sync request after restart took over 3 minutes to generate a response. Breaking this down:
background_worker1 shutdown complete
2021-10-04 13:39:48,726 - synapse.access.http.8083 - 421 - INFO - GET-288 - ... - 8083 - {@dotyoric:matrix.org} Processed request: 182.045sec/-1.976sec (0.280sec, 0.030sec) (208.099sec/5.580sec/166) 0B 200! "GET /_matrix/client/r0/sync?filter=0&timeout=0&since=....
I think the very slow response - and a lot of the slow worker startup time - is just because we have tens of synapse processes hammering the database after a restart. Rolling restarts should help considerably here.
I saw lots of variation in the shutdown time due to reactor stalls, ranging from a few seconds to 2 minutes.
The reactor really shouldn't be stalling for 2 minutes, even during shutdown. If it is, that is another bug we should investigate.
At the time I made that observation, #10703 was still an issue, so it may very well have been the cause.
Data point: @erikjohnston and @reivilibre were surprised that the restart to upgrade to 1.47.1 seemed noticeably quicker than expected.
Deploying 1.59rc2 to matrix.org today took 42 minutes for 141 workers to shut down, restart and notify systemd sequentially. I make that roughly 18 seconds per worker.
Why does it take 70 seconds to initialise the database? It might be some sort of database connection contention but it would be nice if there were some log messages or something.
what's going on there?