matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.78k stars 2.12k forks source link

Long running schema deltas are killed by systemd and lead to bootlooping #14401

Open squahtx opened 1 year ago

squahtx commented 1 year ago

When a schema delta takes too long, Synapse will get restarted by systemd, possibly repeatedly.

~~When using postgres, schema deltas are transactional. Synapse will run the schema delta and time out again. ~~ When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail. Tracked in #14909.

This manifested as:

Things we may consider doing:

DMRobertson commented 1 year ago

Is it possible to make the sqlite schema deltas transactional too? (Were they not before---did I break something in #13873?)

squahtx commented 1 year ago

I forgot to mention, that understanding was based on the docstring for executescript(). I don't think #13873 broke it. https://github.com/matrix-org/synapse/blob/8c94dd3a277d4e11192f98a9ca32cb6638606b66/synapse/storage/engines/sqlite.py#L127-L144

squahtx commented 1 year ago

Is it possible to make the sqlite schema deltas transactional too?

We could try injecting BEGIN and COMMIT statements into the script and see what happens. According to https://stackoverflow.com/questions/4692690/is-it-possible-to-roll-back-create-table-and-alter-table-statements-in-major-sql, sqlite does support transactional DDL. It's only MySQL that doesn't.

rltas commented 1 year ago

Probably a duplicate of #14100, also see @richvdh's comment in #13193, which I kind of agree with:

if you've got a big enough sqlite database to notice the delay, you're doing it wrong.

Use postgres or run update_synapse_database. Nothing wrong with preventing corruption in case this problem occurs, but I wouldn't try to support even larger sqlite instances through arbitrary timeouts (until someone hits even that limit).

richvdh commented 1 year ago

When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail.

It looks like this may have been introduced by https://github.com/matrix-org/synapse/pull/13873, which replaced the old handrolled code for parsing sql files with something more conventional. That was probably a good idea, but it does break the transactional semantics we were relying on for our migrations...

richvdh commented 1 year ago

I wonder if it would be better to split this into two issues:

The latter could probably be fixed by "just" making sure that the transaction gets committed after the update to applied_schema_deltas. The former is a bit tricker.

mi-gri commented 1 year ago

Our server was affected by the problem. After an upgrade (1.68 to 1.75) it only loops:

synapse.app._base - 207 - ERROR - main - Exception during startup Traceback (most recent call last): File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/app/homeserver.py", line 340, in setup hs.setup() File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/server.py", line 310, in setup self.datastores = Databases(self.DATASTORE_CLASS, self) File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/databases/init.py", line 74, in init prepare_database( File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/prepare_database.py", line 136, in prepare_database _upgrade_existing_database( File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/prepare_database.py", line 520, in _upgrade_existing_database database_engine.execute_script_file(cur, absolute_path) File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/engines/_base.py", line 145, in execute_script_file cls.executescript(cursor, f.read()) File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/engines/sqlite.py", line 144, in executescript cursor.executescript(script) File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/database.py", line 395, in executescript self._do_execute(self.txn.executescript, sql) # type: ignore[attr-defined] File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/database.py", line 436, in _do_execute return func(sql, *args, **kwargs) sqlite3.OperationalError: table threads already exists

With the help of @richvdh I was able to do the following and get the server running again:

  1. stop the server if it is still running (systemctl stop matrix-synapse)
  2. backup the database file (e.g. cp homeserver.db homeserver.db.bak)
  3. sqlite3 homeserver.db
  4. DROP TABLE threads;
  5. DELETE FROM background_updates WHERE update_name='threads_backfill';
  6. .exit
  7. update_synapse_database --database-config /etc/matrix-synapse/homeserver.yaml and if that finished without issues:
  8. restart the server (systemctl start matrix-synapse)
richvdh commented 1 year ago
  • DROP TABLE threads;
  • DELETE FROM background_updates WHERE update_name='threads_backfill';

Be aware that this is not a general solution to this issue: it is only appropriate if the upgrade fails at a very specific point. If your server is hanging somewhere else in the upgrade cycle, these commands could cause data loss and additional problems.

DMRobertson commented 1 year ago

When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail.

It looks like this may have been introduced by #13873, which replaced the old handrolled code for parsing sql files with something more conventional. That was probably a good idea, but it does break the transactional semantics we were relying on for our migrations...

Sorry, that's my mistake. FWIW the old parser still exists, see e.g. https://github.com/matrix-org/synapse/blob/0a38c7ec6d46b6e51bfa53ff44e51637d3c63f5c/synapse/storage/prepare_database.py#L609

The only other option would be to inject an explicit BEGIN TRANSACTION and COMMIT before and after the script. But that's not a like-for-like replacement for what had before. Maybe it's best to revert back.

(I recall that when I was making the full schema dumps that parser couldn't handle something postgres-specific. Trigger definitions maybe?)

richvdh commented 1 year ago

The only other option would be to inject an explicit BEGIN TRANSACTION and COMMIT before and after the script. But that's not a like-for-like replacement for what had before

It's not, but I think it would have the right semantics, and would avoid us maintaining our own SQL parser, with all the pitfalls that brings.

squahtx commented 1 year ago

The non-transactional execution on sqlite has been split out into #14909.

aceArt-GmbH commented 11 months ago
Not sure if this is worth a separate issue. Synapse 1.92.3. Postgres 16 Fresh database
synapse logs ``` 2023-09-26 10:43:45,982 - root - 348 - WARNING - main - ***** STARTING SERVER ***** 2023-09-26 10:43:45,982 - root - 349 - WARNING - main - Server /opt/venvs/matrix-synapse/lib/python3.10/site-packages/synapse/app/homeserver.py version 1.92.3 2023-09-26 10:43:45,982 - root - 354 - INFO - main - Server hostname: my.server 2023-09-26 10:43:45,983 - root - 355 - INFO - main - Instance name: master 2023-09-26 10:43:45,983 - root - 356 - INFO - main - Twisted reactor: EPollReactor 2023-09-26 10:43:45,983 - synapse.app.homeserver - 350 - INFO - main - Setting up server 2023-09-26 10:43:45,983 - synapse.server - 338 - INFO - main - Setting up. 2023-09-26 10:43:46,000 - synapse.storage.databases - 66 - INFO - main - [database config 'master']: Checking database server 2023-09-26 10:43:46,001 - synapse.storage.databases - 69 - INFO - main - [database config 'master']: Preparing for databases ['main', 'state'] 2023-09-26 10:43:46,001 - synapse.storage.prepare_database - 128 - INFO - main - ['main', 'state']: Checking existing schema version 2023-09-26 10:43:46,002 - synapse.storage.prepare_database - 132 - INFO - main - ['main', 'state']: Existing schema is 81 (+0 deltas) 2023-09-26 10:43:46,002 - synapse.storage.databases.main - 339 - INFO - main - Checking database for consistency with configuration... 2023-09-26 10:43:46,002 - synapse.storage.prepare_database - 426 - INFO - main - Applying schema deltas for v81 2023-09-26 10:43:46,003 - synapse.storage.prepare_database - 561 - INFO - main - Schema now up to date 2023-09-26 10:43:46,003 - synapse.storage.databases - 84 - INFO - main - [database config 'master']: Starting 'main' database 2023-09-26 10:43:46,012 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for pushers(id): 1 2023-09-26 10:43:46,012 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for deleted_pushers(stream_id): 1 2023-09-26 10:43:46,012 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for push_rules_stream(stream_id): 1 2023-09-26 10:43:46,014 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for e2e_cross_signing_keys(stream_id): 1 2023-09-26 10:43:46,014 - synapse.storage.databases.main.event_push_actions - 1152 - INFO - main - Searching for stream ordering 1 month ago 2023-09-26 10:43:46,014 - synapse.storage.databases.main.event_push_actions - 1156 - INFO - main - Found stream ordering 1 month ago: it's 0 2023-09-26 10:43:46,014 - synapse.storage.databases.main.event_push_actions - 1159 - INFO - main - Searching for stream ordering 1 day ago 2023-09-26 10:43:46,015 - synapse.storage.databases.main.event_push_actions - 1163 - INFO - main - Found stream ordering 1 day ago: it's 0 2023-09-26 10:43:46,015 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for push_rules(id): 1 2023-09-26 10:43:46,015 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for push_rules_enable(id): 1 2023-09-26 10:43:46,017 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for access_tokens(id): 1 2023-09-26 10:43:46,017 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for refresh_tokens(id): 1 2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for event_reports(id): 1 2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_stream(stream_id): 1 2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for user_signature_stream(stream_id): 1 2023-09-26 10:43:46,018 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_outbound_pokes(stream_id): 1 2023-09-26 10:43:46,019 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_changes_in_room(stream_id): 1 2023-09-26 10:43:46,019 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_remote_pending(stream_id): 1 2023-09-26 10:43:46,019 - synapse.storage.util.id_generators - 92 - INFO - main - Initialising stream generator for device_lists_changes_converted_stream_position(stream_id): 1 2023-09-26 10:43:46,020 - synapse.storage.databases - 101 - INFO - main - [database config 'master']: Starting 'state' database 2023-09-26 10:43:46,020 - synapse.storage.databases - 116 - INFO - main - [database config 'master']: prepared 2023-09-26 10:43:46,020 - synapse.server - 341 - INFO - main - Finished setting up. 2023-09-26 10:43:46,021 - twisted - 275 - INFO - main - Redirected stdout/stderr to logs 2023-09-26 10:43:46,021 - synapse.app.homeserver - 175 - INFO - sentinel - Running 2023-09-26 10:43:46,023 - synapse.app.homeserver - 29 - INFO - sentinel - Set file limit to: 524288 2023-09-26 10:43:46,063 - synapse.federation.federation_server - 1458 - INFO - sentinel - Registering federation query handler for 'profile' 2023-09-26 10:43:46,063 - synapse.push.pusher - 38 - INFO - sentinel - email enable notifs: False 2023-09-26 10:43:46,066 - synapse.server - 885 - INFO - sentinel - Connecting to redis (path='/var/run/redis/redis-server.sock') for external cache 2023-09-26 10:43:46,066 - synapse.replication.tcp.redis - 285 - INFO - sentinel - Connecting to redis server UNIXAddress('/var/run/redis/redis-server.sock') 2023-09-26 10:43:46,067 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'm.device_list_update' 2023-09-26 10:43:46,067 - sso_mapper - 19 - INFO - sentinel - MAPPING:__init__:{"SSO_me_url": "https://api.my.server/idm-dil/rest/me"} 2023-09-26 10:43:46,174 - synapse.http.client - 414 - INFO - sentinel - Received response to GET https://idp.myserver.de:8443/realms/national/protocol/openid-connect/certs: 200 2023-09-26 10:43:46,175 - shared_secret_authenticator - 55 - INFO - sentinel - Enabled login types: ['com.devture.shared_secret_auth', 'm.login.password'] 2023-09-26 10:43:46,175 - synapse.app._base - 579 - INFO - sentinel - Loaded module 2023-09-26 10:43:46,175 - synapse.util.caches.lrucache - 223 - INFO - sentinel - Expiring LRU caches after 1800 seconds 2023-09-26 10:43:46,176 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'm.presence' 2023-09-26 10:43:46,177 - synapse.federation.federation_server - 1458 - INFO - sentinel - Registering federation query handler for 'directory' 2023-09-26 10:43:46,177 - synapse.replication.tcp.redis - 285 - INFO - sentinel - Connecting to redis server UNIXAddress('/var/run/redis/redis-server.sock') 2023-09-26 10:43:46,178 - synapse.handlers.pagination - 181 - INFO - sentinel - Setting up purge job with config: RetentionPurgeJob(interval=86400000, shortest_max_lifetime=None, longest_max_lifetime=None) 2023-09-26 10:43:46,180 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'm.signing_key_update' 2023-09-26 10:43:46,180 - synapse.federation.federation_server - 1438 - INFO - sentinel - Registering federation EDU handler for 'org.matrix.signing_key_update' 2023-09-26 10:43:46,180 - synapse.federation.federation_server - 1458 - INFO - sentinel - Registering federation query handler for 'client_keys' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/health' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_matrix/client' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/.well-known' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/admin' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/client/pick_idp' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/client/pick_username' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/client/new_user_consent' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/client/sso_register' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/client/unsubscribe' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/client/oidc' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching FilePath('/opt/venvs/matrix-synapse/lib/python3.10/site-packages/synapse/static') to path b'/_matrix/static' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_matrix/media/r0' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_matrix/media/v3' 2023-09-26 10:43:46,183 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_matrix/media/v1' 2023-09-26 10:43:46,184 - twisted - 275 - INFO - sentinel - SynapseSite starting on '/sock/8000.sock' 2023-09-26 10:43:46,184 - synapse.app._base - 442 - INFO - sentinel - Synapse now listening on Unix Socket at: /sock/8000.sock 2023-09-26 10:43:46,187 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/health' 2023-09-26 10:43:46,187 - synapse.util.httpresourcetree - 49 - INFO - sentinel - Attaching to path b'/_synapse/replication' 2023-09-26 10:43:46,187 - twisted - 275 - INFO - sentinel - SynapseSite starting on '/sock/9000.sock' 2023-09-26 10:43:46,187 - synapse.app._base - 442 - INFO - sentinel - Synapse now listening on Unix Socket at: /sock/9000.sock 2023-09-26 10:43:46,237 - synapse.storage.background_updates - 400 - INFO - background_updates-0 - Starting background schema updates for database master 2023-09-26 10:43:46,238 - synapse.replication.tcp.redis - 119 - INFO - sentinel - Connected to redis 2023-09-26 10:43:46,239 - synapse.replication.tcp.redis - 131 - INFO - subscribe-replication-0 - Sending redis SUBSCRIBE for ['my.server/USER_IP', 'my.server'] 2023-09-26 10:43:46,239 - synapse.push.pusherpool - 373 - INFO - start_pushers-0 - Started pushers 2023-09-26 10:43:46,239 - synapse.storage.background_updates - 418 - INFO - background_updates-0 - No more background updates to do. Unscheduling background update task. 2023-09-26 10:43:46,239 - synapse.replication.tcp.redis - 134 - INFO - subscribe-replication-0 - Successfully subscribed to redis stream, sending REPLICATE command 2023-09-26 10:43:46,240 - synapse.replication.tcp.redis - 139 - INFO - subscribe-replication-0 - REPLICATE successfully sent 2023-09-26 10:44:16,177 - synapse.util.caches.lrucache - 210 - INFO - LruCache._expire_old_entries-0 - Dropped 0 items from caches 2023-09-26 10:44:46,176 - synapse.util.caches.lrucache - 210 - INFO - LruCache._expire_old_entries-1 - Dropped 0 items from caches 2023-09-26 10:45:15,185 - twisted - 275 - INFO - sentinel - Received SIGTERM, shutting down. ```
Seems like systemd SIGTERMs synapse. Increasing service `TimeoutSec` just delays things. Postgres is not showing any running queries. Connections are idle

Resolved. Wrong systemd config where ExecStartPre never finished

DMRobertson commented 11 months ago

Not sure if this is worth a separate issue.

@aceArt-GmbH The logs don't show that Synapse is getting killed during the schema migration:

2023-09-26 10:43:46,003 - synapse.storage.prepare_database - 561 - INFO - main - Schema now up to date ... 2023-09-26 10:45:15,185 - twisted - 275 - INFO - sentinel - Received SIGTERM, shutting down.

You should look at systemd's logs to see what is happening. If they show evidence of a bug in Synapse, please open a new issue.