squahtx opened 1 year ago
Is it possible to make the sqlite schema deltas transactional too? (Were they not before---did I break something in #13873?)
I forgot to mention: that understanding was based on the docstring for `executescript()`. I don't think #13873 broke it.
https://github.com/matrix-org/synapse/blob/8c94dd3a277d4e11192f98a9ca32cb6638606b66/synapse/storage/engines/sqlite.py#L127-L144
> Is it possible to make the sqlite schema deltas transactional too?

We could try injecting `BEGIN` and `COMMIT` statements into the script and see what happens.
According to https://stackoverflow.com/questions/4692690/is-it-possible-to-roll-back-create-table-and-alter-table-statements-in-major-sql, sqlite does support transactional DDL. It's only MySQL that doesn't.
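As a quick check of that claim, here is a minimal sketch using Python's built-in `sqlite3` module (the table name is made up for illustration). It rolls back a `CREATE TABLE` and confirms the table never materialised:

```python
import sqlite3

# Autocommit mode, so we control the transaction explicitly.
conn = sqlite3.connect(":memory:")
conn.isolation_level = None

conn.execute("BEGIN")
conn.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY)")
conn.execute("ROLLBACK")

# The CREATE TABLE was rolled back: DDL really is transactional in sqlite.
count = conn.execute(
    "SELECT count(*) FROM sqlite_master WHERE name = 'widgets'"
).fetchone()[0]
print(count)  # 0
```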
Probably a duplicate of #14100; also see @richvdh's comment in #13193, which I kind of agree with:

> if you've got a big enough sqlite database to notice the delay, you're doing it wrong.

Use postgres or run `update_synapse_database`. Nothing wrong with preventing corruption in case this problem occurs, but I wouldn't try to support even larger sqlite instances through arbitrary timeouts (until someone hits even that limit).
When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail.
It looks like this may have been introduced by https://github.com/matrix-org/synapse/pull/13873, which replaced the old handrolled code for parsing sql files with something more conventional. That was probably a good idea, but it does break the transactional semantics we were relying on for our migrations...
I wonder if it would be better to split this into two issues:

The latter could probably be fixed by "just" making sure that the transaction gets committed after the update to `applied_schema_deltas`. The former is a bit trickier.
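For reference, the failure mode is easy to reproduce with plain `sqlite3`: `executescript()` runs the statements with no enclosing transaction, so a script that fails partway through leaves its earlier statements applied. A minimal sketch (the duplicate `CREATE TABLE` stands in for a real mid-script failure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The second statement fails, mirroring the "table threads already exists"
# error from the traceback below.
script = """
CREATE TABLE threads (id INTEGER PRIMARY KEY);
CREATE TABLE threads (id INTEGER PRIMARY KEY);
"""

try:
    conn.executescript(script)
except sqlite3.OperationalError as exc:
    print(exc)  # table threads already exists

# The first CREATE TABLE has stuck: the script was half applied.
names = [row[0] for row in conn.execute("SELECT name FROM sqlite_master")]
print(names)  # ['threads']
```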
Our server was affected by this problem. After an upgrade (1.68 to 1.75) it just loops:
```
synapse.app._base - 207 - ERROR - main - Exception during startup
Traceback (most recent call last):
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/app/homeserver.py", line 340, in setup
    hs.setup()
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/server.py", line 310, in setup
    self.datastores = Databases(self.DATASTORE_CLASS, self)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/databases/__init__.py", line 74, in __init__
    prepare_database(
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/prepare_database.py", line 136, in prepare_database
    _upgrade_existing_database(
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/prepare_database.py", line 520, in _upgrade_existing_database
    database_engine.execute_script_file(cur, absolute_path)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/engines/_base.py", line 145, in execute_script_file
    cls.executescript(cursor, f.read())
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/engines/sqlite.py", line 144, in executescript
    cursor.executescript(script)
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/database.py", line 395, in executescript
    self._do_execute(self.txn.executescript, sql)  # type: ignore[attr-defined]
  File "/opt/venvs/matrix-synapse/lib/python3.8/site-packages/synapse/storage/database.py", line 436, in _do_execute
    return func(sql, *args, **kwargs)
sqlite3.OperationalError: table threads already exists
```
With the help of @richvdh I was able to do the following and get the server running again:
- `DROP TABLE threads;`
- `DELETE FROM background_updates WHERE update_name='threads_backfill';`
Be aware that this is not a general solution to this issue: it is only appropriate if the upgrade fails at a very specific point. If your server is hanging somewhere else in the upgrade cycle, these commands could cause data loss and additional problems.
> When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail.
>
> It looks like this may have been introduced by #13873, which replaced the old handrolled code for parsing sql files with something more conventional. That was probably a good idea, but it does break the transactional semantics we were relying on for our migrations...
Sorry, that's my mistake. FWIW the old parser still exists, see e.g. https://github.com/matrix-org/synapse/blob/0a38c7ec6d46b6e51bfa53ff44e51637d3c63f5c/synapse/storage/prepare_database.py#L609
The only other option would be to inject an explicit `BEGIN TRANSACTION` and `COMMIT` before and after the script. But that's not a like-for-like replacement for what we had before. Maybe it's best to revert back.
(I recall that when I was making the full schema dumps that parser couldn't handle something postgres-specific. Trigger definitions maybe?)
> The only other option would be to inject an explicit `BEGIN TRANSACTION` and `COMMIT` before and after the script. But that's not a like-for-like replacement for what we had before
It's not, but I think it would have the right semantics, and would avoid us maintaining our own SQL parser, with all the pitfalls that brings.
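A sketch of the injection idea, assuming nothing Synapse-specific: prepend `BEGIN` to the script text, commit only if the whole script ran, and roll back on failure. Whether this copes with every delta Synapse ships (e.g. a script containing its own transaction control) would need checking:

```python
import sqlite3


def run_script_atomically(conn: sqlite3.Connection, script: str) -> None:
    """Run a SQL script so that it is either fully applied or not at all.

    Assumes the script contains no BEGIN/COMMIT statements of its own.
    """
    # Autocommit mode: we manage the transaction ourselves.
    conn.isolation_level = None
    try:
        # The injected BEGIN puts every statement in one transaction;
        # executescript() stops at the first error, leaving it open.
        conn.executescript("BEGIN;\n" + script)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


# Demo: a script that fails halfway through applies nothing at all.
conn = sqlite3.connect(":memory:")
bad_script = "CREATE TABLE a (x); CREATE TABLE a (x);"
try:
    run_script_atomically(conn, bad_script)
except sqlite3.OperationalError:
    pass
print(conn.execute("SELECT count(*) FROM sqlite_master").fetchone()[0])  # 0
```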
The non-transactional execution on sqlite has been split out into #14909.
Resolved. Wrong systemd config where `ExecStartPre` never finished.
Not sure if this is worth a separate issue.
@aceArt-GmbH The logs don't show that Synapse is getting killed during the schema migration:
```
2023-09-26 10:43:46,003 - synapse.storage.prepare_database - 561 - INFO - main - Schema now up to date
...
2023-09-26 10:45:15,185 - twisted - 275 - INFO - sentinel - Received SIGTERM, shutting down.
```
You should look at systemd's logs to see what is happening. If they show evidence of a bug in Synapse, please open a new issue.
When a schema delta takes too long, Synapse will get restarted by systemd, possibly repeatedly.
~~When using postgres, schema deltas are transactional. Synapse will run the schema delta and time out again.~~
When using sqlite, schema deltas are not fully transactional (see docstring here). The schema delta will end up being half applied and the database will end up in an inconsistent state. Depending on the nature of the schema delta, the next run may fail. Tracked in #14909.

This manifested as:
- #13193
- #14100
- #14333
Things we may consider doing:
- dmr: Make sqlite schema deltas transactional too? Tracked in #14909.