matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.81k stars 2.13k forks source link

synapse_port_db - KeyError: 'key' #6548

Closed moqmar closed 4 years ago

moqmar commented 4 years ago

Description

After some time (not immediately like with #6544), "synapse_port_db" cancels with an error in line 597, seemingly at a random location (the last "abcd: 100% (54/54)" line varies with every run:

stream_ordering_to_exterm: 7% (1000/13777)
2019-12-14 11:24:52,063 - synapse.storage.database - 446 - WARNING - Starting db txn 'select' from sentinel context
2019-12-14 11:24:52,064 - synapse.storage.database - 485 - WARNING - Starting db connection from sentinel context: metrics will be lost
2019-12-14 11:24:52,174 - synapse.storage.database - 446 - WARNING - Starting db txn 'insert' from sentinel context
2019-12-14 11:24:52,174 - synapse.storage.database - 485 - WARNING - Starting db connection from sentinel context: metrics will be lost
2019-12-14 11:24:52,180 - synapse_port_db - 607 - ERROR - 
Traceback (most recent call last):
  File "/usr/local/bin/synapse_port_db", line 597, in run
    [self.handle_table(*res) for res in setup_res], consumeErrors=True
twisted.internet.defer.FirstError: FirstError[#122, [Failure instance: Traceback: <class 'KeyError'>: 'key'
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:501:errback
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:568:_startRunCallbacks
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:654:_runCallbacks
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1475:gotResult
--- <exception caught here> ---
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/bin/synapse_port_db:277:handle_table
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/bin/synapse_port_db:439:handle_search_table
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:456:runInteraction
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:504:runWithConnection
/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py:250:inContext
/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py:266:<lambda>
/usr/local/lib/python3.7/site-packages/twisted/python/context.py:122:callWithContext
/usr/local/lib/python3.7/site-packages/twisted/python/context.py:85:callWithContext
/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py:306:_runWithConnection
/usr/local/lib/python3.7/site-packages/twisted/python/compat.py:464:reraise
/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py:297:_runWithConnection
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:501:inner_func
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:339:new_transaction
/usr/local/bin/synapse_port_db:425:insert
/usr/local/bin/synapse_port_db:425:<listcomp>
]]
Traceback (most recent call last):
  File "/usr/local/bin/synapse_port_db", line 597, in run
    [self.handle_table(*res) for res in setup_res], consumeErrors=True
twisted.internet.defer.FirstError: FirstError[#122, [Failure instance: Traceback: <class 'KeyError'>: 'key'
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:501:errback
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:568:_startRunCallbacks
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:654:_runCallbacks
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1475:gotResult
--- <exception caught here> ---
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/bin/synapse_port_db:277:handle_table
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/bin/synapse_port_db:439:handle_search_table
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:456:runInteraction
/usr/local/lib/python3.7/site-packages/twisted/internet/defer.py:1416:_inlineCallbacks
/usr/local/lib/python3.7/site-packages/twisted/python/failure.py:512:throwExceptionIntoGenerator
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:504:runWithConnection
/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py:250:inContext
/usr/local/lib/python3.7/site-packages/twisted/python/threadpool.py:266:<lambda>
/usr/local/lib/python3.7/site-packages/twisted/python/context.py:122:callWithContext
/usr/local/lib/python3.7/site-packages/twisted/python/context.py:85:callWithContext
/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py:306:_runWithConnection
/usr/local/lib/python3.7/site-packages/twisted/python/compat.py:464:reraise
/usr/local/lib/python3.7/site-packages/twisted/enterprise/adbapi.py:297:_runWithConnection
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:501:inner_func
/usr/local/lib/python3.7/site-packages/synapse/storage/database.py:339:new_transaction
/usr/local/bin/synapse_port_db:425:insert
/usr/local/bin/synapse_port_db:425:<listcomp>
]]

Version information

moqmar commented 4 years ago

Small explanation on how I'm running synapse_port_db with Docker:

  1. Stop the synapse container
  2. Start the shell with docker-compose run --entrypoint /bin/ash --rm synapse
  3. Make sure that the PostgreSQL database is not running
  4. Compile the configuration with ./start.py, if it's done compiling the config, exit Synapse with Ctrl+C
  5. Start the database server
  6. Run the script with synapse_port_db --sqlite-database /data/homeserver.db.snapshot --postgres-config /compiled/homeserver.yaml
anoadragon453 commented 4 years ago

Hi @moqmar. I'm unable to reproduce this following your instructions with a current copy of Synapse v1.12.1rc1, however my guess from the traceback is that the issue may have been caused by this line:

https://github.com/matrix-org/synapse/blob/e577c5d607b1edb827a5a137dd77587e394fdf5b/scripts/synapse_port_db#L428

Where this SQL:

https://github.com/matrix-org/synapse/blob/e577c5d607b1edb827a5a137dd77587e394fdf5b/scripts/synapse_port_db#L384-L388

was able to pull a row from the event_search table that had a NULL 'key' column. This shouldn't be possible as that field isn't marked as nullable, but perhaps some database corruption occurred. Do you still have access to a database with this problem? I'd like to run some tests with it if so.

As a workaround you can try searching for any affected rows with the following SQL:

SELECT * FROM event_search WHERE key is NULL;

If any rows turn up, replace the NULL with the correct 'key' column value (look at adjacent rows for examples), or just feel free to delete the affected rows instead. As this table is just for searching for events, so removing a few entries is harmless. Afterwards, check if problems with porting still persists.

moqmar commented 4 years ago

Hm, doesn't look like that's the issue... grafik This is what the table generally looks like on my instance: grafik

anoadragon453 commented 4 years ago

Huh, I can see from your database that your column is named KEY instead of key.

I'm not sure how your database ended up that way, but that would explain not being able to retrieve key from the row.

Either you should rename the db column, or change synapse_port_db by editing:

row["key"]

to

row["KEY"]
moqmar commented 4 years ago

I can probably test this on Sunday, synapse_port_db currently says there's something wrong with the default collation of the Postgres Docker image, but I'm also using a bit of an older version probably...

anoadragon453 commented 4 years ago

Having the incorrect collation is something that's known to cause DB corruption, so that may be the cause of things. Although I hope the docker image wasn't the one responsible for creating it.

anoadragon453 commented 4 years ago

@moqmar Did you get a chance to test this?

moqmar commented 4 years ago

Finally got a chance to test this, and yes, changing key to KEY in the migration script fixed the issue! Thanks for your help! Do you want to keep this issue open to add the wrong column name as a fallback in the script? I have no idea where it came from, never changed anything with the database.

Having the incorrect collation is something that's known to cause DB corruption, so that may be the cause of things. Although I hope the docker image wasn't the one responsible for creating it.

Seems like that was just because I used the default Docker image of Postres, and it automatically creates a database with the wrong collation - I recreated it with the correct one according to the instructions in https://github.com/matrix-org/synapse/blob/master/docs/postgres.md, and got the script to run that way.

anoadragon453 commented 4 years ago

@moqmar Glad to hear it's resolved!

I don't see anything in the codebase that might've caused this... so I'm willing to believe it was random corruption, and if that's the case then it's likely enough to happen to any column, not just key. If we come across another cause then we will of course file and fix it.

Seems like that was just because I used the default Docker image of Postres, and it automatically creates a database with the wrong collation.

Hmm, I'll have a look at that - seems like it will be a subtle but confusing issue for sysadmins down the road.

Closing this issue for now, thanks again!

richvdh commented 4 years ago

Seems like that was just because I used the default Docker image of Postres, and it automatically creates a database with the wrong collation.

Hmm, I'll have a look at that - seems like it will be a subtle but confusing issue for sysadmins down the road.

If the database is created by the postgres docker image, that's not a thing we can change, since we aren't responsible for those images.

In any case, synapse now refuses to initialise a new database with the wrong collation, so it will be much harder to fall into this trap.