This is happening with the first execution of `_update_current_state_txn`:
```python
def _update_current_state_txn(self, txn, state_delta_by_room, stream_id):
    for room_id, current_state_tuple in iteritems(state_delta_by_room):
        to_delete, to_insert = current_state_tuple

        # First we add entries to the current_state_delta_stream. We
        # do this before updating the current_state_events table so
        # that we can use it to calculate the `prev_event_id`. (This
        # allows us to not have to pull out the existing state
        # unnecessarily).
        #
        # The stream_id for the update is chosen to be the minimum of the stream_ids
        # for the batch of the events that we are persisting; that means we do not
        # end up in a situation where workers see events before the
        # current_state_delta updates.
        #
        sql = """
            INSERT INTO current_state_delta_stream
            (stream_id, room_id, type, state_key, event_id, prev_event_id)
            SELECT ?, ?, ?, ?, ?, (
                SELECT event_id FROM current_state_events
                WHERE room_id = ? AND type = ? AND state_key = ?
            )
        """
        txn.executemany(
            sql,
            (
                (
                    stream_id,
                    room_id,
                    etype,
                    state_key,
                    None,
                    room_id,
                    etype,
                    state_key,
                )
                for etype, state_key in to_delete
                # We sanity check that we're deleting rather than updating
                if (etype, state_key) not in to_insert
            ),
        )
        txn.executemany(
            sql,
            (
                (
                    stream_id,
                    room_id,
                    etype,
                    state_key,
                    ev_id,
                    room_id,
                    etype,
                    state_key,
                )
                for (etype, state_key), ev_id in iteritems(to_insert)
            ),
        )
```
Possibly that inner `SELECT` statement is returning two rows.
So this is due to `current_state_events` having multiple rows with the same `room_id`, `type`, `state_key` values, even though there is a unique index on that table for those columns. We don't currently know how the database ends up in this state.
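For illustration only (the room and user IDs below are made up, not from the affected database): once duplicates exist, the `prev_event_id` lookup above becomes a scalar subquery that matches more than one row, which PostgreSQL refuses to evaluate.

```sql
-- Hypothetical example: if current_state_events contains two rows for the
-- same (room_id, type, state_key), this scalar subquery fails with
-- "ERROR: more than one row returned by a subquery used as an expression".
SELECT (
    SELECT event_id FROM current_state_events
    WHERE room_id = '!example:example.org'
      AND type = 'm.room.member'
      AND state_key = '@alice:example.org'
);
```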
To check if you have the same problem, you can run:

```sql
select count(*), room_id, type, state_key from current_state_events group by 2,3,4 order by count(*) DESC LIMIT 10;
```
and see if any of the numbers in the `count` column are greater than 1. If so, you're affected.
To delete duplicate rows, I first took a backup and then ran:

```sql
DELETE FROM current_state_events a
USING current_state_events b
WHERE a.event_id < b.event_id
  AND a.room_id = b.room_id
  AND a.type = b.type
  AND a.state_key = b.state_key;
```
This deleted 518 rows for me.
Starting the server, things now work without the original exception. However, the `stats.notify_new_event` and `user_directory.notify_new_event` background processes are now throwing exceptions when they run:
```
2020-01-15 14:14:06,055 - synapse.metrics.background_process_metrics - 214 - ERROR - user_directory.notify_new_event-4- Background process 'stats.notify_new_event' threw an exception
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/synapse/synapse/env/lib/python3.7/site-packages/synapse/metrics/background_process_metrics.py", line 212, in run
    return (yield result)
  File "/home/synapse/synapse/env/lib/python3.7/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/home/synapse/synapse/env/lib/python3.7/site-packages/twisted/python/failure.py", line 512, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/home/synapse/synapse/env/lib/python3.7/site-packages/synapse/handlers/stats.py", line 74, in process
    yield self._unsafe_process()
  File "/home/synapse/synapse/env/lib/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "/home/synapse/synapse/env/lib/python3.7/site-packages/synapse/handlers/stats.py", line 100, in _unsafe_process
    self.pos, room_max_stream_ordering
  File "/home/synapse/synapse/env/lib/python3.7/site-packages/synapse/storage/data_stores/main/state_deltas.py", line 51, in get_current_state_deltas
    assert prev_stream_id <= max_stream_id
AssertionError
```
This may cause room stats and the user directory to stop updating. But otherwise things seem to be running fine.
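If you want to poke at why that assertion trips, a rough diagnostic is to compare the background processors' saved stream positions with the newest stream ordering. The table and column names below are my reading of the Synapse schema for this version and may differ, so treat this as a sketch rather than an official check:

```sql
-- Sketch only: stats_incremental_position and user_directory_stream_pos are
-- assumed table names. The assertion fires when a saved position is ahead of
-- the newest event's stream ordering.
SELECT
    (SELECT stream_id FROM stats_incremental_position) AS stats_position,
    (SELECT stream_id FROM user_directory_stream_pos)  AS user_directory_position,
    (SELECT max(stream_ordering) FROM events)          AS max_stream_ordering;
```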
Some more debugging information:
```
synapse=# select i.* from pg_class c join pg_index i on i.indexrelid=c.oid where relname='current_state_events_room_id_type_state_key_key';
 indexrelid | indrelid | indnatts | indisunique | indisprimary | indisexclusion | indimmediate | indisclustered | indisvalid | indcheckxmin | indisready | indislive | indisreplident | indkey | indcollation | indclass       | indoption | indexprs | indpred
------------+----------+----------+-------------+--------------+----------------+--------------+----------------+------------+--------------+------------+-----------+----------------+--------+--------------+----------------+-----------+----------+---------
     163692 |    68784 |        3 | t           | f            | f              | t            | f              | t          | f            | t          | t         | f              | 2 3 4  | 100 100 100  | 3126 3126 3126 | 0 0 0     |          |
(1 row)
```
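If you'd rather not decode the raw `pg_index` columns, the `pg_indexes` system view gives a more readable definition of the same index; this is just a convenience, not part of the fix:

```sql
-- Readable form of the index definition via the pg_indexes system view.
SELECT indexdef
FROM pg_indexes
WHERE indexname = 'current_state_events_room_id_type_state_key_key';
```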
Closing for now as I haven't seen any more `notify_new_event` tracebacks except for the first time starting my server after deleting the rows. I'll keep my eyes peeled though.
It may also be a good idea to reindex your `current_state_events` table:

```sql
REINDEX table current_state_events;
```

as recommended by https://www.postgresql.org/message-id/23007.1473604324%40sss.pgh.pa.us
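Given that the corruption later turned out to affect unique indexes on other tables too (see the comments below), a broader option, if you can afford the locking and downtime, is to rebuild every index in the database rather than just this one table:

```sql
-- Rebuild all indexes in the current database (run while connected to it;
-- the name must match the database you're connected to, 'synapse' here).
-- This takes locks and can be slow on a large Synapse database.
REINDEX DATABASE synapse;
```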
I've had another report of this from @jomat, which I think makes three now. I'm going to try taking it to the postgres mailing list.
right, with the help of the postgres-general mailing list, I have an answer here:
As per https://wiki.postgresql.org/wiki/Locale_data_changes, glibc changed the sort order for UTF8 locales in version 2.28. This means that postgres will be confused about where in an index a given entry should be stored.
Now, synapse should be safe, because the documentation says that you should set `LC_COLLATE='C'` on the postgres database; however empirically not everybody has read that bit of the documentation.
In short, if `\l` shows something other than `C` for the `Collate` value for the synapse database, and you have ever upgraded your database server from Debian 9 (stretch) to Debian 10 (buster), or from Ubuntu 18.04 (bionic) to Ubuntu 18.10 (cosmic), your indexes will be corrupted and you will need to recreate them.
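If you prefer SQL over `\l`, you can also read the collation straight out of the catalog; anything other than `C` (or `POSIX`) for the synapse database means you are potentially affected:

```sql
-- List each database with its LC_COLLATE setting.
SELECT datname, datcollate FROM pg_database ORDER BY datname;
```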
I'm going to call this resolved, though as per #6722 we could do better at stopping people getting into this mess.
I tried to dump a working database with en_US.UTF-8 collation and import it back, but got several errors when creating unique indexes, not only on the `current_state_events` table but on other tables too:
```
ERROR: could not create unique index "current_state_events_room_id_type_state_key_key"
DETAIL: Key (room_id, type, state_key)=(!JoceisEAQRNMtpWVGm:matrix.org, m.room.member, @_oftc_haagch:matrix.org) is duplicated.
ERROR: could not create unique index "destinations_pkey"
DETAIL: Key (destination)=(matrix.org) is duplicated.
ERROR: could not create unique index "receipts_graph_uniqueness"
DETAIL: Key (room_id, receipt_type, user_id)=(!jJmpwspyTtbvTSfcck:matrix.org, m.read, @n-13:matrix.org) is duplicated.
ERROR: could not create unique index "receipts_linearized_uniqueness"
DETAIL: Key (room_id, receipt_type, user_id)=(!zXfJBqSUvXySmsZMtB:jki.re, m.read, @tomtau.:matrix.org) is duplicated.
ERROR: could not create unique index "device_lists_remote_cache_unique_id"
DETAIL: Key (user_id, device_id)=(@gauge:matrix.org, ABYXOHMFSM) is duplicated.
ERROR: could not create unique index "device_lists_remote_extremeties_unique_idx"
DETAIL: Key (user_id)=(@bewe_:gnuradio.org) is duplicated.
```
Maybe this happened earlier, with an older version of Synapse, or is the result of a compression tool like the one discussed in https://github.com/matrix-org/synapse/issues/4372. Either way, the solution is to remove the duplicates. SQL queries to remove them are at https://gist.github.com/MurzNN/fa8f3536be21426c55307761d8dcbc90.
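I haven't reproduced the gist's exact queries here, but for reference this is the general shape of the de-duplication: keep one physical row per key and delete the rest, using `ctid` as a tie-breaker. Shown for the `destinations` table mentioned below; take a backup before running anything like this.

```sql
-- Sketch of the de-duplication pattern (not the gist's exact queries).
DELETE FROM destinations a
USING destinations b
WHERE a.ctid < b.ctid
  AND a.destination = b.destination;
```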
> Maybe this happened earlier, with an older version of Synapse

this is absolutely the sort of symptom you would expect from using a collation setting whose ordering has changed during the lifetime of the database. It is database-level corruption: an application (such as Synapse or a compression tool) cannot cause it.
I can't understand how this can happen even with the wrong collation, if the database already has a unique index on those fields. For example, in the `destinations` table I found two rows with `destination = 'matrix.org'`. But after manually cleaning up the duplicates, new ones don't seem to be created again.
> I can't understand how this can happen even with the wrong collation
https://wiki.postgresql.org/wiki/Locale_data_changes explains it. The problem is that strings may sort in a different order depending on your OS version, which means that postgres cannot reliably find entries in the index.
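For what it's worth, PostgreSQL's `amcheck` extension (shipped in the contrib package since PostgreSQL 10) can detect this kind of btree inconsistency directly, without waiting for a duplicate-key failure on reimport. A quick check against the index discussed above might look like this:

```sql
-- Verify the btree structure of the suspect index with amcheck.
-- bt_index_check errors out if the index ordering is inconsistent.
CREATE EXTENSION IF NOT EXISTS amcheck;
SELECT bt_index_check('current_state_events_room_id_type_state_key_key');
```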
> right, with the help of the postgres-general mailing list
For completeness, Rich's email thread was https://www.postgresql.org/message-id/flat/288488b7-d3b6-905b-3706-43f7a3c7bc52%40matrix.org
Description
I seem to be failing to handle a few PDUs sent to me due to an issue in `federation_sender`.
Version information
Synapse v1.8.0
Edit: It turns out that when this happens, it's possible for the code that sends dummy events to fix forward extremities to break. Thus, forward extremities can build up to the point where your server starts to churn and takes up vast amounts of RAM and CPU.
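If you suspect you're in that state, a quick way to see whether forward extremities are accumulating is to count them per room; the table name here is taken from the Synapse schema, and rooms with counts in the hundreds are the ones likely to cause the churn:

```sql
-- Count forward extremities per room; large counts indicate the problem rooms.
SELECT room_id, count(*) AS extremity_count
FROM event_forward_extremities
GROUP BY room_id
ORDER BY extremity_count DESC
LIMIT 10;
```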