matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.83k stars 2.13k forks source link

Strange dataloss after upgrade #16436

Closed XDjackieXD closed 1 year ago

XDjackieXD commented 1 year ago

Description

I upgraded my server where Postgres 13 and Synapse 1.83.0 was installed, to Postgres 15 and Synapse 1.93.0. The process was:

After that my element-web session was still logged in and seeing all conversations as I left off, but I could not interact (write or receive messages) from some of them. My Phone and element desktop were both logged out and upon logging back in a lot of the conversations/rooms were missing (those that were "broken" in the element-web session). All rooms where people from another server were involved could be recovered, by sending myself the "share room" link from the element-web session and clicking on it in the session on my phone but rooms where only my server was involved and got "lost" could not be recovered that way. Rejoining a room this way produces a "you made no changes" Event in the room only visible in the one client that was used to join the room again.

The loss of conversations/rooms seem to be based on creation time of the room and the time goes back roughly two years (dated based on when I remember joining some groups that still worked vs were broken).

It is not very tragic now but it was certainly very annoying and could be tragic on larger installs (I only have a small hand full of users that I know very well so guiding them through the "recovery" was not a huge deal). How can I debug this further and what can I do to prevent this in the future?

Steps to reproduce

Homeserver

chaosfield.at

Synapse Version

1.93.0

Installation Method

Other (please mention below)

Database

Single Postgresql, always was Postgres, never restored from backup.

Workers

Single process

Platform

Archlinux running as a Xen VM. Postgresql and Synapse directly installed from the repository.

Configuration

No response

Relevant log output

When accessing one of the unrecoverable rooms from the element web session that can still see them all:

2023-10-05 17:05:46,706 - synapse.access.http.8008 - 465 - INFO - GET-153 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 6.407sec/0.002sec (0.003sec, 0.000sec) (0.008sec/0.006sec/1) 736B 200 "GET /_matrix/client/v3/sync?filter=6&timeout=30000&since=<notsureifsecret> HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" [0 dbevts]
2023-10-05 17:05:46,785 - synapse.http.server - 124 - INFO - GET-190 - <XForwardedForRequest at 0x<notsureifsecret> method='GET' uri='/_matrix/client/v3/rooms/!<theroomid>%3Achaosfield.at/members?not_membership=leave&at=<notsureifsecret>' clientproto='HTTP/1.0' site='8008'> SynapseError: 404 - Can't find event for token StreamToken(room_key=RoomStreamToken(topological=None, stream=<notsureifsecret>, instance_map=immutabledict({})), presence_key=<notsureifsecret>, typing_key=0, receipt_key=<notsureifsecret>, account_data_key=<notsureifsecret>, push_rules_key=<notsureifsecret>, to_device_key=<notsureifsecret>, device_list_key=<notsureifsecret>, groups_key=0, un_partial_stated_rooms_key=<notsureifsecret>)
2023-10-05 17:05:46,803 - synapse.http.server - 124 - INFO - GET-189 - <XForwardedForRequest at 0x<notsureifsecret> method='GET' uri='/_matrix/client/v3/rooms/!<theroomid>%3Achaosfield.at/messages?limit=50&dir=b&from=<notsureifsecret>&filter=%7B%22lazy_load_members%22%3Atrue%7D' clientproto='HTTP/1.0' site='8008'> SynapseError: 403 - User @<myusername>:chaosfield.at not in room !<theroomid>:chaosfield.at, and room previews are disabled
2023-10-05 17:05:46,805 - synapse.access.http.8008 - 465 - INFO - GET-190 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.023sec/0.017sec (0.004sec, 0.000sec) (0.001sec/0.006sec/1) 355B 404 "GET /_matrix/client/v3/rooms/!<theroomid>%3Achaosfield.at/members?not_membership=leave&at=<notsureifsecret> HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" [0 dbevts]
2023-10-05 17:05:46,811 - synapse.access.http.8008 - 465 - INFO - PUT-184 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.139sec/0.019sec (0.008sec, 0.001sec) (0.016sec/0.083sec/3) 2B 200 "PUT /_matrix/client/v3/user/%40<myusername>%3Achaosfield.at/account_data/im.vector.setting.breadcrumbs HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" [0 dbevts]
2023-10-05 17:05:46,824 - synapse.access.http.8008 - 465 - INFO - GET-189 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.049sec/0.020sec (0.004sec, 0.003sec) (0.003sec/0.009sec/2) 139B 403 "GET /_matrix/client/v3/rooms/!<theroomid>%3Achaosfield.at/messages?limit=50&dir=b&from=<notsureifsecret>&filter=%7B%22lazy_load_members%22%3Atrue%7D HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" [0 dbevts]
2023-10-05 17:05:46,830 - synapse.access.http.8008 - 465 - INFO - GET-191 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.057sec/0.004sec (0.008sec, 0.000sec) (0.009sec/0.010sec/3) 1053B 200 "GET /_matrix/client/v3/sync?filter=6&timeout=30000&since=<notsureifsecret> HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.45 Chrome/114.0.5735.134 Electron/25.6.0 Safari/537.36" [0 dbevts]
2023-10-05 17:05:46,835 - synapse.access.http.8008 - 465 - INFO - GET-195 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.006sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.002sec/3) 1053B 200 "GET /_matrix/client/v3/sync?filter=6&timeout=30000&since=<notsureifsecret> HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" [0 dbevts]
2023-10-05 17:05:46,857 - synapse.http.server - 124 - INFO - GET-197 - <XForwardedForRequest at 0x<notsureifsecret> method='GET' uri='/_matrix/client/v3/rooms/!<theroomid>%3Achaosfield.at/messages?limit=50&dir=b&from=<notsureifsecret>&filter=%7B%22lazy_load_members%22%3Atrue%7D' clientproto='HTTP/1.0' site='8008'> SynapseError: 403 - User @<myusername>:chaosfield.at not in room !<theroomid>:chaosfield.at, and room previews are disabled
2023-10-05 17:05:46,857 - synapse.access.http.8008 - 465 - INFO - GET-197 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.002sec/0.000sec (0.001sec, 0.000sec) (0.000sec/0.001sec/2) 139B 403 "GET /_matrix/client/v3/rooms/!<theroomid>%3Achaosfield.at/messages?limit=50&dir=b&from=<notsureifsecret>&filter=%7B%22lazy_load_members%22%3Atrue%7D HTTP/1.0" "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/118.0" [0 dbevts]

When trying to join the unrecoverable room from my android device:

2023-10-05 17:16:33,776 - synapse.access.http.8008 - 465 - INFO - GET-8186 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.083sec/0.001sec (0.006sec, 0.004sec) (0.005sec/0.055sec/2) 1570B 200 "GET /_matrix/client/r0/sync?filter=6&set_presence=online&timeout=0&since=<notsureifsecret> HTTP/1.0" "Element/1.6.5 (<androiddeviceinfo>)" [0 dbevts]
2023-10-05 17:16:34,421 - synapse.http.server - 124 - INFO - GET-8195 - <XForwardedForRequest at 0x<notsureifsecret> method='GET' uri='/_matrix/client/v3/rooms/!<theroomid>%3Achaosfield.at/messages?limit=50&dir=b&from=<notsureifsecret>&filter=%7B%22lazy_load_members%22%3Atrue%7D' clientproto='HTTP/1.0' site='8008'> SynapseError: 403 - User @<myusername>:chaosfield.at not in room !<theroomid>:chaosfield.at, and room previews are disabled
2023-10-05 17:16:34,734 - synapse.access.http.8008 - 465 - INFO - GET-8196 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.295sec/0.003sec (0.006sec, 0.000sec) (0.004sec/0.276sec/2) 710B 200 "GET /_matrix/client/r0/sync?filter=6&set_presence=online&timeout=0&since=<notsureifsecret> HTTP/1.0" "Element/1.6.5 (<androiddeviceinfo>)" [0 dbevts]
2023-10-05 17:16:34,791 - synapse.http.server - 124 - INFO - POST-8203 - <XForwardedForRequest at 0x<notsureifsecret> method='POST' uri='/_matrix/client/r0/join/!<theroomid>:chaosfield.at?server_name=chaosfield.at' clientproto='HTTP/1.0' site='8008'> SynapseError: 404 - Can't join remote room because no servers that are in the room have been provided.
2023-10-05 17:16:34,792 - synapse.access.http.8008 - 465 - INFO - POST-8203 - <myipaddress> - 8008 - {@<myusername>:chaosfield.at} Processed request: 0.006sec/0.000sec (0.001sec, 0.000sec) (0.001sec/0.002sec/6) 116B 404 "POST /_matrix/client/r0/join/!<theroomid>:chaosfield.at?server_name=chaosfield.at HTTP/1.0" "Element/1.6.5 (<androiddeviceinfo>)" [0 dbevts]

Anything else that would be useful to know?

No response

DMRobertson commented 1 year ago

Did the upgrade include some kind of glibc upgrade? This can cause problems if you're not using the C locale in postgres, see e.g. https://github.com/matrix-org/synapse/issues/6696#issuecomment-575280941

https://github.com/matrix-org/synapse/issues/11568 sounds similar too.

DMRobertson commented 1 year ago

Otherwise, if you have an example of a room that is still missing, we can give you some SQL commands to run directly in the database to diagnose the problem.

XDjackieXD commented 1 year ago

Thanks for the pointers! Yes the update included a glibc upgrade and the locale is not C for whatever reason (I found my initial commands in the shell history for creating the db and it was C so whatever I did in the last years of upgrading managed to accidentally changing the locale...)

So at least I know what to do to prevent this in the future ^^ Is there anything to do right now other than to change the locale? (especially in the direction of "is it likely that the corruption of the db now will cause issues in the future")

XDjackieXD commented 1 year ago

Going through the troubleshooting-sql-queries of the linked issues, the following select query returns results (all others listed here return zero hits):

# ERROR:  could not create unique index "receipts_linearized_uniqueness"
# DETAIL:  Key (room_id, receipt_type, user_id)=(!zXfJBqSUvXySmsZMtB:jki.re, m.read, @tomtau.:matrix.org) is duplicated.
SELECT
    room_id, receipt_type, user_id,
    COUNT( * )
FROM
    receipts_linearized
GROUP BY
    room_id, receipt_type, user_id
HAVING
    COUNT( * ) > 1
ORDER BY
    room_id, receipt_type, user_id;

(I have not run any of the delete queries yet)

DMRobertson commented 1 year ago

Is there anything to do right now other than to change the locale?

Not really. There are some docs on this, see https://matrix-org.github.io/synapse/latest/postgres.html#fixing-incorrect-collate-or-ctype

It's worth fixing this sooner rather than later---as you say, any database problems now will compound into the future.

DMRobertson commented 1 year ago

Going through the troubleshooting-sql-queries of the linked issues, the following select query returns results (all others listed here return zero hits):

I would avoid looking at that specific issue because Synapse's database schema has likely changed a lot since then; it's best to respond to your specific database directly. When you come to import into a new database, postgres will tell you about all the duplicate rows it sees.

XDjackieXD commented 1 year ago

Thanks for the help! I was able to dump, recreate and import and as anticipated there were some duplicates in the receipts_linearized table which I had to manually delete from the SQL dump (about 20 or so entries) before it would import completely.