matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.79k stars 2.13k forks source link

Synapse users lose access to all rooms after logging out and in again #11568

Closed ginkel closed 2 years ago

ginkel commented 2 years ago

Description

I was somewhat uncertain whether this may quailify as a support issue, but considering it involves catastrophic data loss, where the synapse instance is unable to locate rooms despite them being present in the database, I am inclined to open a bug report.

I am operating a Synapse (currently v1.48) instance along with an instance of mxisd. The trouble started when I migrated mxisd to ma1sd and - due to a misconfigured URL - Synapse could no longer reach ma1sd. This caused Element (Android) clients to be logged out.

Upon fixing the configuration error and logging in the client again, I was greeted with a room list that just contains one "Empty room" entry.

I do have a backup of the database and file system of the Matrix instance that pre-dates the configuration glitch.

Unfortunately, restoring it to that state does not remediate the situation, i.e., rooms are still gone.

I verified in the database that rooms are present in the rooms table and that the current_state_events table contains an m.room.create event for those rooms.

In the log I see messages like:

Exception: Room !QaxEOpfBhMxtIQrNBv:<redacted> for event $16386208139DWpYC:<redacted> is unknown

But what I wrote above about the room being present on the DB holds true for the room mentioned in the error message.

I can't rule out that the underlying problem has been lingering for a while and was only uncovered by the auth issues that logged out some clients. The Synapse instance is > 2 years old, so it has undergone all database migrations that occurred over time as the software evolved.

I'd appreciate your assistance to figure out what may be causing this and to eventually restore access for all users' rooms.

Thanks, Thilo

P.S.: Most channels were encrypted if that makes any difference

Steps to reproduce

I am afraid that I do not have clear clear instructions how to reproduce the problem except from what I wrote above.

Version information

If not matrix.org:

ginkel commented 2 years ago

Update: I have captured a pcap of the initial login process. The following requests are sent by the client:

image

Which request's response should contain the rooms of the user?

richvdh commented 2 years ago

Which request's response should contain the rooms of the user?

/sync.

I suggest you share your homeserver logs, including the full exception - it's unlikely we'll be able to shed much light without them.

ginkel commented 2 years ago

Thanks for your reply! Please find the server log at info level (w/ some IP addresses redacted) at: https://gist.github.com/ginkel/d954e86015f514df9a49a19dc0178806. The login operation (that ends up with a basically empty client) starts at time index 2021-12-13 23:04:55,879.

Please let me know if you need additional details / more verbose logs.

ginkel commented 2 years ago

Please find the server log at info level

I can easily create a log at DEBUG level, but I'd welcome any hint which sensitive information I'd need to remove...

DMRobertson commented 2 years ago

Thanks @ginkel. I'm surprised to read that restoring the DB from backup didn't resolve the problem.

Not much to go on in those logs, I'm afraid. No logging in the /sync responses, and they all ended up returning with status 200 OK. If possible, could you try logging out and in again on a client and capturing the server logs at DEBUG level? I'm particularly interested in the logs for the first call to /sync after a login (which should have no since= query parameter in the URL, or any sync that doesn't return with a 200 OK.

I can easily create a log at DEBUG level, but I'd welcome any hint which sensitive information I'd need to remove...

Don't have an authoritative answer for you here. You could try emailing them to me (davidr@element.io) or sending them over matrix (@dmrobertson:matrix.org), though I appreciate the latter is hard at the moment. I can see if there are any smoking guns.

DMRobertson commented 2 years ago

Exception: Room !QaxEOpfBhMxtIQrNBv:<redacted> for event $16386208139DWpYC:<redacted> is unknown

I also wanted to note that this looks like a room an event ID for an old room version (it contains a homeserver name). I wonder if that's causing pain somewhere?

ginkel commented 2 years ago

Thanks @ginkel. I'm surprised to read that restoring the DB from backup didn't resolve the problem.

Me too. ;-) I have run the Element Android client in a debugger and AFAICS room membership info gets synced to the client, but somehow has no effect:

image

I can easily create a log at DEBUG level, but I'd welcome any hint which sensitive information I'd need to remove...

Don't have an authoritative answer for you here. You could try emailing them to me (davidr@element.io) or sending them over matrix (@dmrobertson:matrix.org), though I appreciate the latter is hard at the moment. I can see if there are any smoking guns.

I sent you the debug log via Matrix.

ginkel commented 2 years ago

Exception: Room !QaxEOpfBhMxtIQrNBv:<redacted> for event $16386208139DWpYC:<redacted> is unknown

I also wanted to note that this looks like ~a room~ an event ID for an old room version (it contains a homeserver name). I wonder if that's causing pain somewhere?

That room is actually version 1. But v6 rooms (that's the newest I have) also have the homeserver name on the room id.

DMRobertson commented 2 years ago

I sent you the debug log via Matrix.

Thanks for this. I don't see an invite as such yet, perhaps your homeserver isn't federating due to this problem? I'll check again tomorrow, but it might be worth forwarding a copy over email too as a fallback.

ginkel commented 2 years ago

@DMRobertson Weird... Anyway, I just sent you a copy via email. Thanks for your support! :-)

richvdh commented 2 years ago

The logs you've sent to @DMRobertson might shed more light, but in the meantime it might be worth a quick sanity-check on your database. Could you try this query?

                SELECT room_id
                FROM current_state_events AS c
                INNER JOIN events AS e USING (room_id, event_id)
                WHERE
                    c.type = 'm.room.member'
                    AND c.state_key = '@tg:tgbyte.de'
                    AND c.membership = 'join';

It should return a list of rooms your server thinks your user is a member of.

ginkel commented 2 years ago

Interesting. That query comes back empty for my user as well as any other user I tried...

current_state_events contains join events for plenty of room / user combinations, though. It's as if the event IDs didn't match...

richvdh commented 2 years ago

iiinteresting. are you able to DM me? I'm @richvdh:sw1v.org.

ginkel commented 2 years ago

As it turned out, a database locale problem caused JOINs not to work correctly after the restore. This has now been fixed. Thanks for your help!

richvdh commented 2 years ago

related: https://github.com/matrix-org/synapse/issues/6696