matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.83k stars 2.12k forks source link

In one room: "User not in room" when sending messages, "Not a known room" when trying to leave #9666

Closed Ezwen closed 2 years ago

Ezwen commented 3 years ago

Description

Since an upgrade of Synapse to 1.30.0 (and a server restart), I encounter significant problems in a specific room (which before worked without any problem). Here are the symptoms:

Steps to reproduce

What I expect: the message is sent successfully What I get: the error message shown above

Version information

If not matrix.org:

Ezwen commented 3 years ago

When I try sending a message in said room, I have this message in my synapse logs:

synapse.handlers.message - 945 - WARNING - PUT-3537- Denying new event <FrozenEvent event_id='$161644359715JklbV:myserver.fr', type='m.room.message', state_key=None> b
ecause 403: User @XXX:YYY not in room !FhnTYZhdgIipdYUdar:otherserver.org (None)
Ezwen commented 3 years ago

I have checked using the synapse admin API, and my user account is definitely in the room.

erikjohnston commented 3 years ago

That is odd, seems like the server is confused about the state of the room. Can you check that you have rows for the room in the event_forward_extremities table please?

Ezwen commented 3 years ago

Here is what I have:

$ docker-compose exec db psql --user matrix_synapse -c "SELECT * FROM event_forward_extremities WHERE room_id='!FhnTYZhdgIipdYUdar:otherserver.org'"
            event_id             |              room_id
---------------------------------+-----------------------------------
 $16164587839FTJkr:otherserver.org | !FhnTYZhdgIipdYUdar:otherserver.org
(1 row)

(as in all my other messages, I've replaced the real server name by otherserver.org)

erikjohnston commented 3 years ago

Humm. Odd. If you do an API hit to /_matrix/client/r0/rooms/{roomId}/context/{eventId} with that room/event do you see your own membership in there?

The only thing I can think of is that somehow the servers view of the "current state" has diverged from the actual state of the room.

Ezwen commented 3 years ago

I obtain a rather big JSON file, with events and states, where should I look for my membership?

I can only find my user id (@XXX:YYY) in /state/0/content/users, where it seems all users with a certain power level are listed − this would be consistent with the fact that @XXX:YYY is an Administrator in this room.

erikjohnston commented 3 years ago

You should see an event with type: m.room.member and state_key: @XXX:YY with a content.membership: join, if not then that would explain why the server thinks you're not in the room.

Does _matrix/client/r0/rooms/{roomId}/state return a membership event for you?

Ezwen commented 3 years ago

You should see an event with type: m.room.member and state_key: @XXX:YY with a content.membership: join, if not then that would explain why the server thinks you're not in the room.

I see! Well indeed I cannot find any event matching the description you make here, so that would explain.

Does _matrix/client/r0/rooms/{roomId}/state return a membership event for you?

It gives me this result:

{"errcode":"M_FORBIDDEN","error":"User @XXX:YYY not in room !FhnTYZhdgIipdYUdar:otherserver.org, and room previews are disabled"}
erikjohnston commented 3 years ago

The plot thickens. Can you try doing:

SELECT * FROM current_state_events WHERE room_id = ?

and see if there are a) any rows and b) if the user ID appears and if so what the membership is

Ezwen commented 3 years ago

There are lots of rows, including all join events for all members of the room, and including a row with my own membership:

 $1572562716128RmfHF:myserver.fr    | !FhnTYZhdgIipdYUdar:otherserver.org | m.room.member                | @XXX:myserver.fr                 | join

(note: since the beginning I wrote both YYY and myserver.fr but both are in fact the same.)

erikjohnston commented 3 years ago

Ok, so seems like:

  1. The current state of the room thinks you're in the room (which should be the same as the state at the forward extremities)
  2. The state at the forward extremity doesn't think you're in the room
  3. Other servers think you're in the room

I'm slightly perplexed how 1 and 2 can both be true. It's possible to get out of sync with other servers due to state resets, as I note that its a v1 room.

I'm going to have to have a think about how to unwedge this.

Ezwen commented 3 years ago

@erikjohnston Thank you very much for taking all this time to investigate :). If there is anything more I can do to help, please do not hesitate to ask.

I suppose there is no easy workaround to manually fix this in the database?

erikjohnston commented 3 years ago

I suppose there is no easy workaround to manually fix this in the database?

Not off the top of my head, editing the current_state_events table would probably also require manually adding stuff to the current_state_deltas table. You could try having someone kick you from the room, which might work?

Ezwen commented 3 years ago

I suppose there is no easy workaround to manually fix this in the database?

Not off the top of my head, editing the current_state_events table would probably also require manually adding stuff to the current_state_deltas table. You could try having someone kick you from the room, which might work?

I've thought of doing that, but unfortunately I am an admin in this room, therefore un-kickable (at least AFAIK?) :(

Ezwen commented 3 years ago

Out of curiosity, is there any plan to someday add a synapse admin command to "fix inconsistencies in the database", such as this one? It seems to me it could prove useful in many situations.

Ezwen commented 3 years ago

Quick update: I just discovered that I have this very same problem (with the same same error messages) in another matrix room. Since I am not an admin in this second room, I have asked the admin to kick me and to reinvite me, and this did solve the problem for this room as you expected!

Now in my case the problem only remain in this room where I am admin and therefore un-kickable.

Ezwen commented 3 years ago

Second quick update: I found a third matrix room with a probably (?) related problem.

It is a 1-1 room that I hadn't used in a while, and I tried sending a message and I got this error in Element desktop: No create event in auth events. The other person, who is registered on the same homeserver as I, has the same error when sending a message in this room.

On synapse's side, I have errors that look like this:

synapse.handlers.message - 945 - WARNING - PUT-177186- Denying new event <FrozenEvent event_id='$1616850716377rkoye:myserver.fr', type='m.room.message', state_key=None
> because 403: No create event in auth events
erikjohnston commented 3 years ago

Thanks for the updates! Very odd that this has happened to multiple rooms now, though they're both v1 rooms so its possible that they're somehow being state reset (though it seems unlikely for a 1:1 room).

Out of curiosity, is there any plan to someday add a synapse admin command to "fix inconsistencies in the database", such as this one? It seems to me it could prove useful in many situations.

No, not really. In the general case it'd be hard/impossible to do, and take a very long time, but it would be plausible in this case.

I've thought of doing that, but unfortunately I am an admin in this room, therefore un-kickable (at least AFAIK?)

Has anyone else joined/left or otherwise changed state? I think that any state change might be enough to fix it, though possibly not.

Ezwen commented 3 years ago

Has anyone else joined/left or otherwise changed state? I think that any state change might be enough to fix it, though possibly not.

I've tried leaving/joining the room with a completely separate account, but unfortunately that did not change the situation for the first account.

What other kind of state change should I test?

Ezwen commented 3 years ago

Quick update about strange things that happened recently:

Additional information:

erikjohnston commented 3 years ago

Sorry for the delay in getting back with an update. This is an odd situation which is possibly due to state resets in room version 1(?). Unfortunately, we haven't had the time to do a deeper investigation into this. While the issue is concerning, we're not seeing reports from anyone else and so it hasn't made its way to the top of our priority list.

Next steps here would be to spend time looking at the state changes that are happening and generally what the room DAG is doing (on both servers) to try and see what is going on.

piegamesde commented 3 years ago

we're not seeing reports from anyone else

I do have the same issue :wave: cannot send messages, cannot leave. Room version is 6, and it was probably caused by some desync in the Gitter bridge (the broken room is a Gitter DM).

Ezwen commented 3 years ago

Sorry for the delay in getting back with an update. This is an odd situation which is possibly due to state resets in room version 1(?). Unfortunately, we haven't had the time to do a deeper investigation into this. While the issue is concerning, we're not seeing reports from anyone else and so it hasn't made its way to the top of our priority list.

No worries, I perfectly understand, I just wanted to keep you updated :)

Next steps here would be to spend time looking at the state changes that are happening and generally what the room DAG is doing (on both servers) to try and see what is going on.

Unfortunately I am personally unable to do that, but I'll remain subscribed to the issue thread in case there are updates.

onekopaka commented 3 years ago

I've seen the issue in a handful of rooms I administer that are on version 6, which have Half-Shot's discord bridge in them, as well as one room has the matrix-org IRC bridge.

My personal homeserver does not see all the users while the community one does see all users. Interestingly, after I try inviting users I cannot see from my personal homeserver, I will see their typing notifications but not their actual messages. I get similar entries in my logs as this comment both before and after attempting to invite them.

To try to resolve the issue, the two rooms I was having the most noticeable problems on, I tried leaving and rejoining and this did not fix the problem of not seeing those users. Looks like maybe I need to use the Delete Room Admin API to fully clear out my local view that is busted.

Additionally, some users on other homeservers have become unable to interact with the room, until I kick them and they rejoin. One user on another homeserver was completely unable to accept any invitation to one of these rooms, and they ended up creating an account on our community homeserver instead. After looking in the database, that homeserver seemed to have properly received the invitation event.

erikjohnston mentioned looking at state changes, what exactly should I be looking for?

richvdh commented 3 years ago

Thanks for reporting. Unfortunately diagnosing this is going to require detailed familiarity with the state storage schema. Given it's only affecting a small number of servers, I'm afraid it's unlikely we'll be able to find time to investigate in the near future.

svierne commented 2 years ago

I think I'm experiencing the same thing right now. :(

The Arch Linux channel got spammed by hundreds of bots a few days ago and since then I cannot leave the room nor send any messages. The behaviour is exactly as described in OP's first post. I can provide more details if needed.

Is there really no workaround I can try other than asking the Arch Linux staff to kick me? O_o

Thanks in advance for any update on this! (:

Edit: Actually, I cannot see any new messages posted by others.

Joecstarr commented 2 years ago

I was having similar behavior. I'm not positive exactly what solved it, but I'll try to document what I did.

The issue first appeared in a room ("clipboard") where I was the only user. I wasn't able to send, receive, leave, etc. Then for some reason it "spread" to the primary room my wife and I use. Neither of us could do anything in the room.

Same error messages/behavior mentioned above:

I tried to use the "dev tools" in element to force a room update. No luck. I tried to use an appservice to force a state change. No luck. I used the admin API to try to force a state change. No luck.

I gave up and was restoring an old dump of the database and noticed an error when importing the dump.

pg_restore: error: could not execute query: ERROR:  could not create unique index "room_account_data_uniqueness"
DETAIL:  Key (user_id, room_id, account_data_type)=(<userID>, <clipboard_roomID>, m.fully_read) is duplicated.
Command was: ALTER TABLE ONLY public.room_account_data
    ADD CONSTRAINT room_account_data_uniqueness UNIQUE (user_id, room_id, account_data_type);

So I went back to the "broken" database and deleted the extra row. Everything seems to work now, both the "main" room and my "clipboard" room.

The errors only started when I added a database backup script. I think that might be the root cause for me. I lost access to the "main" room around the time the backup script should run. Maybe there's a race condition somewhere.

Ezwen commented 2 years ago

While dealing with another issue https://github.com/matrix-org/synapse/issues/12549, I ended up having to use the synapse admin API to delete the problematic rooms with force_purge : true, like this:

curl -H "Authorization: Bearer <access token>" 'https://<homeserver>/_synapse/admin/v2/rooms/<room_id>' -X DELETE -d '{"force_purge": true}'

If the room was shared over federation, you can try joining it again, if it is allowed in the room's parameters.

richvdh commented 2 years ago

I think we can close this now. It turned out that @Ezwen had been suffering corruption in the underlying database (with errors from postgres such as missing chunk number 0 for toast value 143975 in pg_toast_16631, which will explain the erratic behaviour reported here.

ZhenyaPav commented 1 year ago

@richvdh Could you advise on what steps should be taken to fix the db corruption in such case? I'm having an identical issue right now.

skibbipl commented 1 year ago

Same here, any help would be appreciated.

ZhenyaPav commented 1 year ago

Same here, any help would be appreciated.

I think I solved the issue myself by removing my user from the room in the database, then purging it