matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.83k stars 2.12k forks source link

Rooms created without any membership (only `m.room.create` event) #15005

Open MadLittleMods opened 1 year ago

MadLittleMods commented 1 year ago

Description

Trying to tie up any loose-ends with the Gitter import as part of the Gitter migration process and noticing that we have some Matrix rooms recorded that seem to broken on the Synapse side. Some of these rooms only have a m.room.create event in it while others have no events which means we can't interact with it at all since permission denied errors everywhere (no membership or power-level events).

The only way we recorded this room on the Gitter side is if we got a 200-response from Synapse originally when it was created. If Synapse is returning with a error-free response, there should be much better guarantees on the room being created successfully (atomic transaction). Of course it's possible this could have been corrupted after the fact though. What most likely happened was the rooms were attempted to be created but timed out. Then we introduced code to reconcile the case where a Matrix room alias exists but we don't have the bridged Matrix room entry recorded, see https://github.com/matrix-org/synapse/issues/15005#issuecomment-1421481395

They all seem to be created on January 10th, 2023 which is probably in the middle of importing messages and the import scripts.

Here is the list of affected rooms that I know of:


In terms of moving Gitter forward, I think we can safely reconcile this by removing the bridged room entry in the Gitter database and the Matrix room alias and have them re-created (tracked by https://gitlab.com/gitterHQ/webapp/-/issues/2858#note_1267698515). But I'm unable to remove the Matrix room alias from the old room (M_FORBIDDEN) 😬

And it would still be good to fix the Synapse bugs that allow this to happen.

Steps to reproduce

  1. Not sure how this happened but it's all Gitter appservice interactions against Synapse

Homeserver

gitter.im

Synapse Version

1.74.0

Installation Method

I don't know

Database

PostgreSQL

Workers

Multiple workers

Platform

gitter.im (gitter.ems.host) runs on EMS but I don't know the details

Python version: 3.11.1

Configuration

No response

Relevant log output

The only relevant logs for this room from the time of creation is this message:

https://modular-euwest2-kibana.proxy.matrix.org/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-8w,to:now-3w))&_a=(columns:!(_source),index:'67745970-aa2d-11ea-96a7-efc84863d0ea',interval:auto,query:(language:kuery,query:'kubernetes.pod.labels.hostname:%22gitter.ems.host%22%20AND%20message:%22!!gHbYoGxiEuXmahoYeh:gitter.im%22'),sort:!(!('@timestamp',desc)))

2023-01-09 17:33:53,203 - synapse.replication.http.send_events - 149 - INFO - POST-4249- Got batch of events to send, last ID of batch is: $JeDmcg6bYb0s-d6pxIbgmfPaJoChH-qWlnTZ3n8xGdU, sending into room: !gHbYoGxiEuXmahoYeh:gitter.im

Database information:

# Connect to the database: https://gitlab.matrix.org/new-vector/internal/-/wikis/EMS-postgres#connecting-to-an-ems-database
$ ems-db-debug gitter.im

select * FROM rooms WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
            room_id            | is_public |         creator          | room_version | has_auth_chain_index
-------------------------------+-----------+--------------------------+--------------+----------------------
 !gHbYoGxiEuXmahoYeh:gitter.im | t         | @gitter-badger:gitter.im | 9            | t

select * FROM room_memberships WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
 event_id | user_id | sender | room_id | membership | forgotten | display_name | avatar_url
----------+---------+--------+---------+------------+-----------+--------------+------------
(0 rows)

select * FROM events WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
 topological_ordering |                   event_id                   |     type      |            room_id            | content | unrecognized_keys | processed | outlier | depth | origin_server_ts |  received_ts  |
   sender          | contains_url |       instance_name       | stream_ordering | state_key | rejection_reason
----------------------+----------------------------------------------+---------------+-------------------------------+---------+-------------------+-----------+---------+-------+------------------+---------------+-------
-------------------+--------------+---------------------------+-----------------+-----------+------------------
                    1 | $JeDmcg6bYb0s-d6pxIbgmfPaJoChH-qWlnTZ3n8xGdU | m.room.create | !gHbYoGxiEuXmahoYeh:gitter.im |         |                   | t         | f       |     1 |    1673285632772 | 1673285633224 | @gitte
r-badger:gitter.im | f            | synapse-event-persister-3 |        94262741 |           |
(1 row)

select * FROM room_aliases WHERE room_id = '!gHbYoGxiEuXmahoYeh:gitter.im';
            room_alias            |            room_id            |         creator
----------------------------------+-------------------------------+--------------------------
 #GreenShoeGarage_Lobby:gitter.im | !gHbYoGxiEuXmahoYeh:gitter.im | @gitter-badger:gitter.im
(1 row)

Anything else that would be useful to know?

No response

clokep commented 1 year ago

This is due to us creating the room not all in one transaction, e.g. the alias is created here:

https://github.com/matrix-org/synapse/blob/2b69fe7320a7ff66c3866979cb6f16754ad29ce9/synapse/handlers/room.py#L887-L895

But if anything fails after that point it will still exist.

clokep commented 1 year ago

In order to clean-up this up safely I ended up using the purge room API with {"erase": true}.

MadLittleMods commented 1 year ago

@clokep Why does the create room endpoint return 200 ✅ before the room is totally created? (worker hand-off with no checking?)

clokep commented 1 year ago

Why does the create room endpoint return 200 ✅ before the room is totally created? (worker hand-off with no checking?)

I haven't found a code-path that could do this, are we sure we got a 200 response and not a timeout or something?

It looks like the various worker hand-offs are all awaited properly, but maybe an error is being swallowed here:

https://github.com/matrix-org/synapse/blob/805b641fb6b31e677278eaf6e27875eba5c2a3d3/synapse/handlers/message.py#L1430-L1449

Or maybe something weird is happening where we don't eject things from the pre-cache if the persisting fails?

clokep commented 1 year ago

I don't see other open issues that /createRoom isn't idempotent.

DMRobertson commented 1 year ago

I don't see other open issues that /createRoom isn't idempotent.

It can take a transaction id and be PUT in Synapse. I think this is unspecced. There's another issue somewhere for this, let me find it

https://github.com/matrix-org/synapse/blob/1182ae50635db94d3c9c47990a0befcbf6306b62/synapse/rest/client/room.py#L154-L158

DMRobertson commented 1 year ago

There's another issue somewhere for this, let me find it

https://github.com/matrix-org/synapse/issues/14535

DMRobertson commented 1 year ago

And also https://github.com/matrix-org/matrix-spec/issues/222

MadLittleMods commented 1 year ago

I haven't found a code-path that could do this, are we sure we got a 200 response and not a timeout or something?

Good call!

It's possible that some recent code filled this in. We introduced some code to resolve a conflict where the Matrix alias exists but we don't have it stored as a bridged room entry yet: https://gitlab.com/gitterHQ/webapp/-/merge_requests/2357 (2023-01-21)

This looks very likely when I extract the date from the bridged room entries in the database since the dates come after that PR:

const mongoUtils = require('gitter-web-persistence-utils/lib/mongo-utils');

const bridgedRoomEntryIds = ['63e17d0e6da0373984be0760', '63ce5f4e6da0373984bd6166', '63ce57ba6da0373984bd612b', '63ce75b26da0373984bd6257', '63cca9586da0373984bd5910', '63ce771a6da0373984bd625a', '63ce59116da0373984bd6137', '63ccaf3a6da0373984bd591d', '63ce591d6da0373984bd6138', '63ce77586da0373984bd625b', '63ce697b6da0373984bd61c1', '63ce78886da0373984bd6264', '63ce79576da0373984bd6265', '63ce6acb6da0373984bd61c5', '63ce5e1f6da0373984bd6158', '63ccb4266da0373984bd5927', '63ce72496da0373984bd622b', '63ce92246da0373984bd6383', '63ccc1716da0373984bd593c'];

bridgedRoomEntryIds.map((bridgedRoomEntryId) => {
  return mongoUtils.getDateFromObjectId(bridgedRoomEntryId);
});

[
  2023-02-06T22:19:58.000Z,
  2023-01-23T10:19:58.000Z,
  2023-01-23T09:47:38.000Z,
  2023-01-23T11:55:30.000Z,
  2023-01-22T03:11:20.000Z,
  2023-01-23T12:01:30.000Z,
  2023-01-23T09:53:21.000Z,
  2023-01-22T03:36:26.000Z,
  2023-01-23T09:53:33.000Z,
  2023-01-23T12:02:32.000Z,
  2023-01-23T11:03:23.000Z,
  2023-01-23T12:07:36.000Z,
  2023-01-23T12:11:03.000Z,
  2023-01-23T11:08:59.000Z,
  2023-01-23T10:14:55.000Z,
  2023-01-22T03:57:26.000Z,
  2023-01-23T11:40:57.000Z,
  2023-01-23T13:56:52.000Z,
  2023-01-22T04:54:09.000Z
]
db.matricesbridgedroom (bridged room entries) of problem rooms ``` [ { "_id" : ObjectId("63e17d0e6da0373984be0760"), "matrixRoomId" : "!gHbYoGxiEuXmahoYeh:gitter.im", "troupeId" : ObjectId("59a20334d73408ce4f731f30"), "__v" : 0 }, { "_id" : ObjectId("63ce5f4e6da0373984bd6166"), "matrixRoomId" : "!OpjflmXuauidodzmJH:gitter.im", "troupeId" : ObjectId("594d0c86d73408ce4f698f0f"), "__v" : 0 }, { "_id" : ObjectId("63ce57ba6da0373984bd612b"), "matrixRoomId" : "!QCVddUkjrHLjEqeXQd:gitter.im", "troupeId" : ObjectId("5a420ea4d73408ce4f84d8ef"), "__v" : 0 }, { "_id" : ObjectId("63ce75b26da0373984bd6257"), "matrixRoomId" : "!wJBkwhqwLssHVjKErI:gitter.im", "troupeId" : ObjectId("5a425341d73408ce4f84df73"), "__v" : 0 }, { "_id" : ObjectId("63cca9586da0373984bd5910"), "matrixRoomId" : "!kkjacXiigblOffltrw:gitter.im", "troupeId" : ObjectId("5a426551d73408ce4f84e18b"), "__v" : 0 }, { "_id" : ObjectId("63ce771a6da0373984bd625a"), "matrixRoomId" : "!etJKlYVRXEtkkMdEOJ:gitter.im", "troupeId" : ObjectId("5a68c628d73408ce4f89f2f6"), "__v" : 0 }, { "_id" : ObjectId("63ce59116da0373984bd6137"), "matrixRoomId" : "!VeVZEnGasXJXpAYECh:gitter.im", "troupeId" : ObjectId("5a690eecd73408ce4f89fcee"), "__v" : 0 }, { "_id" : ObjectId("63ccaf3a6da0373984bd591d"), "matrixRoomId" : "!iJIbqcgSzsAFnwjVBI:gitter.im", "troupeId" : ObjectId("5a6a3541d73408ce4f8a25f9"), "__v" : 0 }, { "_id" : ObjectId("63ce591d6da0373984bd6138"), "matrixRoomId" : "!koFtykttKGRpjbzflq:gitter.im", "troupeId" : ObjectId("5a6a5b82d73408ce4f8a2aed"), "__v" : 0 }, { "_id" : ObjectId("63ce77586da0373984bd625b"), "matrixRoomId" : "!mxfIojpZcyQDBbmmoY:gitter.im", "troupeId" : ObjectId("5a6e1afad73408ce4f8a886f"), "__v" : 0 }, { "_id" : ObjectId("63ce697b6da0373984bd61c1"), "matrixRoomId" : "!ezxePTXYHMSIkpGOiy:gitter.im", "troupeId" : ObjectId("5a7d2122d73408ce4f8c5b93"), "__v" : 0 }, { "_id" : ObjectId("63ce78886da0373984bd6264"), "matrixRoomId" : "!QlEqqNHuTSIHswllOX:gitter.im", "troupeId" : ObjectId("5a8ec7e7d73408ce4f8e56a2"), "__v" : 0 }, { "_id" : ObjectId("63ce79576da0373984bd6265"), "matrixRoomId" : "!XInKHVcDAmmyGBEUwr:gitter.im", "troupeId" : ObjectId("5aa5a2fad73408ce4f90e688"), "__v" : 0 }, { "_id" : ObjectId("63ce6acb6da0373984bd61c5"), "matrixRoomId" : "!KIKPFcoDnhBeZhLNUr:gitter.im", "troupeId" : ObjectId("5aa61f1dd73408ce4f90eff0"), "__v" : 0 }, { "_id" : ObjectId("63ce5e1f6da0373984bd6158"), "matrixRoomId" : "!BcBAOZpxjTAvMDfSHf:gitter.im", "troupeId" : ObjectId("5b0e6243d73408ce4f9b3833"), "__v" : 0 }, { "_id" : ObjectId("63ccb4266da0373984bd5927"), "matrixRoomId" : "!DuyTBdNdlhfXoBFNqr:gitter.im", "troupeId" : ObjectId("5b4d9450d73408ce4fa126e6"), "__v" : 0 }, { "_id" : ObjectId("63ce72496da0373984bd622b"), "matrixRoomId" : "!qgfkmcpByUudrFHTMX:gitter.im", "troupeId" : ObjectId("5bbf3ca6d73408ce4fab1b64"), "__v" : 0 }, { "_id" : ObjectId("63ce92246da0373984bd6383"), "matrixRoomId" : "!tFwVYIoUNZQhRhjFNO:gitter.im", "troupeId" : ObjectId("618418896da037398489cf1f"), "__v" : 0 }, { "_id" : ObjectId("63ccc1716da0373984bd593c"), "matrixRoomId" : "!qoZZIEjvblGKIkqEdj:gitter.im", "troupeId" : ObjectId("63a5beea6da0373984a14ba5"), "__v" : 0 } ] ```

Otherwise, normally I think it would have to be a 200 since we would need to receive a 200 with the room_id returned. But the above situation looks more likely 🕵️‍♀️