matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.
https://matrix-org.github.io/synapse
Apache License 2.0
11.82k stars 2.13k forks source link

Event got sent but not synced back down everywhere #9424

Closed bwindels closed 2 years ago

bwindels commented 3 years ago

The event in question is https://app.element.io/#/room/!RbKRuLTvHIhArQcvjr:matrix.org/$1613561441289661rpDpF:matrix.org?via=matrix.org&via=vector.modular.im&via=t2l.io

The client that sent the event (element desktop) and my dekstop hydrogen running alongside did not get the event from sync.

On element desktop the event was black (e.g. sent) but stuck to the bottom of the room, and on hydrogen I check the storage and it showed that there was no limited sync for that room around those messages, and there is no trace of that event anywhere.

When opening my phone hydrogen an hour or so later, the event was synced down to it in the correct order.

At least T3chguy and Ryan did receive the event in the correct order.

BillCarsonFr commented 3 years ago

we have also several report of that on android https://github.com/matrix-org/element-android-rageshakes/issues/20724 https://github.com/matrix-org/element-android-rageshakes/issues/20672 https://github.com/matrix-org/element-android-rageshakes/issues/20578

I also experienced it on other client (web)

Also this issue https://github.com/vector-im/element-android/issues/516

BillCarsonFr commented 3 years ago

https://github.com/matrix-org/synapse/issues/9624

bwindels commented 3 years ago

@callahad @anoadragon453 I renamed this ticket as #9624 (which was marked as a dupe) is about the recipient and not the sender not receiving the message. FYI if this changes the severity of the bug.

callahad commented 3 years ago

Just an update: This is on our radar, but we expect it to be a pretty big yak to shave, so it's currently deferred until we can wrap up some other hairy problems 🐃

richvdh commented 3 years ago

Matthew reported another instance of this today. Link and details for internal users at https://github.com/matrix-org/element-web-rageshakes/issues/5808

anoadragon453 commented 3 years ago

And another set by Matthew on Aug 10:

Room ID: !JjwYSGirOwIXEpJgzg:matrix.org
Event ID: $ZrcrvSAESv5PBzszSXpT5EahA2eQR-raFbFfSY314LQ
erikjohnston commented 3 years ago

We should add some more logs to try and track down where the event is going missing. I'd probably do something like:

etc. The idea being that hopefully we'll be able to see if the missing event is pulled out of the DB and if so where we drop it (if we ever do).

ara4n commented 3 years ago

And another just now...


Room ID: !KzalCNJxkqtytlbQvX:matrix.org
Event ID: $16310279891219482qXcHE:matrix.org
ara4n commented 3 years ago

...and again...


Room ID: !KzalCNJxkqtytlbQvX:matrix.org
Event ID: $16311860683330067SxAox:matrix.org
ara4n commented 3 years ago

...and again...


Room ID: !KzalCNJxkqtytlbQvX:matrix.org
Event ID: $16322575506214369TPVVU:matrix.org
ara4n commented 3 years ago

...and again...


Room ID: !rZUxrMFEsvfBNGpmDf:matrix.org
Event ID: $1633697134169238TSzOW:matrix.org
erikjohnston commented 3 years ago

The event in question:

matrix=> select topological_ordering, instance_name, stream_ordering, to_timestamp(origin_server_ts/1000) from events where event_id = '$1633697134169238TSzOW:matrix.org';
 topological_ordering |   instance_name   | stream_ordering |      to_timestamp      
----------------------+-------------------+-----------------+------------------------
                 3229 | event_persister-3 |      2372403105 | 2021-10-08 13:45:34+01
(1 row)

The trace for the sync that should have returned it (will disappear soon): https://jaeger.int.matrix.org/trace/901206efde7b4ac6, which happened on synchrotron-36.

Selection_054

Selection_056 Selection_057

The /sync gets woken up twice, the first with a since and now token of: RoomStreamToken(topological=None, stream=2372403104, instance_map=<frozendict {'event_persister-2': 2372403109, 'event_persister-1': 2372403109}>) RoomStreamToken(topological=None, stream=2372403115, instance_map=<frozendict {}>) this should have picked up the sent event, as it was sent on event_persister-3 with stream ordering 2372403105. However, it decides that no rooms have changed: rooms_have_changed: false. The subsequent time the sync gets woken up the token are: RoomStreamToken(topological=None, stream=2372403104, instance_map=<frozendict {'event_persister-2': 2372403109, 'event_persister-1': 2372403109}>) RoomStreamToken(topological=None, stream=2372403130, instance_map=<frozendict {}>), which also finds that no rooms have changed.

The RoomStreamToken and PersistedEventPosition work correctly with these values:

In [1]: from synapse.types import RoomStreamToken, PersistedEventPosition                                                                                                                                         

In [2]: from frozendict import frozendict                                                                                                                                                                         

In [3]: since = RoomStreamToken(None, 2372403104, frozendict({'event_persister-2': 2372403109, 'event_persister-1': 2372403109}))                                                                                 

In [4]: pos = PersistedEventPosition("event_persister-3", 2372403105)                                                                                                                                             

In [5]: pos.persisted_after(since)                                                                                                                                                                                
Out[5]: True

We see in the synchtron36 logs:

2021-10-08 12:45:34,307 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-140621334838808 - Handling 'POSITION events event_persister-3 2372403104 2372403108'

which is worrying, as event_persister-3 did write the event in the range 2372403104..2372403108. The event persister should not have sent out a POSITION with that range, as it means "I did nothing in this range".

I think there is a race between workers sending out the appropriate RDATA and us sending out the periodic POSITION.

erikjohnston commented 3 years ago

On event persister 3:

2021-10-08 12:45:34,150 - synapse.replication.tcp.resource - 188 - INFO - replication_notifier-6533172 - Streaming: events -> 2372403104
...
2021-10-08 12:45:34,298 - synapse.replication.tcp.resource - 204 - INFO - replication_notifier-6533172 - Sending position: events -> 2372403108
...
2021-10-08 12:45:34,330 - synapse.access.http.18129 - 421 - INFO - POST-3580545 - ::ffff:10.103.0.1 - 18129 - {@matthew:matrix.org} Processed request: 0.183sec/0.009sec (0.003sec, 0.000sec) (0.000sec/0.000sec/0) 71B 200 "POST /_synapse/replication/send_event/%241633697134169238TSzOW%3Amatrix.org/shpMywvFLw HTTP/1.1" "Synapse/1.44.0rc2 (b=matrix-org-hotfixes,ebbd37b66)" [0 dbevts]

...
2021-10-08 12:45:34,366 - synapse.replication.tcp.resource - 188 - INFO - replication_notifier-6533173 - Streaming: events -> 2372403114

which looks to confirm that we're spuriously sending out a POSITION instead of the event in question.

c.f. https://github.com/matrix-org/synapse/blob/51a5da74ccd383806378b53ee8a09e27a8829f31/synapse/replication/tcp/resource.py#L186-L215

erikjohnston commented 3 years ago

Fetching the updates doesn't appear to be hitting any caches, and just does a query with a WHERE ? < stream_ordering <= ? clause, so I don't think its caching.

The query in question returns the row now:

SELECT e.stream_ordering, e.event_id, e.room_id, e.type,
                 state_key, redacts, relates_to_id, membership, rejections.reason IS NOT NULL
                 FROM events AS e
                 LEFT JOIN redactions USING (event_id)
                 LEFT JOIN state_events USING (event_id)
                 LEFT JOIN event_relations USING (event_id)
                 LEFT JOIN room_memberships USING (event_id)
                 LEFT JOIN rejections USING (event_id)
                 WHERE 2372403104 < stream_ordering AND stream_ordering <= 2372403108
                 AND instance_name = 'event_persister-3'
                 ORDER BY stream_ordering ASC
                 LIMIT 100;

My current best guess is that there is a bug in MultiWriterIdGenerator which is causing it to return a current token for the local instance before all previous tokens have finished persisting.

ara4n commented 3 years ago

...and another by @hughns this morning (taken from his rageshake):

2021-10-11T10:33:35.753Z I Event sent to !ZiIcIhEEPvPrkUeheO:matrix.org with event id $16339484157934361ksrnd:matrix.org
2021-10-11T10:33:35.753Z I setting pendingEvent status to sent in !ZiIcIhEEPvPrkUeheO:matrix.org event ID ~!ZiIcIhEEPvPrkUeheO:matrix.org:m1633948414939.743 -> $16339484157934361ksrnd:matrix.org

...and the remote echo never comes.

erikjohnston commented 3 years ago

Worth noting that the stream ID sent by the POSITION 2372403108 was written by event_persister-2, and gets replicated to event_persister-3 at:

2021-10-08 12:45:34,265 - synapse.replication.tcp.handler - 509 - INFO - replication_command_handler-139816608471832 - Handling 'POSITION events event_persister-2 2372403107 2372403108'

Which is before the missing event is finished persisting (it's received_ts is 12:45:34,280). This makes it look very much like the event stream ID generator is incorrectly returning a current token of 2372403108 even though 2372403105 hasn't been persisted yet.

For reference, the code where we update the current position of the local instance when we receive updates from remote instances is:

https://github.com/matrix-org/synapse/blob/a7d22c36dbbbdd396aeb8938b57b5fd7edb689f3/synapse/storage/util/id_generators.py#L603-L611

erikjohnston commented 3 years ago

Turns out there is a race between the ID generator being allocated stream IDs and adding them to its list of unpersisted stream IDs, which meant that it could incorrectly advance the current position too far. #11045 should fix it.

ara4n commented 3 years ago

...and one from Amandine...

!KzalCNJxkqtytlbQvX:matrix.org
$16339912738468991gJzQg:matrix.org
erikjohnston commented 3 years ago

I've deployed a change to matrix.org that should hopefully fix this. Let me know if it happens again

ara4n commented 3 years ago

it happened again:


Room ID: !FuBNfHZXzpaxULYtAd:matrix.org
Event ID: ~!FuBNfHZXzpaxULYtAd:matrix.org:m1634134221985.8790
ara4n commented 3 years ago

actually, this looks like an e2ee client bug

MatMaul commented 2 years ago

@ara4n what drove you to an e2ee client bug ? do you have a bug open somewhere ?

Asking because I have a similar case. Doing a clear cache gave us back the message, but undecryptable because of duplicate message index, possible replay attack. Other people received the message just fine.

MatMaul commented 2 years ago
Room ID: !xjfBBlIpIRdmnzhWYJ:matrix.org
Event ID: $PcE5p8JizhZHUkjh7YDuWf-ktT30VhbsAaFZr3dvUX0
Device ID: VCTTPNSSYZ
reivilibre commented 2 years ago

Going to re-open until we track down the above. Will probably want to look through the logs before they get rotated out (the event happened on Monday I believe it was said) — going to raise at triage meeting tomorrow because I'm not clear on what I'm looking for.

reivilibre commented 2 years ago

@MatMaul As an update, it's unlikely we can debug the issue you had on Monday because most of the interesting information is in Jaeger, which people like Matthew are set up to trigger every time, but other people something like 1 in 50 requests. To troubleshoot this, we would probably want a sequence of 3 /syncs, which would give you a (1/50)³ or 1/125000 chance of having hit the right conditions to be sampled by Jaeger. To my understanding, with your consent, this could also be enabled for you if desired, in case the bug happens again in the future (I'm not personally sure what the considerations are, but you can ask one of us in the backend room of course...).

It would still be interesting to know from @ara4n why the previous issue was thought to be a client E2EE bug and not this issue (to see if it's the same for your case).

bmarty commented 2 years ago

FTR We implemented a workaround in https://github.com/vector-im/element-android/pull/4928, which will be available in Element Android 1.3.16 (next release)

bradtgmurray commented 2 years ago

I think maybe I just ran into this?

Newly bridged room, events in the room that looks like this.

topological_ordering event_id type room_id content unrecognized_keys processed outlier depth origin_server_ts received_ts sender contains_url instance_name stream_ordering state_key rejection_reason
1 $p9Abgg9tv02f6tKODR7wIy0i7qV_IHti4_C998jbQVE m.room.create !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 1 1646688871373 1646688871414 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491927 NULL
2 $XP3flh1UMzMlCsksvaUy5QqWA3B5zZStw7feahPSUpo m.room.member !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 2 1646688871490 1646688871537 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491928 @_adam203072022_whatsapp_xxx:beeper-dev.com NULL
3 $TM35mHbuMlFmWiospfJm9m8ABaVNAORQ2_jdCaTkJBI m.room.power_levels !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 3 1646688871610 1646688871659 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491929 NULL
4 $CCkySQxH_4sjBfTsJxFggOMiwQ3C3yhQkNk2T3iQjvI m.room.join_rules !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 4 1646688871718 1646688871763 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491930 NULL
5 $4UzYP-X6ed6JYls1e47wOWEo1MblN1sWKAoIJS6NnjI m.room.history_visibility !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 5 1646688871825 1646688871873 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491931 NULL
6 $E9MoYI7V9WMas1fL0DhPdL3xM_tCVLAmAlqxQhgSVeU m.room.guest_access !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 6 1646688871957 1646688871997 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491932 NULL
7 $1p9qznXfxDpj-yyPYSf2_sFRWxLpIBJVDU79YGyrtTk m.bridge !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 7 1646688872072 1646688872101 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491933 net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net NULL
8 $HDGxoG8PotKnEs92lm4y4z_ecaQ_Lr41-SAn9PDZJJo uk.half-shot.bridge !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 8 1646688872151 1646688872192 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491934 net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net NULL
9 $LmWV4ZraOZ9HUhL8ZE5Rsl9_GsKiFb8EEuq7jq73MTA m.room.avatar !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 9 1646688872256 1646688872304 @_adam203072022_whatsapp_xxx:beeper-dev.com Yes synapse-event-persister-1 2491935 NULL
10 $BGiiLu-iIpuo8I6zthh-l1QlqIZDL-Sq52ahd_KK4tM m.room.encryption !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 10 1646688872355 1646688872392 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491936 NULL
11 $J9tLCWLA76Yq0TFw7V-S0hsbG8znHwA6we1jZy6nuf4 http://m.room.name/ !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 11 1646688872445 1646688872476 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491937 NULL
12 $GzdqlpNsqZ09ZsMmJpz3n4zqmqScc-Yrt6ErzpN7858 m.room.topic !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 12 1646688872511 1646688872541 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491938 NULL
13 $iUd_bYnsP33IHeBz0q2tTxZT37vtqWpklKo53lr1MBw m.room.member !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 13 1646688872593 1646688872654 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491939 @_adam203072022_whatsapp_bot:beeper-dev.com NULL
14 $NbNAmJmGCOQ-fJV9g16kLFOYdC0E9G46XkDsowGoxRE m.room.member !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 14 1646688872941 1646688872978 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491940 @adam203072022:beeper-dev.com NULL
15 $VDdaVrwPRIOdFaafGPPSR242zVGXee2agRtRjZNGbXE m.room.member !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 15 1646688873144 1646688873200 @adam203072022:beeper-dev.com No synapse-event-persister-1 2491941 @adam203072022:beeper-dev.com NULL
16 $7wsVPnOzGCH_ZXUHGfPJBiwF2z1pMTMdyhQLpynUkJ4 m.room.member !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 16 1646688873408 1646688873442 @_adam203072022_whatsapp_bot:beeper-dev.com No synapse-event-persister-1 2491942 @_adam203072022_whatsapp_bot:beeper-dev.com NULL
17 $utQhwFEYV1mpeHH_uWJUhRW_drV-h0w9SHhvB4yK1RM fi.mau.dummy.portal_created !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 17 1646688873650 1646688873682 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491943 NULL NULL
18 $gdlzxXbXnQMttMwmgj1dVBeCxpYTwumKNj5xpL_nkNM fi.mau.dummy.pre_backfill !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 18 1646688875517 1646688875539 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491945 NULL NULL
19 $Mt9Ldr_g1ld8Y5OagQIQ2cDGHNcAw_rQx3D733SVeVc fi.mau.dummy.backfill_end !vXpiDigZlPILPsLRdK:beeper-dev.com NULL NULL Yes No 19 1646688877845 1646688877876 @_adam203072022_whatsapp_xxx:beeper-dev.com No synapse-event-persister-1 2491946 NULL NULL

Made a sync request with the stream token s2491926_4_300_6622318_1064167_531325_269851_264048_1 and got a next_batch of s2491940_4_300_6622320_1064168_531325_269851_264048_1, but notice how we didn't get the m.bridge event with the stream_ordering of 2491933 in the reponse.

// https://matrix.beeper-dev.com/_matrix/client/r0/sync?filter=0&timeout=30000&since=s2491926_4_300_6622318_1064167_531325_269851_264048_1
{
    "next_batch": "s2491940_4_300_6622320_1064168_531325_269851_264048_1",
    "device_one_time_keys_count": {
        "signed_curve25519": 50
    },
    "org.matrix.msc2732.device_unused_fallback_key_types": [
        "signed_curve25519"
    ],
    "device_unused_fallback_key_types": [
        "signed_curve25519"
    ],
    "rooms": {
        "invite": {
            "!vXpiDigZlPILPsLRdK:beeper-dev.com": {
                "invite_state": {
                    "events": [
                        {
                            "type": "m.room.avatar",
                            "state_key": "",
                            "content": {
                                "url": "mxc://beeper-dev.com/e56a9a291477956ca28034c41a564f38b4826a90"
                            },
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com"
                        },
                        {
                            "type": "m.room.create",
                            "state_key": "",
                            "content": {
                                "creator": "@_adam203072022_whatsapp_xxx:beeper-dev.com",
                                "m.federate": false,
                                "room_version": "9"
                            },
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com"
                        },
                        {
                            "type": "m.room.encryption",
                            "state_key": "",
                            "content": {
                                "algorithm": "m.megolm.v1.aes-sha2"
                            },
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com"
                        },
                        {
                            "type": "m.room.join_rules",
                            "state_key": "",
                            "content": {
                                "join_rule": "invite"
                            },
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com"
                        },
                        {
                            "type": "m.room.member",
                            "state_key": "@_adam203072022_whatsapp_xxx:beeper-dev.com",
                            "content": {
                                "avatar_url": "mxc://beeper-dev.com/e56a9a291477956ca28034c41a564f38b4826a90",
                                "displayname": "+xxx",
                                "membership": "join"
                            },
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com"
                        },
                        {
                            "type": "m.room.name",
                            "state_key": "",
                            "content": {
                                "name": "+xxx"
                            },
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com"
                        },
                        {
                            "type": "m.room.topic",
                            "state_key": "",
                            "content": {
                                "topic": "WhatsApp private chat"
                            },
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com"
                        },
                        {
                            "content": {
                                "displayname": "adam203072022",
                                "fi.mau.will_auto_accept": true,
                                "is_direct": true,
                                "membership": "invite"
                            },
                            "origin_server_ts": 1646688872941,
                            "sender": "@_adam203072022_whatsapp_xxx:beeper-dev.com",
                            "state_key": "@adam203072022:beeper-dev.com",
                            "type": "m.room.member",
                            "unsigned": {
                                "age": 117
                            },
                            "event_id": "$NbNAmJmGCOQ-fJV9g16kLFOYdC0E9G46XkDsowGoxRE"
                        }
                    ]
                }
            }
        }
    }
}
anoadragon453 commented 2 years ago

@bradtgmurray The invite_state of a room will only include state events types specified by the room_prejoin_state config option. By default m.bridge won't be included. It will need to be added under room_prejoin_state.additional_event_types on the homeserver that sent the invite. Do you have a snapshot of a /sync request with m.bridge missing from a rooms.join.<room_id> map?

(Completely tangentially, note that as MSC2346 hasn't yet completed FCP, the unprefixed identifier m.bridge should only be used in private rooms that do not extend past your own client implementations.)

Fizzadar commented 2 years ago

@anoadragon453 I have re-created the case above from Brad; I've included the event list and trimmed sync responses in this gist: https://gist.github.com/Fizzadar/316dabfe9be30ebaa48e288de2380de1. Some notes:

Notably, this is only occurring during initial backfill using MSC2716 batch sends. When this is disabled on the bridge we have not yet replicated the problem seen here.

reivilibre commented 2 years ago

@Fizzadar Hi; thanks for taking some time to write this up.

I'm still unfamiliar with MSC2716, but I think it may be useful to know the structure of the DAG to figure out what's going on here. Is there any chance you'd be able to run the following?:

SELECT stream_ordering, type, state_key, event_id, (ej.json::json)->'prev_events' AS prev_events FROM events
LEFT JOIN event_json AS ej USING (event_id)
WHERE events.room_id ='!SjekxtmsBigrUMcMkX:beeper-dev.com' ORDER BY stream_ordering ASC;

If the room has been used more since your last message, feel free to trim it to the range that was there previously.

Fizzadar commented 2 years ago

@reivilibre thanks for looking into this; query results:


LEFT JOIN event_json AS ej USING (event_id)
WHERE events.room_id ='!SjekxtmsBigrUMcMkX:beeper-dev.com' ORDER BY stream_ordering ASC;
 stream_ordering |             type             |                          state_key                           |                   event_id                   |                   prev_events
-----------------+------------------------------+--------------------------------------------------------------+----------------------------------------------+--------------------------------------------------
         -113203 | org.matrix.msc2716.insertion |                                                    | $G5GLwX6lc8MXKqwauoM1p9lkLD6S5GfwcosFlPra0ws | []
         -113202 | m.room.encrypted             |                                                    | $I3PsF44AMu5I_XOZQ2Hm1cOhcdLJCbA8NcAMS69hLEE | ["$G5GLwX6lc8MXKqwauoM1p9lkLD6S5GfwcosFlPra0ws"]
         -113201 | m.room.encrypted             |                                                    | $hIO0ZI-ZYr8gWGkWtvBiPfCA0jKdjiZP7PCnEfuutTY | ["$I3PsF44AMu5I_XOZQ2Hm1cOhcdLJCbA8NcAMS69hLEE"]
         -113200 | m.room.encrypted             |                                                    | $-jWSmA7NJ1dbl5b86K8FAKN2ydt-qoOw8ftCdSEN4A4 | ["$hIO0ZI-ZYr8gWGkWtvBiPfCA0jKdjiZP7PCnEfuutTY"]
         -113199 | m.room.encrypted             |                                                    | $5EtpKKW7g5HV5oCweU6vlbGjptR095p-kZSSSy2tBgs | ["$-jWSmA7NJ1dbl5b86K8FAKN2ydt-qoOw8ftCdSEN4A4"]
         -113198 | m.room.encrypted             |                                                    | $5aVzqGvybXzV3Xj3AwA8SQKqp2X2Z7RLH7IGX5VBeX4 | ["$5EtpKKW7g5HV5oCweU6vlbGjptR095p-kZSSSy2tBgs"]
         -113197 | org.matrix.msc2716.batch     |                                                    | $KkSRlRidPiOkkLEuiHs3gWLyCqCdQTHAVYJxzhDIzcM | ["$5aVzqGvybXzV3Xj3AwA8SQKqp2X2Z7RLH7IGX5VBeX4"]
         -111135 | org.matrix.msc2716.insertion |                                                    | $LQBf_c9-1stBXkPOLqZwkJgMg5YBCp448qcUwKLN6ks | []
         -111134 | m.room.encrypted             |                                                    | $RzKCB8rV0a9jfQVfbJeGvgGIZtT_X7ODb-78LUhXpXs | ["$LQBf_c9-1stBXkPOLqZwkJgMg5YBCp448qcUwKLN6ks"]
         -111133 | org.matrix.msc2716.batch     |                                                    | $l_jGmPdlA73wDzS3_gRwC67co1pTnx9-QBufta10xWI | ["$RzKCB8rV0a9jfQVfbJeGvgGIZtT_X7ODb-78LUhXpXs"]
         -111132 | org.matrix.msc2716.insertion |                                                    | $_mT6Awgb70IjwseWpCUeU0O4tm_iT7OBe4IPLJYUHUU | ["$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68"]
         2633404 | m.room.create                |                                                    | $ZMCw8gIYOeNmC2mGNKiAdxbRdwuhDiA9gNaM8EwDlRM | []
         2633421 | m.room.member                | @_adam4_whatsapp_xxx:beeper-dev.com                | $a6LZbYMvr5yYRcFc0uethSHLVC_0l7LRUr0nCZXB2lQ | ["$ZMCw8gIYOeNmC2mGNKiAdxbRdwuhDiA9gNaM8EwDlRM"]
         2633434 | m.room.power_levels          |                                                    | $S2rnt0m3dfbgC0ySGI3RGSdIjNqhgjjilZ2Gm7jGtvg | ["$a6LZbYMvr5yYRcFc0uethSHLVC_0l7LRUr0nCZXB2lQ"]
         2633448 | m.room.join_rules            |                                                    | $5hkAty1mFIMtuFINEh8wA3OyIgnO6Qxh1HgCTT9c1p0 | ["$S2rnt0m3dfbgC0ySGI3RGSdIjNqhgjjilZ2Gm7jGtvg"]
         2633461 | m.room.history_visibility    |                                                    | $nNR6iQSvWPpoKRvfpXJXU_sWpnk8eCwTFJD9zwC2rfo | ["$5hkAty1mFIMtuFINEh8wA3OyIgnO6Qxh1HgCTT9c1p0"]
         2633469 | m.room.guest_access          |                                                    | $iNJMDkjnFwBekoCZASqZmiNtH8UMLsPc_1VHJD6IFpU | ["$nNR6iQSvWPpoKRvfpXJXU_sWpnk8eCwTFJD9zwC2rfo"]
         2633478 | m.bridge                     | net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net | $p5fs4ik0Lnj-mIbw-cjPHX_aWzx0x6O_rIrhSi8_1C8 | ["$iNJMDkjnFwBekoCZASqZmiNtH8UMLsPc_1VHJD6IFpU"]
         2633492 | uk.half-shot.bridge          | net.maunium.whatsapp://whatsapp/xxx@s.whatsapp.net | $tAZ3y00piOU_54qs0MuK-HRPivXNRBmN8x6yY5KWXkk | ["$p5fs4ik0Lnj-mIbw-cjPHX_aWzx0x6O_rIrhSi8_1C8"]
         2633503 | m.room.avatar                |                                                    | $BZOaOEhozp28cCJtdh8A6R2hKgXiuDmd1NWI2vEuyPs | ["$tAZ3y00piOU_54qs0MuK-HRPivXNRBmN8x6yY5KWXkk"]
         2633515 | m.room.encryption            |                                                    | $qTv8IJZIdlH_3S7atbGTrd-7p7UEt4siocS0vz8-K3Y | ["$BZOaOEhozp28cCJtdh8A6R2hKgXiuDmd1NWI2vEuyPs"]
         2633523 | m.room.name                  |                                                    | $joClbFehqiqiVOuYYDWGJeYaozJnaccZgVog8j4bB6g | ["$qTv8IJZIdlH_3S7atbGTrd-7p7UEt4siocS0vz8-K3Y"]
         2633533 | m.room.topic                 |                                                    | $WP0ALeVvT2oCRSt9qs4_bE2O5AKsu_4GqbTCPFwaKa4 | ["$joClbFehqiqiVOuYYDWGJeYaozJnaccZgVog8j4bB6g"]
         2633547 | m.room.member                | @_adam4_whatsapp_bot:beeper-dev.com                | $hq4kZZPt4ItYdBF_I_AVITFHOZOZ2b_Av8ss3F06byg | ["$WP0ALeVvT2oCRSt9qs4_bE2O5AKsu_4GqbTCPFwaKa4"]
         2633571 | m.room.member                | @adam4:beeper-dev.com                              | $DyFoSr8CbRUyexbx6_yuZxgS4S7lAEHDYgldiRGnppc | ["$hq4kZZPt4ItYdBF_I_AVITFHOZOZ2b_Av8ss3F06byg"]
         2633582 | m.room.member                | @adam4:beeper-dev.com                              | $51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU | ["$DyFoSr8CbRUyexbx6_yuZxgS4S7lAEHDYgldiRGnppc"]
         2633590 | m.room.member                | @_adam4_whatsapp_bot:beeper-dev.com                | $Sx9mswlxUJs1rRL4UyHARmY7V0h8_cH_c_ZDnwTn3cc | ["$51gzIcw5d1fvbNLqYXdBwma669-H8DZlz0UzsavMBbU"]
         2633601 | fi.mau.dummy.portal_created  |                                                    | $a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68 | ["$Sx9mswlxUJs1rRL4UyHARmY7V0h8_cH_c_ZDnwTn3cc"]
         2633614 | fi.mau.dummy.pre_backfill    |                                                    | $E97WApcmVmCJ6nt2WL3TnuxhyCLnPl1dt-t6p25ZLwY | ["$a6xyChJTR8wJLsfYfGI8Tu1W5cXrtohkBRdwCSsyG68"]
         2633633 | fi.mau.dummy.backfill_end    |                                                    | $NILDi9SfVDgaW72alIEsc5ESzdJSoLip_c7klMAeuPo | ["$E97WApcmVmCJ6nt2WL3TnuxhyCLnPl1dt-t6p25ZLwY"]
(30 rows)```
reivilibre commented 2 years ago

@Fizzadar Oooh, interesting (maybe, likely need to think about it! whilst reading the batch send MSC!). What I find interesting is that there are multiple events with no prev_events (which is what I expect from the batch sending stuff) but no later events have more than 1, to merge the forks together. (Only making remarks so I can keep track of this; not sure if that's important or not.) Are those all the events in the room; if not, are there any that merge later on?

Fizzadar commented 2 years ago

@reivilibre that's everything in the room! We left it alone after replicating the issue in case this data is useful :)

bradtgmurray commented 2 years ago

I wonder if it makes sense to split out our line of discussion, because presumably this is something to do with batch_send and not whatever the original issue was back in Feb 2021.

anoadragon453 commented 2 years ago

Given the complexity of how Synapse's implementation ofMSC2716 interacts with the room DAG, it does appear likely that the bug in your case @bradtgmurray is related to batch_send code - specifically the /sync bit of it. Moving and tackling this case in a separate issue would be ideal.

I see that https://github.com/mautrix/whatsapp/pull/463 has been linked to this issue though - do you find that backfilling the room works to recover the missing event in all cases?

bradtgmurray commented 2 years ago

The whatsapp PR is just us trying things attempting to work around the issue, haven't gotten to the bottom of it yet.

Do you know if the DAG we posted above looks correct to you?

Fizzadar commented 2 years ago

I've split out relevant parts into a new issue https://github.com/matrix-org/synapse/issues/12281, let's continue discussions there.

richvdh commented 2 years ago

I'm not really sure what's going on here. It seems like there are:

In any case, there is probably more than one bug here, and lumping further reports into a single issue is likely to cause confusion.

As such, can this be closed? I don't think there's any useful context here.

anoadragon453 commented 2 years ago

@richvdh I agree - this issue has become a bit of a mishmash of related problems, with an awful lot of old comments to read before hitting content relevant to today's ecosystem.

Closing this issue as it looks to have run its course. Please file a new issue if you have one; potentially linking this one if relevant.