I think the fix here is to somehow detect that we have a backwards extremity between the `prev_batch` and current token, and if so set the `limited` flag (ensuring we don't send down any events from before the potential gap).
Not directly related, but MSC3871 talks a bit about detecting gaps in timelines. (Although from a `/messages` POV.)
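
To make the backwards-extremity idea above a bit more concrete, here is a rough sketch of how the check could slot into building a sync response. This is not existing Synapse code: the store method `get_backward_extremities_in_range` and the surrounding signature are made up for illustration.

```python
# Sketch only -- hypothetical helper names, not actual Synapse internals.

async def should_force_limited(store, room_id, since_token, now_token) -> bool:
    """Return True if we know of a gap (a backwards extremity) between the
    client's since/prev_batch position and the current stream position."""
    gap_event_ids = await store.get_backward_extremities_in_range(
        room_id,
        min_stream=since_token.stream,
        max_stream=now_token.stream,
    )
    # If any event in that window has prev_events we never fetched, the
    # timeline we're about to return may be missing history, so the client
    # should be told the response is limited (gappy) and paginate to fill it.
    return bool(gap_event_ids)
```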
> The syncing clients will receive the new events down their timeline, however the `limited` flag is not set. This means the client thinks there is no gap, and so won't try and backfill any missed messages.
Why is the `limited` flag not set? Is it because 10 events is less than the default 20 events that clients usually request?
> The remaining events will be backfilled when a client does an appropriate call to `/messages`.
Would #13576 help at all? (Proactively backfilling more events.)
> The syncing clients will receive the new events down their timeline, however the `limited` flag is not set. This means the client thinks there is no gap, and so won't try and backfill any missed messages.
>
> Why is the `limited` flag not set? Is it because 10 events is less than the default 20 events that clients usually request?
Yup, I believe so.
> The remaining events will be backfilled when a client does an appropriate call to `/messages`.
>
> Would #13576 help at all? (Proactively backfilling more events.)
It'd mitigate this a bit, yes.
I guess a band-aid might be to try to backfill 100 events or something. This won't work though when the netsplit is only 11 events.
I'm having some trouble tracking how the backfill would even work here. Currently the homeserver wouldn't attempt to backfill those events until the user backscrolled (via `/messages`) to where, topologically, the netsplit occurred, correct? (What I'm asking is: even with the `limited` flag set, how would a client easily get those messages, given the split in ordering between `/messages` and `/sync` -- see https://github.com/matrix-org/matrix-spec/issues/852.)
Oh, hmm.
Firstly: I care less right now if you have to backpaginate a bunch before you actually do a backfill, but agreed that is sucky.
Secondly, I think that if clients' `/messages` calls set `from` to be `prev_batch` and `to` to be the sync token returned by the previous sync, we'll never actually paginate to the topological ordering where we'd trigger backfill. I'm not sure if clients actually set a `to`?
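
For reference, the pagination call being discussed is `GET /_matrix/client/v3/rooms/{roomId}/messages`; whether clients actually pass `to` is exactly the open question. A rough illustration with placeholder values:

```python
# Illustrative only: placeholder homeserver, room, tokens and access token.
import requests

homeserver = "https://example.org"
room_id = "!room:example.org"
prev_batch = "PREV_BATCH_TOKEN_FROM_SYNC"
previous_sync_token = "TOKEN_FROM_PREVIOUS_SYNC"

resp = requests.get(
    f"{homeserver}/_matrix/client/v3/rooms/{room_id}/messages",
    headers={"Authorization": "Bearer ACCESS_TOKEN"},
    params={
        "dir": "b",                 # paginate backwards from `from`
        "from": prev_batch,         # prev_batch of the gappy sync response
        "to": previous_sync_token,  # stop once we reach the previous sync position
        "limit": 20,
    },
)
events = resp.json()["chunk"]
```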
> I'm not sure if clients actually set a `to`?
I do see some hits for it, but they are all either a 401 error with a missing access token, or the client disconnecting before a response is sent?
```
$ grep "to=" synchrotron*.log | grep -v " 401 " | grep -v "already disconnected"
```
Oh, that's exciting!
> I think the fix here is to somehow detect that we have a backwards extremity between the `prev_batch` and current token, and if so set the `limited` flag (ensuring we don't send down any events from before the potential gap)
We talked a bit about this on the phone; it might work to check something about the topological ordering of the first (or last?) event sent down sync vs. the latest topological ordering of the room. I'm unsure this would work.
Maybe another thing would be if any of the events in the room since the `since` token have a stream ordering that's negative, or a prev event that has a negative stream ordering?
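
As a sketch of that heuristic (not real Synapse code; the `events`/`event_edges` table and column names, and the DB-API-style placeholders, are assumptions that would need checking): since, as I understand it, backfilled events are persisted with negative stream orderings, an event in the sync window whose prev event is missing or has a negative stream ordering suggests it sits just after a gap.

```python
# Heuristic sketch only -- not actual Synapse code; schema details assumed.
MAYBE_GAPPY_SQL = """
    SELECT 1
    FROM events AS e
    JOIN event_edges AS edge ON edge.event_id = e.event_id
    LEFT JOIN events AS prev ON prev.event_id = edge.prev_event_id
    WHERE e.room_id = ?
      AND e.stream_ordering > ?    -- events since the client's `since` token...
      AND e.stream_ordering <= ?   -- ...up to the current token
      AND (prev.stream_ordering IS NULL   -- prev event missing entirely
           OR prev.stream_ordering < 0)   -- or only present via backfill
    LIMIT 1
"""

def room_may_have_gap(cur, room_id, since_stream, now_stream):
    """Return True if any event in the sync window looks like it follows a gap."""
    cur.execute(MAYBE_GAPPY_SQL, (room_id, since_stream, now_stream))
    return cur.fetchone() is not None
```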
Looks like when artificially forcing the `limited` flag, Element Web at least doesn't use the `to` parameter, so that might still be enough for a short-term fix. It looks like it throws away the entire timeline and paginates it back in?
My plan is to add a table that records whenever we receive an event over federation which we don't have the prev events for (and haven't fetched), i.e. where we have a "gap". Then we use that to force the `limited` flag.
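
Something like the following rough sketch (all names hypothetical; a real migration would obviously need more care around indexes, cleanup, etc.): record the gap when we accept an event whose prev events are missing, then force `limited` in sync if a recorded gap falls inside the window.

```python
# Hypothetical schema/queries sketching the plan above -- names are made up.
CREATE_TIMELINE_GAPS_SQL = """
    CREATE TABLE timeline_gaps (
        room_id TEXT NOT NULL,
        event_id TEXT NOT NULL,          -- the event whose prev_events we were missing
        stream_ordering BIGINT NOT NULL  -- where the gap sits in the room's stream
    );
    CREATE INDEX timeline_gaps_room_stream
        ON timeline_gaps (room_id, stream_ordering);
"""

# At sync time: force `limited` if a recorded gap falls between the client's
# since token and the current position.
GAP_IN_WINDOW_SQL = """
    SELECT 1 FROM timeline_gaps
    WHERE room_id = ? AND stream_ordering > ? AND stream_ordering <= ?
    LIMIT 1
"""
```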
If there has been e.g. a netsplit in a room where the remote servers have sent more than ten events, then when the netsplit resolves the local server will fetch up to 10 events. The remaining events will be backfilled when a client does an appropriate call to `/messages`.

The syncing clients will receive the new events down their timeline, however the `limited` flag is not set. This means the client thinks there is no gap, and so won't try and backfill any missed messages.

The net result is that clients won't ever see the older events in the remote fork.
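
For context, the affected room's timeline section in the `/sync` response then looks roughly like this (abridged illustration): with `limited` false and only the newly fetched events present, the client has no signal that anything is missing before them.

```python
# Abridged illustration of the affected room's timeline in the /sync response.
timeline = {
    # only the ~10 events the server fetched when the netsplit resolved
    "events": ["<newest events from the remote fork>"],
    # nothing tells the client there is a gap before those events
    "limited": False,
    # pagination token pointing just before these events
    "prev_batch": "<pagination token>",
}
```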