`/sync` doesn't inform clients when there is a gap in the timeline

matrix-org / synapse

Synapse: Matrix homeserver written in Python/Twisted.

https://matrix-org.github.io/synapse

Apache License 2.0

11.83k stars 2.13k forks source link

`/sync` doesn't inform clients when there is a gap in the timeline #16463

Closed erikjohnston closed 1 year ago

erikjohnston commented 1 year ago

If there has bee e.g. a netsplit in a room where the remote servers have sent more than ten events, then when the netsplit resolves the local server will fetch up to 10 events. The remaining events will be backfilled when a client does an appropriate call to /messages.

The syncing clients will receive the new events down their timeline, however the limited flag is not set. This means the client thinks there is no gap, and so won't try and backfill any missed messages.

The net result is that clients won't ever see the older events in the remote fork.

erikjohnston commented 1 year ago

I think the fix here is to somehow detect that we have a backwards extremity between the prev_batch and current token, and if so set the limited flag (ensuring we don't send down any events from before the potential gap)

clokep commented 1 year ago

Not directly related, but MSC3871 talks a bit about detecting gaps in timelines. (Although from a /messages POV.)

clokep commented 1 year ago

The syncing clients will receive the new events down their timeline, however the limited flag is not set. This means the client thinks there is no gap, and so won't try and backfill any missed messages.

Why is the limited flag not set? Is it because because 10 events is less than the default 20 events that clients usually request?

clokep commented 1 year ago

The remaining events will be backfilled when a client does an appropriate call to /messages.

Would #13576 help at all? (Proactively backfilling more events.)

erikjohnston commented 1 year ago

The syncing clients will receive the new events down their timeline, however the limited flag is not set. This means the client thinks there is no gap, and so won't try and backfill any missed messages.

Why is the limited flag not set? Is it because because 10 events is less than the default 20 events that clients usually request?

Yup, I believe so.

The remaining events will be backfilled when a client does an appropriate call to /messages.

Would #13576 help at all? (Proactively backfilling more events.)

It'd mitigate this a bit, yes.

clokep commented 1 year ago

I guess a band-aid might be to try to backfill 100 events or something. This won't work though when the netsplit is only 11 events.

I'm having some trouble tracking here how the backfill would even work. Though, currently the homeserver wouldn't attempt to backfill those until the user backscrolled (via /messages) to where topologically the netsplit occurred, correct? (What I'm asking is, even with the limited flag set, how would a client get those messages easily due to the split in ordering between /messages and /sync -- see https://github.com/matrix-org/matrix-spec/issues/852).

erikjohnston commented 1 year ago

Oh, hmm.

Firstly: I care less right now if you have backpaginate a bunch before you actually do a backfill, but agreed that is sucky.

Secondly, I think that if clients /messages set from to be prev_batch and to to be the sync token returned by the previous sync, we'll never actually paginate to the topological ordering where we'd trigger backfill. I'm not sure if clients actually set a to?

clokep commented 1 year ago

I'm not sure if clients actually set a to?

I do see some hits for it, but they all are either a 401 error with missing access token or the client disconnects before a response is sent?

$ grep "to=" synchrotron*.log | grep -v " 401 " | grep -v "already disconnected"

erikjohnston commented 1 year ago

Oh, that's exciting!

clokep commented 1 year ago

I think the fix here is to somehow detect that we have a backwards extremity between the prev_batch and current token, and if so set the limited flag (ensuring we don't send down any events from before the potential gap)

We talked a bit about this on the phone, it might work to check something about the topo ordering of the first (or last?) event sent down sync vs. the latest topo of the room or something. I'm unsure this would work.

Maybe another thing would be if any of the events in the room since the since token have a stream ordering that's negative or a prev event that has a negative stream ordering?

erikjohnston commented 1 year ago

Looks like artificially forcing the limited flag doesn't use the to parameter on element web at least, so that might still be enough for a short term fix. It looks to throw away the entire timeline and paginate it back in?

erikjohnston commented 1 year ago

My plan is to add a table that records whenever we receive an event over federation which we don't have the prev events for (and haven't fetched), i.e. where we have a "gap". Then we use that to force the limited flag.