Problem context
Servers send other servers events. These events have prev_events. A receiving server may be missing those prev_events, which leaves a hole in the DAG (aka an outlier). To fill this hole there is the /get_missing_events API, which takes the latest event IDs and the earliest event IDs and returns events found by walking back from latest, ignoring anything in earliest (don't be conned into thinking it returns events "between" the two lists; it doesn't have to in the case of forks).
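To make the "walking back" behaviour concrete, here is a minimal sketch of a breadth-first walk over prev_events. It is a simplified in-memory model, not Dendrite's or Synapse's implementation: the function name walkBack, the map-based DAG, and the choice to leave the latest events themselves out of the result are all assumptions made for illustration.

```go
package main

import "fmt"

// walkBack models the /get_missing_events walk on an in-memory DAG.
// prevEvents maps an event ID to the IDs of its prev_events.
// The walk starts at latest, never enters anything listed in earliest,
// and stops once limit events have been collected.
// Assumption: the latest events themselves are not part of the result.
func walkBack(prevEvents map[string][]string, latest, earliest []string, limit int) []string {
	seen := make(map[string]bool)
	for _, id := range earliest {
		seen[id] = true
	}
	for _, id := range latest {
		seen[id] = true
	}

	queue := append([]string{}, latest...)
	var result []string
	for len(queue) > 0 && len(result) < limit {
		current := queue[0]
		queue = queue[1:]
		for _, prev := range prevEvents[current] {
			if seen[prev] {
				continue
			}
			seen[prev] = true
			result = append(result, prev)
			queue = append(queue, prev)
			if len(result) == limit {
				break
			}
		}
	}
	return result
}

func main() {
	// A is the root. B and C both point at A (a fork). D merges B and C.
	prevEvents := map[string][]string{
		"B": {"A"},
		"C": {"A"},
		"D": {"B", "C"},
	}
	// earliest = [B], latest = [D]
	fmt.Println(walkBack(prevEvents, []string{"D"}, []string{"B"}, 10)) // prints [C A]
}
```

In this example A is returned even though it is older than B, because it is reachable from D through the other side of the fork. That is why the result is not simply the events "between" the two lists.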
In the happy case:
We receive a transaction with events whose prev_events we do not recognise.
We request them via /get_missing_events and the returned events fill in the hole in the DAG.
We process those missing events and then the event from the transaction.
We return 200 OK to the transaction.
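The first step of the happy case, spotting prev_events we do not recognise, could be sketched roughly as follows. The Event struct and the missingPrevEvents helper are hypothetical names for illustration, and the sketch assumes that an event carried in the same transaction counts as available.

```go
package main

import "fmt"

// Event is a pared-down stand-in for a federated event (hypothetical type).
type Event struct {
	ID         string
	PrevEvents []string
}

// missingPrevEvents returns the prev_events referenced by the transaction
// that are neither already known locally nor included in the transaction.
// Each missing ID is reported once, in the order it is first referenced.
func missingPrevEvents(known map[string]bool, txn []Event) []string {
	inTxn := make(map[string]bool, len(txn))
	for _, ev := range txn {
		inTxn[ev.ID] = true
	}

	reported := make(map[string]bool)
	var missing []string
	for _, ev := range txn {
		for _, prev := range ev.PrevEvents {
			if known[prev] || inTxn[prev] || reported[prev] {
				continue
			}
			reported[prev] = true
			missing = append(missing, prev)
		}
	}
	return missing
}

func main() {
	known := map[string]bool{"A": true, "B": true}
	txn := []Event{
		{ID: "D", PrevEvents: []string{"C"}},
		{ID: "E", PrevEvents: []string{"D", "B"}},
	}
	fmt.Println(missingPrevEvents(known, txn)) // prints [C]
}
```

A non-empty result is the trigger for the /get_missing_events request in the second step.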
If we cannot obtain the prev_events, we can request the /state of the room at the event and continue on.
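There are many bad cases:
The server may not have the prev_events, or the requesting server may not be allowed to see those events.
The server may lie and say it does not know the prev_events, forcing us to hit /state, which can then lie about the entire room state.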
We can try to guard against lies by forcing the server who sent us the event to cough up the prev_events or else their transaction will be rejected.
In addition, the client needs to be informed of a new hole in the DAG, or else they will never hit /messages (and hence backfill) the hole, resulting in a gap in message history, e.g. due to lost connectivity on the server (this is exacerbated for p2p nodes). We need to send a limited sync to reset the client in this scenario.
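For reference, a limited sync shows up to the client as a room timeline with limited set to true and a prev_batch token it can pass to /messages. The sketch below only illustrates that shape: the Timeline struct and limitedTimeline helper are hypothetical and not Dendrite's types, the token value is made up, and a real response would also carry the most recent events rather than an empty list.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Timeline mirrors the timeline fields of a joined room in a /sync response.
// Hypothetical type for illustration only.
type Timeline struct {
	Events    []json.RawMessage `json:"events"`
	Limited   bool              `json:"limited"`
	PrevBatch string            `json:"prev_batch"`
}

// limitedTimeline builds a timeline that tells the client there is a gap
// before the returned events, which it can fill by calling /messages
// with the given prev_batch token.
func limitedTimeline(prevBatch string) (string, error) {
	tl := Timeline{
		Events:    []json.RawMessage{},
		Limited:   true,
		PrevBatch: prevBatch,
	}
	b, err := json.Marshal(tl)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := limitedTimeline("hypothetical_token")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out)
}
```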
The quick fix (which doesn't really fix everything):
On receiving a txn with missing prev_events, call /get_missing_events with limit=10 (Synapse parity).
If those events fill the hole then fab, prepend them to the transaction and process away.
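The two quick-fix steps above can be sketched as below. This is an illustration under stated assumptions, not the actual patch: Event, missingEventsFetcher, holeFilled and quickFix are hypothetical names, the fetcher stands in for the federation call to /get_missing_events and is assumed to return events oldest first, and the sketch assumes our current latest events are passed as the earliest list.

```go
package main

import "fmt"

// Event is a pared-down stand-in for a federated event (hypothetical type).
type Event struct {
	ID         string
	PrevEvents []string
}

// missingEventsFetcher stands in for a /get_missing_events request.
// Assumption: it returns events ordered oldest first.
type missingEventsFetcher func(latest, earliest []string, limit int) []Event

// missingEventsLimit is the limit=10 used for Synapse parity.
const missingEventsLimit = 10

// holeFilled reports whether every prev_event referenced by events is
// either already known locally or present in events itself.
func holeFilled(known map[string]bool, events []Event) bool {
	inBatch := make(map[string]bool, len(events))
	for _, ev := range events {
		inBatch[ev.ID] = true
	}
	for _, ev := range events {
		for _, prev := range ev.PrevEvents {
			if !known[prev] && !inBatch[prev] {
				return false
			}
		}
	}
	return true
}

// quickFix fetches missing events and, if they fill the hole, returns them
// prepended to the transaction. ourLatest is the list of our current latest
// events, passed as "earliest" (an assumption of this sketch).
// The boolean is false when the hole is still open after the fetch.
func quickFix(known map[string]bool, ourLatest []string, txn []Event, fetch missingEventsFetcher) ([]Event, bool) {
	if holeFilled(known, txn) {
		return txn, true
	}
	latest := make([]string, 0, len(txn))
	for _, ev := range txn {
		latest = append(latest, ev.ID)
	}
	fetched := fetch(latest, ourLatest, missingEventsLimit)
	combined := append(append([]Event{}, fetched...), txn...)
	if !holeFilled(known, combined) {
		return txn, false
	}
	return combined, true
}

func main() {
	known := map[string]bool{"A": true, "B": true}
	txn := []Event{{ID: "E", PrevEvents: []string{"D"}}}
	fetch := func(latest, earliest []string, limit int) []Event {
		return []Event{
			{ID: "C", PrevEvents: []string{"B"}},
			{ID: "D", PrevEvents: []string{"C"}},
		}
	}
	events, ok := quickFix(known, []string{"B"}, txn, fetch)
	fmt.Println(ok, len(events)) // prints true 3
}
```

What to do when the boolean comes back false is exactly the part the quick fix leaves open.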
The proper fix:
The ability to reset a room from the syncapi's perspective (with that translating to a limited sync).
Moving the backwards extremity logic from syncapi to the roomserver, so that when we receive a QueryBackfill we can service it from the roomserver db initially, then backfill when it hits a hole.
Modify the BFS logic in QueryBackfill to return the list of event IDs which are the furthest back it has from that event.
Hit /get_missing_events with those event IDs and the latest events of the main DAG.
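As a rough illustration of the BFS change, the sketch below walks back from an event through what we have stored and collects the events where the walk hits a hole. It is not the real QueryBackfill code: the furthestBack name and the map standing in for the roomserver db are assumptions, and "furthest back" is interpreted here as "an event we have with at least one prev_event we do not have".

```go
package main

import "fmt"

// furthestBack does a breadth-first walk from start over the events we hold.
// db maps each stored event ID to its prev_events; an ID that is absent from
// db is an event we do not have. The result lists the stored events that
// reference at least one absent prev_event, in the order the walk meets them.
func furthestBack(db map[string][]string, start string) []string {
	seen := map[string]bool{start: true}
	queue := []string{start}
	var frontier []string

	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]

		atEdge := false
		for _, prev := range db[id] {
			if _, have := db[prev]; !have {
				// We hit a hole: this event is as far back as we can go.
				atEdge = true
				continue
			}
			if !seen[prev] {
				seen[prev] = true
				queue = append(queue, prev)
			}
		}
		if atEdge {
			frontier = append(frontier, id)
		}
	}
	return frontier
}

func main() {
	// We hold Z, Y, X and W. V and U are referenced but not stored.
	db := map[string][]string{
		"Z": {"Y"},
		"Y": {"X", "W"},
		"X": {"V"},
		"W": {"U"},
	}
	fmt.Println(furthestBack(db, "Z")) // prints [X W]
}
```

The returned IDs are the ones that would be sent to /get_missing_events in the final step, alongside the latest events of the main DAG.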