Inform syncapi about holes in DAGs

kegsay commented 4 years ago

Problem context

Servers send events to other servers. These events have prev_events. A receiving server may be missing those prev_events, which creates a hole in the DAG (aka an outlier). To fill this hole, there's the /get_missing_events API, which takes the latest event IDs and the earliest event IDs and returns events found by walking back from latest, ignoring anything in earliest. Don't be conned into thinking it returns events "between" the two lists: in the case of forks it doesn't have to.
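To make the "walking back, not between" point concrete, here is a rough sketch of how a responding server might compute the result. The Event type, the in-memory map and the function name are all made up for illustration; this is not dendrite's or synapse's implementation.

```go
package main

import "fmt"

// Event is a hypothetical, stripped-down event: an ID plus its prev_events.
type Event struct {
	ID         string
	PrevEvents []string
}

// missingEvents walks backwards from latest via prev_events, breadth-first,
// skipping anything listed in earliest, and stops once limit events are found.
// known is the responding server's (assumed in-memory) view of the room DAG.
func missingEvents(known map[string]Event, earliest, latest []string, limit int) []Event {
	seen := make(map[string]bool)
	for _, id := range earliest {
		seen[id] = true // the requester already has these
	}
	for _, id := range latest {
		seen[id] = true // never return the starting points themselves
	}
	queue := append([]string{}, latest...)
	var out []Event
	for len(queue) > 0 && len(out) < limit {
		id := queue[0]
		queue = queue[1:]
		ev, ok := known[id]
		if !ok {
			continue // we cannot walk through an event we do not have
		}
		for _, prev := range ev.PrevEvents {
			if seen[prev] {
				continue
			}
			seen[prev] = true
			p, ok := known[prev]
			if !ok {
				continue
			}
			out = append(out, p)
			if len(out) == limit {
				break
			}
			queue = append(queue, prev)
		}
	}
	return out
}

func main() {
	// A fork: $B and $C both point at $A, and $D merges them.
	known := map[string]Event{
		"$A": {ID: "$A"},
		"$B": {ID: "$B", PrevEvents: []string{"$A"}},
		"$C": {ID: "$C", PrevEvents: []string{"$A"}},
		"$D": {ID: "$D", PrevEvents: []string{"$B", "$C"}},
	}
	// The requester has $B and has just been sent $D.
	for _, ev := range missingEvents(known, []string{"$B"}, []string{"$D"}, 10) {
		fmt.Println(ev.ID) // prints $C then $A
	}
}
```

Note that $A comes back even though it sits behind $B rather than "between" $B and $D: the walk reaches it through the other side of the fork.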

In the happy case:

We receive a transaction with events whose prev_events we do not recognise.
We request them via /get_missing_events and the returned events fill in the hole in the DAG.
We process those missing events and then the event from the transaction.
We return 200 OK to the transaction.

If we cannot obtain the prev_events, we can request the /state of the room at the event and continue on.
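The flow above could look roughly like the sketch below. Remote, fillHole, ErrHole and fakeRemote are hypothetical names for illustration only; the real federation client and roomserver APIs differ. The caller is assumed to fall back to /state (or reject the transaction) when ErrHole comes back.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical, stripped-down event: an ID plus its prev_events.
type Event struct {
	ID         string
	PrevEvents []string
}

// Remote is a hypothetical stand-in for the federation client.
type Remote interface {
	GetMissingEvents(earliest, latest []string, limit int) ([]Event, error)
}

// ErrHole means the hole could not be filled; the caller decides what to do next.
var ErrHole = errors.New("prev_events still missing after /get_missing_events")

func allKnown(known map[string]bool, ids []string) bool {
	for _, id := range ids {
		if !known[id] {
			return false
		}
	}
	return true
}

// fillHole returns the events to process, oldest first and ending with ev,
// or ErrHole if the fetched events do not connect ev to what we already know.
func fillHole(remote Remote, known map[string]bool, forward []string, ev Event, limit int) ([]Event, error) {
	if allKnown(known, ev.PrevEvents) {
		return []Event{ev}, nil // no hole at all
	}
	fetched, err := remote.GetMissingEvents(forward, []string{ev.ID}, limit)
	if err != nil {
		return nil, err
	}
	have := make(map[string]bool, len(known))
	for id := range known {
		have[id] = true
	}
	// Repeatedly accept any event whose prev_events are all known. This avoids
	// assuming any particular ordering of the response.
	pending := append(append([]Event{}, fetched...), ev)
	var ordered []Event
	for progress := true; progress && len(pending) > 0; {
		progress = false
		rest := pending[:0]
		for _, e := range pending {
			if allKnown(have, e.PrevEvents) {
				have[e.ID] = true
				ordered = append(ordered, e)
				progress = true
			} else {
				rest = append(rest, e)
			}
		}
		pending = rest
	}
	if len(pending) > 0 {
		return nil, ErrHole
	}
	return ordered, nil
}

// fakeRemote returns a canned response, purely for the example.
type fakeRemote struct{ events []Event }

func (f fakeRemote) GetMissingEvents(_, _ []string, limit int) ([]Event, error) {
	if len(f.events) > limit {
		return f.events[:limit], nil
	}
	return f.events, nil
}

func main() {
	known := map[string]bool{"$A": true}
	remote := fakeRemote{events: []Event{
		{ID: "$C", PrevEvents: []string{"$B"}},
		{ID: "$B", PrevEvents: []string{"$A"}},
	}}
	ev := Event{ID: "$D", PrevEvents: []string{"$C"}}
	ordered, err := fillHole(remote, known, []string{"$A"}, ev, 10)
	fmt.Println(ordered, err)
}
```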

There are many bad cases:

The remote server may itself be missing the prev_events, or the requesting server may not be allowed to see those events.
The remote server may lie and say it does not know the prev_events, forcing the receiving server to hit /state, where the remote server can then lie about the entire room state.

We can try to guard against lies by forcing the server who sent us the event to cough up the prev_events or else their transaction will be rejected.

In addition, the client needs to be informed of a new hole in the DAG, or else it will never hit /messages for the hole (and hence never backfill it), resulting in a gap in message history, e.g. due to lost connectivity on the server (this is exacerbated for p2p nodes). We need to send a limited sync to reset the client in this scenario.
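For reference, a "limited sync" here means the room's timeline in the /sync response carries limited: true plus a prev_batch token the client can pass to /messages. A minimal sketch of that shape follows; the struct is simplified, and timelineFor and the token value are hypothetical, not syncapi code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Timeline mirrors the per-room timeline section of a /sync response
// (events, limited, prev_batch), simplified for illustration.
type Timeline struct {
	Events    []json.RawMessage `json:"events"`
	Limited   bool              `json:"limited"`
	PrevBatch string            `json:"prev_batch,omitempty"`
}

// timelineFor marks the timeline as limited when a hole was detected, so the
// client knows to paginate backwards from prevBatch via /messages.
func timelineFor(events []json.RawMessage, hole bool, prevBatch string) Timeline {
	tl := Timeline{Events: events}
	if hole {
		tl.Limited = true
		tl.PrevBatch = prevBatch
	}
	return tl
}

func main() {
	events := []json.RawMessage{json.RawMessage(`{"event_id":"$D"}`)}
	// "t42_hypothetical" is a placeholder token, not a real token format.
	out, err := json.Marshal(timelineFor(events, true, "t42_hypothetical"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```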

kegsay commented 4 years ago

The quick fix (which doesn't really fix everything):

On receiving a txn with missing prev_events, call /get_missing_events with limit=10 (for parity with synapse).
If those events fill the hole then fab, prepend them to the transaction and process away.

The proper fix:

The ability to reset a room from syncapi's perspective (and that translating to a limited sync)
Moving the backwards extremity logic from syncapi to the roomserver so when we receive a QueryBackfill we can service from the roomserver db initially, then backfill when it hits a hole.
Modify the BFS logic in QueryBackfill to return the list of event IDs which are the furthest back it has from that event.
Hit /get_missing_events with those event IDs and the latest events of the main DAG
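The BFS change could be sketched as below: walk back through what the local DB has and report the events at the edge of the hole, i.e. the furthest-back events we hold whose prev_events we are missing. The map-backed db and the furthestBack name are assumptions for illustration, not the actual QueryBackfill code.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a hypothetical, stripped-down event: an ID plus its prev_events.
type Event struct {
	ID         string
	PrevEvents []string
}

// furthestBack does a BFS from start over locally stored events and returns
// the IDs of stored events that reference at least one prev_event we lack.
// Those IDs mark where local history runs out.
func furthestBack(db map[string]Event, start string) []string {
	seen := map[string]bool{start: true}
	queue := []string{start}
	var frontier []string
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		ev, ok := db[id]
		if !ok {
			continue // start itself is unknown to us
		}
		atEdge := false
		for _, prev := range ev.PrevEvents {
			if _, have := db[prev]; !have {
				atEdge = true // this prev_event is part of a hole
				continue
			}
			if !seen[prev] {
				seen[prev] = true
				queue = append(queue, prev)
			}
		}
		if atEdge {
			frontier = append(frontier, id)
		}
	}
	sort.Strings(frontier) // deterministic output for the example
	return frontier
}

func main() {
	// $B is missing locally, so $C sits on the edge of the hole.
	db := map[string]Event{
		"$create": {ID: "$create"},
		"$A":      {ID: "$A", PrevEvents: []string{"$create"}},
		"$C":      {ID: "$C", PrevEvents: []string{"$B"}},
		"$D":      {ID: "$D", PrevEvents: []string{"$C"}},
		"$E":      {ID: "$E", PrevEvents: []string{"$D", "$A"}},
	}
	fmt.Println(furthestBack(db, "$E")) // prints [$C]
}
```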

kegsay commented 4 years ago

This is mostly resolved now, but:

We don't handle rejected events very well.
We need to tell the syncapi about holes.
We need the syncapi to reset clients sensibly so they can /messages.

matrix-org / dendrite

Inform syncapi about holes in DAGs #1006