ssbc / ssb2-discussion-forum


DAG sync #10

Open · staltz opened 1 year ago

staltz commented 1 year ago

I want to have a dedicated thread for this topic, since it's complex.

Some comments from #7 by @gpicron:

> Instead of being a message on a meta feed that you must replicate, this is a message that is the first message of a feed. There is potentially a link to some creator signing key, and the creator may announce its existence in another feed, but there is no direct coupling.

> It also generalizes the replication algorithm, and implies that for 2 peers to sync they compare their knowledge of DAGs rooted at some message, whether for feeds, threads, CRDTs, etc. And it makes creating feeds cheap. You can have several feeds with the same writer key. You can rotate the writer key/algorithm.

I'm concerned about requiring the first message in a feed, because that means we cannot do sliced replication (see the glossary below). Sliced replication is a MUST in SSB2, so that we can forget the past.

But even if we don't require the first message, replicating the DAG might still be difficult with the algorithms we have dreamed up so far.

Suppose Alice has (just for conversation's sake, let's use sequence numbers) messages 50–80 from Carol's feed, and Bob has messages 60–90 from Carol's feed. So when Alice syncs with Bob, she wants messages 81–90 from Bob, but if Bob uses a merkle tree or a bloom clock stamp, he'll compute it over 60–90 while Alice computes hers over 50–80. They have different starting points.

Worse, let's consider the case where Alice has 50–80 and Bob has 60–80. They are both up to date with each other, and nothing needs to be sent over the wire. But their bloom clock stamps and merkle trees are going to differ, merely because their starting points differ.

With linear feeds (and sequence numbers), it's easy to just send these starting points and end points, but with a DAG we don't have sequence numbers. So what representation are we going to use to compress "my DAG" and "their DAG" and compare them?
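
For contrast, here is a minimal sketch of the linear-feed case, where exchanging start and end sequence numbers is enough. The types and function names are hypothetical, not an existing SSB API:

```typescript
// Hypothetical sketch of sliced replication over a *linear* feed, where a
// contiguous range of sequence numbers fully describes what a peer holds.
type Range = { start: number; end: number }; // inclusive sequence numbers

// Given my range and the remote peer's range, compute which of my messages
// the remote peer is missing at the tip (older messages below their start
// were deliberately forgotten, per sliced replication).
function messagesToSend(mine: Range, theirs: Range): Range | null {
  if (mine.end <= theirs.end) return null; // they are already up to date
  // Send everything past their tip, limited to what I actually have.
  return { start: Math.max(theirs.end + 1, mine.start), end: mine.end };
}

// The example from above: Alice has 50..80, Bob has 60..90.
const alice = { start: 50, end: 80 };
const bob = { start: 60, end: 90 };
console.log(messagesToSend(bob, alice)); // { start: 81, end: 90 }
console.log(messagesToSend(alice, bob)); // null: Alice has nothing Bob wants
```

With a DAG there is no total order, so no pair of numbers like `(start, end)` describes what each peer holds; that missing representation is exactly what the question above is asking for.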

Glossary

gpicron commented 1 year ago

Then I repeat the question, because I don't understand how what you are currently implementing copes with the general case.

erikmav commented 1 year ago

On the various global parameters for the anchor algorithm: in various places above, the following points at which an anchor is generated are mentioned:

  * once per time interval T
  * once every 100 messages
  * once every 10KB of published messages

I presume these are joined by an "or" operator. Based on later discussion it seems like anchors are an accepted concept for SSB2. Questions related to these parameters:

  1. Let's say a user is a very infrequent publisher, such that one message a month is typical, with longer gaps in the 2-6 month range, and that this person does not open Manyverse-SSB2 or other SSB2 implementations except to publish. Is it acceptable for a feed to have only 99 messages or 9KB of messages published since its last anchor, yet still have no new anchor after a week? If so (and I agree with this case), the "once per time interval T" rule must be specified to mean "only if the key pair for this feed has not surpassed the other limits and has been connected to an SSB2 feed after the last-published non-anchor message," or something similar that allows a dangling tail message, potentially forever, with no new anchor.
  2. These limits need to be finalized for the whole network as part of finishing the design discussion, prototyping, and moving to a formal spec. As an implementer I would expect to encode security mitigations related to these limits, such that I would either prevent feeds from exceeding them (the naive, lowest-cost mitigation) or at least downgrade resources or priority for a feed that exceeds them. Given various discussions of typical SSB message sizes (~0.5KB each?), 10KB seems low or 100 messages seems high. Would the limits be better set as one of (10KB | 20 messages), (25KB | 50 messages), (50KB | 100 messages)?
gpicron commented 1 year ago

Personally, I don't think the number of messages is useful as a threshold.

For the period, I think something like 3 months is largely enough. The main driving threshold should be the amount of data to replicate.

The algorithm described in https://github.com/ssbc/ssb2-discussion-forum/issues/16#issuecomment-1491519470 can be extended so that, in addition to a "wish" date in the past, peers share a wished maximum amount of data per identity. The connected peers can then determine which starting anchors to take using both pieces of information.
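
A rough sketch of how peers could pick a starting anchor from a wish date plus a data budget. It assumes, hypothetically, that every anchor carries a timestamp and that we can compute the byte size of everything published after it; none of these structures are specced:

```typescript
// Hypothetical anchor metadata for one identity's feed.
type Anchor = {
  id: string;
  timestamp: number; // when the anchor was emitted
  bytesAfter: number; // total bytes of messages from this anchor to the tip
};

// Pick the starting anchor for replication, honoring both the "wish" date
// (replicate at least back to here) and the data budget (never more than
// this many bytes). The byte budget wins when the two conflict.
function pickStartingAnchor(
  anchors: Anchor[], // sorted oldest -> newest
  wishDate: number,
  maxBytes: number
): Anchor | undefined {
  // Oldest anchor that still fits in the byte budget:
  const byBudget = anchors.find((a) => a.bytesAfter <= maxBytes);
  // Newest anchor at or before the wish date:
  const byDate = [...anchors].reverse().find((a) => a.timestamp <= wishDate);
  if (!byBudget) return undefined; // even the newest anchor exceeds the budget
  if (!byDate) return byBudget; // no anchor is old enough; take the deepest affordable one
  return byDate.timestamp >= byBudget.timestamp ? byDate : byBudget;
}
```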

I think those params can be defined per context type in a spec registry. We most often think in terms of microblogging feeds, but if we use several CTXs as proposed in that thread, we can be smarter. For instance, for a 'social-contact' context, if we compact a 'snapshot' of changes since the previous anchor together with the anchors, the best strategy is to always take the last known anchor as the ref during replication, and it will probably be the period threshold that guides the emission of anchors.
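
For illustration, such a registry could be little more than a table of per-context parameters in the spec. The context names and numbers below are only the examples floated in this thread, not agreed values:

```typescript
// Hypothetical per-context anchor parameters, as a spec registry might
// define them. Values are illustrative, not agreed upon.
type AnchorParams = {
  maxPeriodMs: number; // emit an anchor at least this often (while publishing)
  maxBytes: number; // ...or once this much data accumulates since the last one
};

const THREE_MONTHS_MS = 90 * 24 * 60 * 60 * 1000;

const anchorRegistry: Record<string, AnchorParams> = {
  'social-post': { maxPeriodMs: THREE_MONTHS_MS, maxBytes: 1_000_000 },
  // For 'social-contact', anchors would carry a compacted snapshot of
  // changes, so the period threshold is expected to dominate in practice.
  'social-contact': { maxPeriodMs: THREE_MONTHS_MS, maxBytes: 1_000_000 },
};
```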

For case 1: in my head, and if I understood well, this is what you explained. When a user posts a message in the app, the app first checks whether the period since the last anchor has expired. If it has, emit an anchor before the new message. Otherwise, check the amount of data accumulated since the last anchor if you add this message: if it is over the threshold, emit an anchor and then the message; else just emit the message.
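
A minimal sketch of that publish-time check, with hypothetical types and thresholds:

```typescript
// Hypothetical state kept per feed, following the logic described above.
type AnchorParams = { maxPeriodMs: number; maxBytes: number };
type FeedState = {
  lastAnchorAt: number; // timestamp of the last emitted anchor
  bytesSinceAnchor: number; // data accumulated since that anchor
};

// Returns true when an anchor must be emitted before the new message:
// either the period has expired, or this message would cross the data
// threshold. Otherwise the message is published on its own.
function shouldEmitAnchorFirst(
  state: FeedState,
  newMessageBytes: number,
  params: AnchorParams,
  now: number = Date.now()
): boolean {
  if (now - state.lastAnchorAt >= params.maxPeriodMs) return true;
  return state.bytesSinceAnchor + newMessageBytes > params.maxBytes;
}
```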

For point 2, I partially gave my opinion with the "registry of context types" as part of the spec containing parameters. From my personal simulations, on real feeds and on generated feeds, I think the thresholds can be much larger than weeks and a few KB. Something like 3 months and 1MB is a small enough granularity for the 'social-post' context, given the total-footprint requirements given by @staltz.