ssbc / ssb2-discussion-forum

not quite tiny, also not quite large

DAG sync #10

Open staltz opened 1 year ago

staltz commented 1 year ago

I want to have a dedicated thread for this topic, since it's complex.

Some comments from #7 by @gpicron:

Instead of being a message on a meta feed that you must replicate, this is a message that is the first message of a feed. There is potentially a link to some creator signing key, and the creator may announce its existence in another feed, but there is no direct coupling.

It also generalizes the replication algorithm: for 2 peers to sync, they sync their knowledge of DAGs rooted at some message, whether for feeds, threads, CRDTs, etc. And it makes creating a feed cheap. You can have several feeds with the same writer key. You can rotate the writer key/algorithm.

I'm concerned about requiring the first message in a feed, because that means we cannot do sliced replication (see the glossary below). Sliced replication is a MUST in SSB2, so that we can forget the past.

But even if we don't require the first message, replicating the DAG might still be difficult with the algorithms we dreamed so far.

Suppose Alice has (just for conversation's sake, let's use sequence numbers) messages 50–80 from Carol's feed, and Bob has messages 60–90 from Carol's feed. So when Alice syncs with Bob, she wants messages 81–90 from Bob, but if Bob uses a merkle tree or a bloom clock stamp, then he'll compute it over 60–90 and Alice will compute hers over 50–80. They have different starting points.

Worse, let's consider the case where Alice has 50–80 and Bob has 60–80. They are both up-to-date with each other, and nothing needs to be sent over the wire. But their bloom clock stamps and merkle trees are going to be different just because of the starting point being different.

With linear feeds (and sequence numbers), it's easy to just send these starting points and end points, but with a DAG we don't have sequence numbers. So what representation are we going to use to compress "my DAG" and "their DAG" and compare them?

Glossary

staltz commented 1 year ago

I got some exciting thoughts going on.

(First of all, on principles: we should aim for a design that is simple to implement and reason about. It might not be the most computer sciency approach, and it might not be optimal/tight in terms of performance or storage, but it is simple to implement and good enough in all other aspects.)

I considered the idea of having "synchronization points"; other names for this are "consensus points" or "anchors". Let's use the name "anchors" because it's cute. ⚓

An anchor is a msg on the feed which merges several previouses and occurs at a rare-enough frequency. Let's say the frequency is once per week, or once every 100 messages. The anchor message would have a special field "anchor": true.
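
For illustration only, a hypothetical anchor message could look roughly like this (the field names are placeholders, not a spec):

```js
// Hypothetical shape of an anchor message (illustrative sketch, not a spec).
// It merges all current tips of the feed via its previous-links and is
// flagged so peers can cheaply find it when negotiating a sync start point.
const anchorMsg = {
  author: '@carol.ed25519',             // writer key (placeholder)
  previous: ['%idOfTipA', '%idOfTipB'], // merges every current head of the DAG
  anchor: true,                         // the special field mentioned above
  timestamp: 1680000000000,
  content: null,                        // anchors need not carry app content
};
```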

When Alice and Bob want to sync their knowledge of Carol's feed, they exchange with each other their latest anchor msg ID. Based on that, they can reach a consensus on which anchor to use as the starting point for the DAG comparison. The common anchor between Alice and Bob could be e.g. the "newest" anchor they both have.

Given the common anchor, each peer traverses forward in the DAG, building a representation using some data structure such as bloom clock or whatever.

Any given peer may have many anchors; this depends on how much storage each peer wants to use up. It would be good to have a globally agreed-on frequency. I think e.g. 10 kB worth of messages might be a candidate for the anchor frequency.

Now here's the catch. If one of your devices was offline for 10 months ("tablet"), and it has zero common anchors with other devices, then the synchronization algorithm would determine that tablet needs to start from scratch, so tablet should delete all messages from this feed, and redownload everything that another device has.

This is a tradeoff: if you use the same account on multiple devices and one of those devices can't catch up, then too bad, it will have to do a full resync. Which isn't too bad! Because a full resync of a small feed is small.

Thoughts?

staltz commented 1 year ago

Also, deletions could be done at anchor points, such that every msg before the (newly elected) "oldest" anchor is deleted.

staltz commented 1 year ago

Diagram of the previous idea:

---
title: Alice's copy of Carol's feed
---
graph LR;
  0["0 ⚓"]-->1-->2a[2] & 2b[2']-->3["3 ⚓"]-->4-->5
---
title: Bob's copy of Carol's feed
---
graph LR;
  0["0 ⚓"]-->1-->2a[2] & 2b[2']
  2b-->3b[3']
sequenceDiagram
    Alice->>Bob: My latest anchor is 3
    Bob-->>Alice: My latest anchor is 0
    Alice->>Bob: I have anchor 0 locally, so let's agree it's our common anchor
    Bob-->>Alice: Ok
    Alice->>Bob: My bloom clock from 0 to latest
    Bob-->>Alice: My bloom clock from 0 to latest
    Alice->>Bob: you are missing 3, 4, 5
    Bob-->>Alice: you are missing 3'

staltz commented 1 year ago

:bomb: The problem with this idea is if Bob has (for some reason) a forked anchor, say 9 ⚓ versus 9' ⚓. What do we do with forked anchors?

Well, here's a proposal: the common anchor has to be, well, common. And singular, not plural. So peers can just walk back to older anchors until they find a common one.

Diagram of the previous idea:

---
title: Alice's copy of Carol's feed
---
graph LR;
  0["0 ⚓"]-->1-->2a[2] & 2b[2']-->3["3 ⚓"]-->4
---
title: Bob's copy of Carol's feed
---
graph LR;
  0["0 ⚓"]-->1-->2a[2] & 2b[2']-->3["3' ⚓"]-->4[4']
sequenceDiagram
    Alice->>Bob: My latest anchor is 3
    Bob-->>Alice: My latest anchor is 3'
    Alice->>Bob: Unrecognized. My SECOND latest anchor is 0
    Bob-->>Alice: My second latest anchor is 0
    Alice->>Bob: Let's agree our common anchor is 0
    Bob-->>Alice: Ok
    Alice->>Bob: My bloom clock from 0 to latest
    Bob-->>Alice: My bloom clock from 0 to latest
    Alice->>Bob: you are missing 3, 4
    Bob-->>Alice: you are missing 3', 4'
---
title: Result after syncing
---
graph LR;
  0["0 ⚓"]-->1-->2a[2] & 2b[2']-->3["3 ⚓"]-->4
  2a[2] & 2b[2']-->3b["3' ⚓"]-->4b[4']

gpicron commented 1 year ago

You get it. This is exactly where I wanted to go with the Feed Bootstrap Message. Let's call it the Anchor Message.

Imagine the following. I will call the base unit of transmission a Bundle, to avoid confusion with the notion of Message in SSB and to be more generic (because a Bundle can also be a Blob in SSB).

Bundle model

Note: the unique identifier of a Bundle is the hash of the Primary Block, in multibase format, using the hashType specified in the signature block if present. If not present, use some default type like SHA512-256, as today. This id can be used to refer to that Bundle in backlinks.

Administrative records.

When the flag "administrative record" is set, the payload contains one administrative record.

Feed Anchor Administrative Record

Several cases depending on the backlinks:

Other relations for backlink

'blob' backlinks

Points to a bundle containing the Blob. Such a bundle may be unsigned, as today, if you wish.

'clock' backlinks

Points to a Common Clock Counter bundle, which allows proving that the bundle was created after some agreed-upon time.

(Partial) Consensus on Elapsed Time.

Actually, what you describe as a common anchor is very close to what I proposed as a "Proof of Elapsed Time" consensus on SSB in https://gpicron.github.io/ssb-clock-spec/01-design/#protocol-for-establishing-a-consensus-on-elapsed-time. It may be used as you described to "query" for the Feed Anchor Administrative Records that were created after some Common Clock counter value. The Common Clock counter can also be used as a "contractual" element regarding the respect of truncation and limits. (Imagine a pub server that does not respect a truncation policy saying the feed should be deleted after 2 years. If the pub is participating in the Consensus on Elapsed Time and has committed to some count of more than 2 years, it cannot pretend that it didn't delete because of local clock drift... you can prove that the time had elapsed and that it was aware of it.) The Clock counter may be an optional block in the Bundle. When one makes a backlink with the 'clock' relation to a Bundle containing that block, one must insert that block in the new Bundle with the counter updated as described in the doc.

Network identity

The network identity used for the handshake can be a Feed... Later we may imagine adding something like TrustChain.

About deletion and update

With that bundle format, and replication going from latest back to the Feed Anchor Administrative Record (or other DAG root), one can update or delete a bundle by creating a new bundle on the same feed (or a later feed) with an "update" backlink and either a new payload or an empty payload (deletion). Nodes that receive that bundle and already have the pointed-to bundle must drop the payload block but should continue to distribute all the other blocks of the bundle.

ahdinosaur commented 1 year ago

I considered the idea of having "synchronization points"

i'm not sure i understand, why do the synchronization points need to be a special message? since each feed is a tangle, what is the difference between a special anchor message and a common point in the feed where the tangle has only one head?

for what it's worth, i'm very excited about the possibility of ssb2 using what i think is the most important learning of ssb1: tangles. if we can solve the question of how to efficiently replicate tangles, we can efficiently replicate anything: users, threads, groups, etc. and since tangle replication is much more targeted, being efficient is a relative metric compared to when we needed to replicate an entire feed just to get a few messages.

gpicron commented 1 year ago

This is because, to sync efficiently (without having to exchange a ton of messages), you ideally need a more limited set of possible common points of comparison, and because such "checkpoints" can offer a solution to a series of concerns. This is not strictly mandatory for syncing, but it has serious advantages and is simpler.

staltz commented 1 year ago

I haven't read yet your longer comment, Geoffrey, but I'll get to it soon.

I came here to drop this important link from Aljoscha on tangle sync (aka "set reconciliation"): https://arxiv.org/abs/2212.13567 and precursor https://github.com/AljoschaMeyer/set-reconciliation.

ahdinosaur commented 1 year ago

thanks @gpicron

just a side note, if we add checkpoints, i wonder if there's a way to use checkpoints as certificates to re-gain log verification, a la bamboo log verification using lipmaa links. since as i understand it so far, ssb2 gives up strict log verification in exchange for partial replication. but can we have both? or at least a design where log verification is optional but doesn't require complete replication.

a simple idea would be having multiple cycle periods in the checkpoints. in the shortest cycle, each checkpoint points to the previous checkpoint. in the next cycle (with a longer period), every 10 checkpoints is a deca-checkpoint, and every checkpoint points to both the previous checkpoint and their previous deca-checkpoint. and every 100 checkpoints is a hecto-checkpoint, and every checkpoint points to all of the previous checkpoint and the previous deca-checkpoint and the previous hecto-checkpoint, and so on...

with some thought, there's probably a not-so-complicated way to encode a structure like lipmaa links in the checkpoints, where there's a shortest path from every checkpoint to the root message whose length grows logarithmically with the number of messages.
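
for concreteness, a rough sketch of that multi-cycle idea, assuming checkpoints are numbered sequentially (deca = every 10th, hecto = every 100th); the numbering and link rule here are assumptions, not a spec:

```js
// Sketch: which earlier checkpoints the n-th checkpoint links back to, under
// the multi-cycle scheme sketched above (assumed sequential numbering).
function checkpointBacklinks(n) {
  const links = [];
  if (n >= 1) links.push(n - 1);                             // previous checkpoint
  if (n >= 10) links.push(Math.floor((n - 1) / 10) * 10);    // previous deca-checkpoint
  if (n >= 100) links.push(Math.floor((n - 1) / 100) * 100); // previous hecto-checkpoint
  return [...new Set(links)];                                // dedupe overlapping targets
}

// checkpointBacklinks(57)  => [56, 50]
// checkpointBacklinks(230) => [229, 220, 200]
```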

anyways, just some musings, cheers. :purple_heart:

gpicron commented 1 year ago

Actually, I tried to make a more explicit example that illustrates that idea of keeping the log trust in #12... Sorry, this is going a bit in all directions, but as we are looking for a solution to various concerns, I suppose it is normal...

gpicron commented 1 year ago

@staltz I have dug into Aljoscha's work since Basel, and we discussed the ideas in https://gpicron.github.io/ssb-clock-spec/ a bit.

To sum up:

  1. Phases 1 and 2 of the algorithm proposed in ssb-clock-spec probabilistically detect and determine the asymmetries between the 2 sets with 4 messages (2 per peer). So the probabilistic reconciliation is done in (number of asymmetries + 4) messages with a bounded false-positive rate. I think the computational complexity is lower with Bloom structures because it is mainly comparisons of integer arrays and bitsets, while fingerprints require fingerprint generation plus comparisons. But the memory usage is different and less CPU-cache friendly, so I'm not 100% sure; that needs tests.
  2. Phase 3 in the proposal, based on Merkle trees, is more or less the same idea as the fingerprint-based set reconciliation in Aljoscha's paper. But as I wrote elsewhere, I think that phase needs to be revised. The false positives from phases 1 and 2 that are not guaranteed to be eventually detected and resolved through repeated multi-peer connections are limited and of very low probability, so I think a simple branch-tip fingerprint check, computationally very cheap (to detect false positives), plus a linear reconciliation, simpler and limited to the branches ending with the same Bloom Clock stamp, is sufficient for such an improbable situation.
arj03 commented 1 year ago

First I'd like to say that I really like your bundle model design @gpicron. Also that it takes care of blobs! I still stand by my stance that complexity is the enemy :) But 👍 on your design, it looks well thought out.

As for the replication, I'm wondering why you need the first phase (the probabilistic one); why not just always do the set reconciliation? I'm assuming you can do a ton of caching to speed things up.

I'll also drop a link to the earthstar implementation of aljoscha's set reconciliation: https://github.com/earthstar-project/range-reconcile

gpicron commented 1 year ago

First of all, this is not completely my design. It is based on the design of the Delay Tolerant Networking Bundle Protocol 7... I have just added roughly 2 extension blocks for chaining.

For phase 1, you are right. It will not make a huge difference in terms of bytes to transfer. A bit more in terms of CPU and memory, because it doesn't require deserialization, and intersections of compressed bitsets are super efficient (that's the basis of all inverted indexes and search engines).

Now, focusing on mobile does not mean that we have to forget about others. Personally I will keep some kind of pub/oasis at home with 3- or 4-hop replication to get a broader sense of the SSB community. When my mobile connects to it, the list it sends to my mobile will be large.

Meanwhile I see some added value in it.

The first one is each peer's contribution to the network. Let's say we decide on a maximum Bloom filter size and number of hashes so that the FP rate is about 0.1% for normal use. That means a peer will receive on average 0.1% of feeds that do not interest it, but that may interest other peers it is connected to in the current session. That seems fair and improves data diffusion. Moreover, the larger the list of feeds one requests from the network, the larger its contribution in return to improving diffusion will be.

The second is the user-controlled anonymization that it allows. It is possible to simply define a set of parameter setups that generate Bloom filters that are always intersectable but with different degrees of FP. That allows one to more or less hide who one is following (the concept of plausible deniability), the counterpart being to contribute more to the network's data diffusion. And in the cases I can imagine where such a feature may be useful, both of these aspects are needed (I think of people in Iran who need to hide their connections and prefer to avoid national infrastructure to transmit info, probably relying more on opportunistic WiFi hotspots).

staltz commented 1 year ago

@gpicron Unrelated to the above, but did you know that Aljoscha's thesis mentions bloom filters? Not quite bloom clocks, but it mentions invertible bloom filters. Page 52 of his thesis: https://github.com/AljoschaMeyer/master_thesis/blob/main/main.pdf

staltz commented 1 year ago

Here's a wild thought: why do we need to send a compressed fingerprint over the wire? Bloom filters or bloom clocks or (Aljoscha's) hashed fingerprints with balanced search trees etc. What if we just sent an array of the message IDs within that range? Even better, let's send the first 4 bytes (or whatever other truncation) of each message ID. This isn't a lot of data!

Suppose that Alice is "200 messages behind" Bob when it comes to replicating Carol's feed. 200 messages is a stretch, that means that Carol has posted a lot, or Alice has been offline for quite some time. Most cases will have less than 200, easily. With 4 bytes for each message hash, that's just 800 bytes, not even 1 kB. Okay, depending on how you measure this it may be different, maybe half a kilobyte, maybe 3 kB. But still, not prohibitive.

I am curious about implementing this naive approach, measuring performance, and seeing what kind of compromises we can make. Message hashes of a single feed are very often unique enough with just 4 bytes, so that's probably going to work when replicating a DAG feed. When replicating a tangle/thread consisting of many authors, then we probably need 4 bytes for the author ID and 4 bytes for the message, so 8 bytes for each msg. But threads/tangles are rarely that big; they might have ~15 comments and ~30 reactions. So realistically, big threads are about 50 messages large: 50 msgs * 8 bytes = 400 bytes.

Notice that this approach is also probabilistic, because the chance of hash collisions is larger when we pick only a few of the first bytes. But SSB should be made for small-world use cases where these collisions are rare, where you don't have thousands of messages from a single feed, and you don't have thousands of friends. It simplifies the algorithms, because you don't need balanced search trees, you don't need cryptographically secure fingerprinting schemes, etc.
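
A minimal sketch of this naive approach (assuming msg IDs are hex-encoded hash strings; the 4-byte prefix is just the truncation discussed above):

```js
// Naive reconciliation sketch: exchange truncated msg IDs instead of a
// compressed fingerprint. `bytes` controls the truncation/collision tradeoff.
const truncate = (msgId, bytes = 4) => msgId.slice(0, bytes * 2); // 2 hex chars per byte

// What I put on the wire for my slice of the DAG:
const summarize = (msgIds, bytes = 4) => msgIds.map((id) => truncate(id, bytes));

// Given the remote peer's truncated IDs, which of my msgs do they appear to lack?
function missingOnRemote(myMsgIds, remoteTruncated, bytes = 4) {
  const remote = new Set(remoteTruncated);
  return myMsgIds.filter((id) => !remote.has(truncate(id, bytes)));
}
// ~200 msgs * 4 bytes ≈ 800 bytes on the wire, as estimated above.
```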

arj03 commented 1 year ago

Another related work: https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html

gpicron commented 1 year ago

@staltz I was not aware of Aljoscha's work until Basel P2P, and I haven't had time yet to investigate all its implications. My proposal is not the result of thesis work and I didn't search the state of the art extensively. Simply, I found the Bloom Clock paper and thought it could be an efficient approach to reduce the complexity and increase the efficiency of the replication/convergence problem in the SSB context. As in my day job I work a lot with probabilistic data structures and compressed integer-list data structures, I rapidly found a way to encode Bloom Clocks for efficient comparisons and intersections (if you are interested, Daniel Lemire has investigated those a lot). Then I derived a potential protocol.

I think Aljoscha's insight would be very useful. I only had a few tens of minutes to explain the idea to him, and we didn't have the opportunity to brainstorm together and exchange further because I had to leave.

gpicron commented 1 year ago

Here's a wild thought: why do we need to send a compressed fingerprint over the wire? Bloom filters or bloom clocks or (Aljoscha's) hashed fingerprints with balanced search trees etc. What if we just sent an array of the message IDs within that range? Even better, let's send the first 4 bytes (or whatever other truncation) of each message ID. This isn't a lot of data!

Suppose that Alice is "200 messages behind" Bob when it comes to replicating Carol's feed. 200 messages is a stretch, that means that Carol has posted a lot, or Alice has been offline for quite some time. Most cases will have less than 200, easily. With 4 bytes for each message hash, that's just 800 bytes, not even 1 kB. Okay, depending on how you measure this it may be different, maybe half a kilobyte, maybe 3 kB. But still, not prohibitive.

I am curious about implementing this naive approach, measuring performance, and seeing what kind of compromises we can make. Message hashes of a single feed are very often unique enough with just 4 bytes, so that's probably going to work when replicating a DAG feed. When replicating a tangle/thread consisting of many authors, then we probably need 4 bytes for the author ID and 4 bytes for the message, so 8 bytes for each msg. But threads/tangles are rarely that big; they might have ~15 comments and ~30 reactions. So realistically, big threads are about 50 messages large: 50 msgs * 8 bytes = 400 bytes.

Notice that this approach is also probabilistic, because the chance of hash collisions is larger when we pick only a few of the first bytes. But SSB should be made for small-world use cases where these collisions are rare, where you don't have thousands of messages from a single feed, and you don't have thousands of friends. It simplifies the algorithms, because you don't need balanced search trees, you don't need cryptographically secure fingerprinting schemes, etc.

Sending a list of truncated message ids is just another "compression" option. But the issue, I think, is that its size is proportional to the size of the list. This naive approach may work, but only if you are sure that this list size will never explode. The value of probabilistic data structures and trees is that they are either constant size or log(N) size.

I know you are focused on mobile p2p exchanges, but mobile devices will also connect to "server" and "desktop" devices that will probably be open to storing much more content and replicating more history. And those devices will also connect to each other, so potentially exchanging large lists.

So, in general, what I have seen in other contexts in my day job (mainly related to large distributed datasets) is a hybrid approach: if the list is small enough, send the list; otherwise send a probabilistic structure (Bloom-inspired, cuckoo hashes, Merkle-tree-inspired) to bound the maximum size of the data transferred and stored. In the big-data world, our enemy is unbounded I/O in general, both on disk and on the network, as well as transcoding data before exploitation. That's probably why I'm a bit biased and focus a lot on size-controlled data structures and encodings that permit in-place operations without transformation (and also why sometimes every single byte per entry counts and needs to bring enough value, because in some cases it is multiplied by billions). But I think most of these concerns can be transposed to the p2p and edge-computing worlds.

staltz commented 1 year ago

Yes, that's very reasonable @gpicron. We should consider and defend against unbounded I/O, but in my opinion that problem already exists in SSB with the actual feed messages, not with the fingerprint during reconciliation. I think the actual content transmitted is more concerning than the reconciliation fingerprint.

Currently it's possible to break SSB by intentionally crafting an enormous feed and then befriending it in the main network. Due to hops, this can potentially break a lot of peers.

In my opinion we should address that kind of problem before we address unbounded fingerprint sizes.

staltz commented 1 year ago

Note also that we could start with the naive approach, and then later upgrade the system to a bounded/compressed fingerprint approach if it gains us more than it burdens us. I really value shipping code early and seeing how it behaves in the real world. In our case with SSB2, the "shipping" consists of making experiments such that we can run test networks between ourselves and measure results.

gpicron commented 1 year ago

Another related work: https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html

Actually that is very similar to my approach, except they use standard Blooms instead of counting Blooms. It makes me think that maybe there is not so much added value in using counting Blooms. I need to reanalyse the consequences. In case it has no added value, then I would revise, but maybe with a cuckoo filter instead of a Bloom filter. Cuckoo filters have some interesting properties. Somewhat you can see one as a "list of truncated hashes" like @staltz proposed earlier, where we could regulate the "truncation" parameter dynamically to bound the size.

Currently it's possible to break SSB by intentionally crafting an enormous feed and then befriending it in the main network. Due to hops, this can potentially break a lot of peers.

I agree. And I think that is something you can solve with the anchors while preserving the eventual-consistency property of the chains. Add to the anchors the policy that the total size of the messages linked to one cannot go over some threshold, and one cannot craft such a feed anymore. As soon as it tries to, it will be blocked by replication... To be less restrictive, we can imagine slowing transmission as "punishment" instead of immediately blocking.

Note also that we could start with the naive approach, and then later upgrade the system to a bounded/compressed fingerprint approach if it gains us more than it burdens us. I really value shipping code early and seeing how it behaves in the real world. In our case with SSB2, the "shipping" consists of making experiments such that we can run test networks between ourselves and measure results.

+++ ;-) Note there are some interesting simulator frameworks for DTN networks (and basically SSB is a DTN network, whatever the feed format). A good resource: https://github.com/dtn7/awesome-dtn. I will try to look at those a bit.

staltz commented 1 year ago

(This reply was written before I saw your latest comment)

Another related work: https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html

Wonderful, thanks @arj03 ! This convinced me we should probably use Bloom filters after all, the FP rate is much lower than my naive idea of sending truncated hashes. I also really like Martin's algorithm because it relies on the (always correct) "no" answers, and after a few roundtrips (perhaps 3?) you are very very sure you have reconciled.

I feel like we have pseudocode already:

  1. Determine "anchors" / "checkpoints" in the DAG
  2. Tell the remote peer your latest (sorted) 10 checkpoints
  3. Receive from the remote peer their latest (sorted) 10 checkpoints
  4. Find the "most recent common checkpoint" as the starting point for the range
  5. If there is no common checkpoint,
    • And if I'm behind, I drop the whole DAG locally and download the remote peer's DAG instead
    • And if they are behind, they should drop their whole local DAG and I will upload my whole DAG
    • Then conclude this algorithm
  6. Else if there is a common checkpoint, then create a slice of the DAG from that checkpoint and forwards
  7. Create a bloom filter for my DAG slice
  8. Send the bloom filter, receive the remote peer's bloom filter
  9. For every msg in my DAG slice that is not in the remote peer's bloom filter, upload it
  10. Similarly, receive msgs from the remote peer, and persist them to disk
  11. Create a bloom filter for my DAG slice MINUS the messages I already sent to the remote peer
  12. Go back to step 8, until step 11 is an empty bloom filter

staltz commented 1 year ago

And I think that is something you can solve with the anchors while preserving the eventual-consistency property of the chains. Add to the anchors the policy that the total size of the messages linked to one cannot go over some threshold, and one cannot craft such a feed anymore. As soon as it tries to, it will be blocked by replication... To be less restrictive, we can imagine slowing transmission as "punishment" instead of immediately blocking.

@gpicron I fully agree!

staltz commented 1 year ago

Go back to step 8, until step 11 is an empty bloom filter

Actually, I don't think an empty bloom filter would ever appear, so this would just loop infinitely. For instance, take the case where Bob and Alice are already synchronized: their bloom filters are going to be non-empty, but they also won't send any missing messages to each other over the wire.

But another solution for this is that we could use slightly different bloom filters on each of the iterations.

The way I understand bloom filters is that, given an item a, the filter produces (via a hash function) positions, say 1, 3, 4, 8, and those positions are then flipped to 1 in the bit array. On the next iteration, we could produce different positions using a different hash function, say a becomes 2, 3, 5, 6. While the FP rate stays the same per iteration, globally it becomes less likely that a is a false positive on all iterations.
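
For reference, a minimal Bloom filter sketch along those lines (the sizes and hash counts here are arbitrary; a real implementation would pick them from the target FP rate):

```js
const crypto = require('crypto');

// Minimal Bloom filter sketch: k bit positions per item, derived from one hash.
class Bloom {
  constructor(m = 1024, k = 4) {
    this.m = m;                       // number of positions in the filter
    this.k = k;                       // positions per item (k <= 8 with one SHA-256)
    this.bits = new Uint8Array(m);    // one byte per position, for simplicity
  }
  positions(item) {
    const digest = crypto.createHash('sha256').update(item).digest();
    const pos = [];
    for (let i = 0; i < this.k; i++) {
      pos.push(digest.readUInt32BE(i * 4) % this.m); // k positions from one digest
    }
    return pos;
  }
  add(item) {
    for (const p of this.positions(item)) this.bits[p] = 1;
  }
  has(item) {
    return this.positions(item).every((p) => this.bits[p] === 1);
  }
}
```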

staltz commented 1 year ago

Damn it Dominic, why do you show up everywhere I go? https://github.com/cry/jsbloom/commits?author=dominictarr

gpicron commented 1 year ago

I think we should use a Cuckoo filter instead of a Bloom filter. Its properties are nearer to counting Bloom filters, and I think it will be easier for you to reason about it in your pseudocode. You can just consider it like a list of truncated ids.
https://github.com/huydhn/cuckoo-filter is an efficient scalable cuckoo filter implementation that I used in the past.

I can rapidly transpose the code to Nim, so that you will have a JS library and I have a native library...

staltz commented 1 year ago

@gpicron What do you think about varying the cuckoo-internal hash over our iterations? This library you shared has a non-customizable hash function https://github.com/huydhn/cuckoo-filter/blob/34af8ade297d7aaa88fdd998c3b589546d5b0cef/cuckoo/filter.py#L139-L142

gpicron commented 1 year ago

Introducing a customizable hash is not a problem. The most interesting part of the library I shared is the Scalable variant, which self-adapts to the size of the list to maintain a bounded FP rate with the minimum size footprint.

staltz commented 1 year ago

Oh, silly me, we don't need a customizable hash function, we just need to change the input. When adding a to the filter, we just need to append the iteration number to it. I.e. insert a1 into the bloom filter on the first iteration, insert a2 into the (new) bloom filter on the second iteration, etc.
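
In code, reusing the Bloom sketch above, that could look like this (a sketch, not the actual implementation):

```js
// Vary the effective hash per round by appending the round number to each item,
// so a false positive for msg `a` in round 0 is independent of round 1, etc.
function bloomForRound(msgIds, round, m = 1024, k = 4) {
  const bf = new Bloom(m, k); // Bloom class from the sketch above
  for (const id of msgIds) bf.add(`${id}:${round}`); // e.g. "a:0", then "a:1", ...
  return bf;
}
```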

staltz commented 1 year ago

I decided I want to first build https://github.com/ssbc/minibutt-spec/issues/14 so that there's a flexible foundation on which to build these replication experiments. As is, minibutt (feed format) and DAG sync involve quite a lot of ssb-db2 kung fu skills, and we may save time if we experiment with these ideas on another kind of database.

staltz commented 1 year ago

Okay, I have ssb-memdb in an okay shape to start some experiments.

An important difference between ssb-memdb and ssb-db2 is a small tweak in ssb.db.add that allows the first message of a feed to be any message. In ssb-db2, when you call ssb.db.add for the first time for a feed, it validates that the message MUST have sequence=1. In ssb-memdb, the 1st message is inserted with OOO validation and the remaining ones are inserted with normal validation. This is to enable sliced replication.

I started writing a dagsync proof of concept here: https://github.com/staltz/dagsync

See the test that replicates a classic feed in sliced replication: https://github.com/staltz/dagsync/blob/e756bab8ba28329df0ae1723fcad8a7058f12134/test/classic-sliced.js

I realized that it's easier to start experimenting with just DAG sync on the classic feed format, because minibutt is more about solving "nested feeds" (subfeed replication) and same-as (one account for all devices); there is nothing yet in minibutt that is new when it comes to sliced replication.

Next up, I'll try doing DAG sync algorithm on a thread tangle.

And then after that I'll convert the DAG sync APIs into a duplex stream. Right now it's just manual request-response between Alice and Bob.


One thing I learned about the pseudocode is that we shouldn't upload the missing message as soon as we notice it's not in the remote peer's bloom filter. Two reasons for that: (1) when doing sliced replication we want to be sure there are no holes, so given a range of msg sequences, we want to be sure we get ALL of the messages in that range, but due to Bloom filter false positives, we could miss some, (2) thus we need to do multiple iterations of bloom filter checking before we add any message, to make sure all the messages are added in order without holes.

The updated pseudocode would look like:

  1. Determine "checkpoints" in the DAG
  2. Tell the remote peer your latest (sorted) 10 checkpoints
  3. Receive from the remote peer their latest (sorted) 10 checkpoints
  4. Find the "most recent common checkpoint" as the starting point for the range
  5. If there is no common checkpoint,
     - And if I'm behind, I drop the whole DAG locally and download the remote peer's DAG instead
     - And if they are behind, they should drop their whole local DAG and I will upload my whole DAG
     - Then conclude this algorithm
  6. Else if there is a common checkpoint, then create a slice of the DAG from that checkpoint and forwards
- 7. Create a bloom filter for my DAG slice
+ Let i = 0
+ 7. Create a bloom filter (using iteration number `i`) for my DAG slice
  8. Send the bloom filter, receive the remote peer's bloom filter
- 9. For every msg in my DAG slice that is not in the remote peer's bloom filter, upload it
+ 9. For every msg in my DAG slice that is not in the remote peer's bloom filter, send its msg *ID*
-10. Similarly, receive msgs from the remote peer, and persist them to disk
+10. Similarly, receive msg *IDs* from the remote peer, and "ghost add" it to my DAG slice
-11. Create a bloom filter for my DAG slice MINUS the messages I already sent to the remote peer
+11. If i < 4, increment i and go back to step 7, else go to step 12
-12. Go back to step 8, until step 11 is an empty bloom filter
+12. Ask for the remote peer to give me all the msgs which I "ghost added"

There are 3 clear phases in the algorithm:

  1. Determining the common range
  2. Bloom filter iterations to exchange msg IDs
  3. Materializing msg IDs into msgs
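
A rough single-peer sketch of phases 2 and 3, reusing `bloomForRound` from above; `peer` is a hypothetical RPC handle, and this is not the actual dagsync code:

```js
// Phases 2-3 from one peer's point of view (sketch). `localSlice` is the array
// of msg IDs in my DAG slice from the common checkpoint forwards.
async function syncSlice(localSlice, peer, rounds = 3) {
  const ghosts = new Set(); // msg IDs the remote has and I lack ("ghost added")
  for (let i = 0; i < rounds; i++) {
    const myBF = bloomForRound([...localSlice, ...ghosts], i);
    // Hypothetical round-trip: send my BF, get back their BF plus the IDs
    // they think I'm missing (judged against the BF I sent).
    const { remoteBF, idsImMissing } = await peer.exchangeBloom(i, myBF);
    for (const id of idsImMissing) ghosts.add(id); // ghost-add, don't fetch yet
    // IDs I think they are missing (subject to BF false positives):
    const idsTheyMiss = localSlice.filter((id) => !remoteBF.has(`${id}:${i}`));
    await peer.sendMissingIds(i, idsTheyMiss);
  }
  // Phase 3: only now materialize the ghost IDs into actual msgs, so the slice
  // can be persisted in order, without holes.
  return await peer.fetchMsgs([...ghosts]);
}
```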
staltz commented 1 year ago

Here is the general algorithm in JS, now working for both tangle ("thread") sync and sliced feed replication: https://github.com/staltz/dagsync/blob/ea0708a60b99beddfb472a45b8fd7a7895ad7e6d/algorithm.js

gpicron commented 1 year ago

idea to avoid rebuilding BC in iteration

Maintain the BF in one map/array per peer, like the version clock per peer in EBT. But instead of one slot per (feedID, last sequence) pair, you keep (peer BF, my BF) for all agreed common DAGs. Each time you receive a message, add it to your BF (and update your BF for the other peers too if needed), and each time you send a message to a peer, update that peer's BF. After each update of one of the BFs, check whether the BFs are equal. If yes, check the DAG for gaps. If there are some, send a kind of NACK listing the detected gaps to the peer (the peer should send you the missing ones); note that if a gap is due to a FP, it will not update the BF anymore. When you have received all the missing ones, send the message ID (hash) of each head per DAG to the peer (which is normally doing the same). Only then, if they don't match, you may need to reiterate with a (maybe larger) Bloom filter, but you can exclude the DAGs that are complete.

staltz commented 1 year ago

Okay, I now implemented a streaming protocol, such that both peers can get updates from each other.

See the test suite for a simple explanation of the use cases: https://github.com/staltz/dagsync/tree/master/test
See the streaming implementation, which follows the diagram below: https://github.com/staltz/dagsync/blob/master/lib/stream.js

sequenceDiagram
  participant A as Alice
  participant B as Bob
  A->>B: send local range for ID
  B->>A: send common range for ID
  A->>B: send bloom round 0
  B->>A: send bloom round 0 and missing round 0 msg IDs
  A->>B: send bloom round 1 and missing round 0 msg IDs
  B->>A: send bloom round 1 and missing round 1 msg IDs
  A->>B: send bloom round 2 and missing round 2 msg IDs
  B->>A: send missing round 2 msg IDs
  A->>B: send missing msgs
  B->>A: send missing msgs

This is still missing the corner cases when there is no common range ("step 5").


idea to avoid rebuilding BC in iteration

What is "BC"?

@gpicron I'll go through your suggestions carefully, later. I do agree with you and believe there are ways of sending even less data back and forth.

gpicron commented 1 year ago

BF, not BC, sorry. (BC is for Bloom Clock...)

Note : the reasons to avoid looping with BF are mainly because it resonates with a personal past experience of infinite loop... but I didn't look at your code in detail...

note 2: now I remember I was planning to use BC instead of BF: because it detects those gaps. False positive can happen only on full branches that would produce the same end bloom clock stamp and therefore you only need to check the tips when they appears in sync

staltz commented 1 year ago

idea to avoid rebuilding BC in iteration

Maintain the BF in one map/array per peer, like the version clock per peer in EBT. But instead of one slot per (feedID, last sequence) pair, you keep (peer BF, my BF) for all agreed common DAGs. Each time you receive a message, add it to your BF (and update your BF for the other peers too if needed), and each time you send a message to a peer, update that peer's BF. After each update of one of the BFs, check whether the BFs are equal. If yes, check the DAG for gaps. If there are some, send a kind of NACK listing the detected gaps to the peer (the peer should send you the missing ones); note that if a gap is due to a FP, it will not update the BF anymore. When you have received all the missing ones, send the message ID (hash) of each head per DAG to the peer (which is normally doing the same). Only then, if they don't match, you may need to reiterate with a (maybe larger) Bloom filter, but you can exclude the DAGs that are complete.

@gpicron This is an optimization applicable just internally in a peer, correct? That is, to avoid rebuilding the BF? I think there's an obstacle to doing that: the "items" inserted into the BF are prefixed with the round ('iteration') number, so I cannot build remote peer's round 2 BF given their round 1 BF.

So if it's not a change in the protocol over the network, I don't think we need to worry about this now. It can be optimized later.

Note : the reasons to avoid looping with BF are mainly because it resonates with a personal past experience of infinite loop... but I didn't look at your code in detail...

I actually went with a fixed amount of rounds: 3. So there's no risk of infinite loop.

note 2: now I remember I was planning to use BC instead of BF: because it detects those gaps. False positive can happen only on full branches that would produce the same end bloom clock stamp and therefore you only need to check the tips when they appears in sync

Wouldn't a sufficient number of BF rounds also expose these gaps? It's also easier to find Bloom Filter implementations ready, in comparison to Bloom Clock implementations. By the way, about the number of bloom rounds needed, I think we can have a fixed number of rounds, while the dynamic variable can be the size of the bloom filter. We just need to calculate the "compound" false positive rate, which is the probability of getting false positives on ALL rounds.
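
For example, assuming the rounds are independent (which varying the filter per round aims to provide), the compound false positive rate is roughly the per-round rate raised to the number of rounds:

```js
// Compound false-positive rate across independent Bloom rounds (illustrative).
const compoundFP = (perRoundFP, rounds) => Math.pow(perRoundFP, rounds);

compoundFP(0.01, 3);  // 1e-6: a 1% per-round FP rate becomes one in a million
compoundFP(0.001, 3); // 1e-9
```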

staltz commented 1 year ago

Open Question

While I'm happy with DAG sync as it is now, I think we still need to address the following scenario:

gantt
    title Carol's feed
    dateFormat  YYYY-MM-DD
    axisFormat %W
    section Alice
    Has :2022-01-01, 50w
    section Bob
    Has:b1, 2022-01-01, 3w
    Does NOT want:crit, b2, after b1, 44w
    Wants:done, after b2, 3w

How should Bob and Alice communicate such that (1) Bob knows he's about to receive msgs 48–50, (2) deletes msgs 1–3 before receiving the new msgs, (3) Alice sends msgs 48–50?

staltz commented 1 year ago

Updated pseudocode

Updating the pseudocode/protocol with the latest implementation proof of concept. A "range" is a simple tuple [min, max] that defines a region of the "bundle". For instance, for linear feed formats, the range is [minSequence, maxSequence], where sequence is the increasing number in the message metadata. For threads, the range is [timestamp of root msg, timestamp of latest reply minus timestamp of root msg].

An "empty range" is whenever [min, max] is such that min > max.

sequenceDiagram
  autonumber
  participant A as Alice
  participant B as Bob
  note over A: I want to sync bundle "ID"
  A->>B: Send local range for ID
  opt Received range is empty
        B->>A: All msgs in my range
        note over A: done
  end
  Note over B: Calculate common range based on<br/>my local range and received range
  B->>A: Send common range for ID
  opt Common range is empty
        A->>B: All msgs in my range
        note over B: done
  end
  A->>B: Send BF for round 0
  B->>A: Send BF for round 0 and A's round 0 missing msg IDs
  A->>B: Send BF for round 1 and B's missing round 0 msg IDs
  B->>A: Send BF for round 1 and A' missing round 1 msg IDs
  A->>B: send BF for round 2 and B's missing round 2 msg IDs
  B->>A: Send A's missing round 2 msg IDs
  A->>B: Send B's missing msgs
  B->>A: Send A's missing msgs

staltz commented 1 year ago

I have an idea to solve the "Open Question"!

So far, the concept of "range" has meant "range of messages that I have for this bundle", but we also need to signal the "range of messages that I want for this bundle". Let's then differentiate 3 types

So it seems like we don't need to calculate the "common range", we can just calculate the "give-range" as the intersection of their "want-range" with my "have-range". Then, all my BF calculations will be done on the give-range.

Sounds like it could work!
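
A sketch of that calculation (hypothetical helper, not the dagsync API):

```js
// Give-range = intersection of the remote peer's want-range with my have-range.
// It is empty (min > max) when they don't overlap, i.e. nothing to send.
function giveRange([wantMin, wantMax], [haveMin, haveMax]) {
  return [Math.max(wantMin, haveMin), Math.min(wantMax, haveMax)];
}

// giveRange([48, 50], [1, 50]) => [48, 50]
// giveRange([48, 50], [1, 3])  => [48, 3], which is empty
```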

gpicron commented 1 year ago

idea to avoid rebuilding BC in iteration: Maintain the BF in one map/array per peer, like the version clock per peer in EBT. But instead of one slot per (feedID, last sequence) pair, you keep (peer BF, my BF) for all agreed common DAGs. Each time you receive a message, add it to your BF (and update your BF for the other peers too if needed), and each time you send a message to a peer, update that peer's BF. After each update of one of the BFs, check whether the BFs are equal. If yes, check the DAG for gaps. If there are some, send a kind of NACK listing the detected gaps to the peer (the peer should send you the missing ones); note that if a gap is due to a FP, it will not update the BF anymore. When you have received all the missing ones, send the message ID (hash) of each head per DAG to the peer (which is normally doing the same). Only then, if they don't match, you may need to reiterate with a (maybe larger) Bloom filter, but you can exclude the DAGs that are complete.

@gpicron This is an optimization applicable just internally in a peer, correct? That is, to avoid rebuilding the BF? I think there's an obstacle to doing that: the "items" inserted into the BF are prefixed with the round ('iteration') number, so I cannot build remote peer's round 2 BF given their round 1 BF.

So if it's not a change in the protocol over the network, I don't think we need to worry about this now. It can be optimized later.

Note : the reasons to avoid looping with BF are mainly because it resonates with a personal past experience of infinite loop... but I didn't look at your code in detail...

I actually went with a fixed amount of rounds: 3. So there's no risk of infinite loop.

note 2: now I remember I was planning to use BC instead of BF: because it detects those gaps. False positive can happen only on full branches that would produce the same end bloom clock stamp and therefore you only need to check the tips when they appears in sync

Wouldn't a sufficient number of BF rounds also expose these gaps? It's also easier to find Bloom Filter implementations ready, in comparison to Bloom Clock implementations. By the way, about the number of bloom rounds needed, I think we can have a fixed number of rounds, while the dynamic variable can be the size of the bloom filter. We just need to calculate the "compound" false positive rate, which is the probability of getting false positives on ALL rounds.

Generally speaking, I think the problem is that you want to treat the BF struct as deterministic; those probabilistic/succinct structures are "accelerators". You must think of it as a "step" inserted into a protocol that would work without it; the reason to insert it is to accelerate/reduce the data exchanges in 99.99% of cases. So, I don't think it is a good idea to make several rounds. My advice: make one round, then fall back to a more deterministic step.

Here is a state machine diagram to try to clarify what I mean:

[image: state machine diagram]

gpicron commented 1 year ago

I have an idea to solve the "Open Question"!

So far, the concept of "range" has meant "range of messages that I have for this bundle", but we also need to signal the "range of messages that I want for this bundle". Let's then differentiate 3 types

  • Have-range
  • Want-range
  • Give-range

So it seems like we don't need to calculate the "common range", we can just calculate the "give-range" as the intersection of their "want-range" with my "have-range". Then, all my BF calculations will be done on the give-range.

Sounds like it could work!

Exactly.

Note: I still think the "range" make things more complex than needed in practice. Only the "from" set make sense I think in practice.

In the initiation phase, the peers send each other their "from" points of interest (meaning they want everything from those points, minus what they will say they already have just after). Then, when they send their BF, they tell the other what they already have from those "from" points.
So there is no use in specifying a "to". The "range" sync idea was to fill gaps due to previous incomplete syncs, but that's not needed with the "from" + "BF of what I already know".

staltz commented 1 year ago

I think the problem is that you want to treat the BF struct as deterministic; those probabilistic/succinct structures are "accelerators". You must think of it as a "step" inserted into a protocol that would work without it; the reason to insert it is to accelerate/reduce the data exchanges in 99.99% of cases. So, I don't think it is a good idea to make several rounds. My advice: make one round, then fall back to a more deterministic step.

@gpicron What deterministic step are you suggesting? I understood the first half of your diagram, but Exchange Phase 2 and 3 are not clear for me. I don't think we need a deterministic step, as long as we make the "compound" false positive probabilities really really low. Just like Martin Kleppmann suggested, we just need to have multiple BF rounds, and the probability reduces exponentially. I think this is preferable because it simplifies the implementation, and because we are anyway always dealing with probabilities in SSB. For instance, we have always treated message IDs as deterministic, but they too are probabilistic: it is possible to get hash collisions and thus two different messages would have the same ID. However, you need huge scale to hit these cases, and we can rely on SSB being small scale use cases.

In the initiation phase, the peers send each other their "from" points of interest (meaning they want everything from those points, minus what they will say they already have just after). Then, when they send their BF, they tell the other what they already have from those "from" points. So there is no use in specifying a "to". The "range" sync idea was to fill gaps due to previous incomplete syncs, but that's not needed with the "from" + "BF of what I already know".

If you try to apply this idea to the "open question" example I gave, how would it work? Bob's "from" would be 0 – Infinity, and then Bob would send his BF signalling that he has 0 – 3, but Alice doesn't know that Bob only wants the latest 3 messages, so Alice is going to give him msgs 4 – 50, but that's not what he wants. He wants only 48 – 50.

In other words, Bob's "from" of interest is anything, but that doesn't mean he wants everything.

staltz commented 1 year ago

Updated the protocol to have want-ranges and have-ranges; it didn't add more round trips in the default case, only in the corner case when Alice (the initiator) has an empty have-range.

sequenceDiagram
  autonumber
  participant A as Alice
  participant B as Bob
  note over A: I want to sync bundle "ID"
  A->>B: Send local have-range for ID
  opt Alice's have-range is empty
      B->>A: Send local have-range and (empty) want-range for ID
      A->>B: Send local want-range for ID
      B->>A: All msgs in remote want-range
      note over A: done
  end
  Note over B: Calculate want-range based on my<br/>local have-range and remote have-range
  B->>A: Send local have-range and want-range for ID
  opt Bob's have-range is empty
        A->>B: All msgs in remote want-range
        note over B: done
  end
  A->>B: Send local want-range and BF for round 0
  B->>A: Send BF for round 0 and A's round 0 missing msg IDs
  A->>B: Send BF for round 1 and B's missing round 0 msg IDs
  B->>A: Send BF for round 1 and A' missing round 1 msg IDs
  A->>B: send BF for round 2 and B's missing round 2 msg IDs
  B->>A: Send A's missing round 2 msg IDs
  A->>B: Send B's missing msgs
  B->>A: Send A's missing msgs

staltz commented 1 year ago

The good news is that by operating on these want-ranges instead of the "common range" (as before), the ranges are tighter, which means there are fewer messages to go through, and the Bloom filter can be smaller.

gpicron commented 1 year ago

@gpicron What deterministic step are you suggesting? I understood the first half of your diagram, but Exchange Phase 2 and 3 are not clear for me. I don't think we need a deterministic step, as long as we make the "compound" false positive probabilities really really low. Just like Martin Kleppmann suggested, we just need to have multiple BF rounds, and the probability reduces exponentially. I think this is preferable because it simplifies the implementation, and because we are anyway always dealing with probabilities in SSB. For instance, we have always treated message IDs as deterministic, but they too are probabilistic: it is possible to get hash collisions and thus two different messages would have the same ID. However, you need huge scale to hit these cases, and we can rely on SSB being small scale use cases.

First remark: I know you are focused on the mobile case, with a hard limit on dedicated memory/storage. But the design should scale to larger limits for servers and desktops, which have enough CPU and RAM. Otherwise this is no longer SSB but a Manyverse-mobile-only protocol.

Why not loop:

  • In practice it is difficult to determine how many rounds you should do at most to get a certain level of trust. This is mathematically dependent on the data and the resulting BF... If you then put some constant limit on the rounds at, let's say, 3, that may be a security hole. One may craft DAGs (intentionally or not) without much effort that can fail the eventual consistency promise.
  • Prefixing the keys with the round implies constantly rebuilding the Bloom filter. This is just equivalent to changing the hash functions. That's maybe OK with a few hundred messages and 2 peers, but what about EBT (multi-peer) and large apps on servers and desktops? That will not scale well.

In practice, Phase 2 and Phase 3 are simple:

Phase 2: once the BFs are aligned (end of Phase 1), peers look at their stored DAGs for gaps in sequences. If there are some, which should rarely happen, they send a message (NACK) with the list of detected missing messages, specifically asking for those.

Phase 3: once Phase 2 is done (peers have completed the gaps in sequences), there is still a (very small) probability that the two peers have two different branches that produced the same BF but are different (the longer the branches, the higher the probability). So exchanging the branch tips (the hash of the head of each branch) is a fast way to detect that case. You need 32 bytes per branch. If the set of common DAGs is small enough, it can be transmitted as a simple array. For scalability we can Merkleize large sets of common DAGs that would generate more than 2 or 3 TCP packets.
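
A sketch of that phase-3 tip check as I read it (hypothetical message shape; the Merkleized variant for large sets is omitted):

```js
// Phase 3 sketch: compare branch tips. A tip (head) is any msg in the slice
// that no other msg in the slice points back to via its previous-links.
function branchTips(msgs) {
  const referenced = new Set();
  for (const m of msgs) for (const prev of m.previous ?? []) referenced.add(prev);
  return msgs.map((m) => m.id).filter((id) => !referenced.has(id));
}

// If both peers hold the same branches, the sorted tip sets are identical.
function tipsMatch(myTips, theirTips) {
  const a = [...myTips].sort();
  const b = [...theirTips].sort();
  return a.length === b.length && a.every((id, i) => id === b[i]);
}
```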

gpicron commented 1 year ago

If you try to apply this idea to the "open question" example I gave, how would it work? Bob's "from" would be 0 – Infinity, and then Bob would send his BF signalling that he has 0 – 3, but Alice doesn't know that Bob only wants the latest 3 messages, so Alice is going to give him msgs 4 – 50, but that's not what he wants. He wants only 48 – 50.

In other words, Bob's "from" of interest is anything, but that doesn't mean he wants everything.

Ok, I see you have definitively abandoned the check of previous link chain validation...

Question: What are the "last 3 messages" for a forked feed?

ex 1:

* A (ts:10, seqNum: 8) 
* B (ts:8, seqNum: 7) 
| * C (ts:9, seqNum: 7) 
|/  
* D (ts:5, seqNum: 6) 
* E (ts:4, seqNum: 5) 
* F (ts:3, seqNum: 4) 
* G (ts:2, seqNum: 3) 
* H (ts:1, seqNum: 2) 

ABC (based on seqnum and capped by number of messages)? ACB (based on timestamps)? ABDC (based on seqnum only)?

another funny one:

*   A 
|\  
| * B 
* | C 
|\ \  
| |/  
|/|   
| * D 
|/  
*   E 
|\  
| * F 
|/  
| * G 
| * H 
| * I 
| * J 
| * K
| * L
| | * M
| | * N
| | * O
| |/  
| * P
| * Q
| * R
|/  
*   S

What is "last 3" ?

gpicron commented 1 year ago

The last one as a diagram, to make it simpler to read: [image]

staltz commented 1 year ago

I know you are focused on the mobile case, with a hard limit on dedicated memory/storage. But the design should scale to larger limits for servers and desktops, which have enough CPU and RAM. Otherwise this is no longer SSB but a Manyverse-mobile-only protocol.

This is important, so let me dedicate more text to it. Mobile support is a litmus test for SSB2, not a goal. SSB2 should support mobile, but it's not about "just supporting mobile".

What I'm looking for as a goal for SSB2 is sustainability, and this means the inverse of scalability. I believe (based on some measurements I made) that most social applications we find will not require that much data. To be generous, with 500MB you have abundant amounts of content to work with. So the 100MB is not the point, the point is whether we can achieve all our social networking needs (plus social-enabled apps, such as games, whiteboarding, note taking, productivity apps etc) with a finite amount of storage. So I am ruling out unbounded datasets as supported use cases. Unbounded datasets are typically the realm of big data, which is the realm of surveillance capitalism. And it would support other SSB principles if we build for humans, not big data companies. Humans use consumer devices, which are bounded. This is the guiding principle for sustainability in SSB2: "degrowth" and recycling/regenerative consumption of resources.

Now to the technical details:

In practice it is difficult to determine how many rounds you should do at most to get a certain level of trust. This is mathematically dependent on the data and the resulting BF... If you then put some constant limit on the rounds at, let's say, 3, that may be a security hole. One may craft DAGs (intentionally or not) without much effort that can fail the eventual consistency promise.

What if there are (for the sake of argument) 20 rounds, would it still be easy to craft a rogue DAG that hits the false positives? My point is not that we should aim for a "compound" false positive rate that is just enough for our needs, my point is to find a "compound" false positive rate that will be sufficient for all imaginable "normal" use cases. So for instance, a compound probability of one in one trillion would be enough for me.

Prefixing the keys with the round implies constantly rebuilding the Bloom filter. This is just equivalent to changing the hash functions. That's maybe OK with a few hundred messages and 2 peers, but what about EBT (multi-peer) and large apps on servers and desktops? That will not scale well.

Good point; it remains to be seen how this will perform in a real-world network. Scaling to thousands of messages and hundreds of peers is important. But as I said, elastically scaling beyond that is not important.

Phase 2: once the BFs are aligned (end of Phase 1), peers look at their stored DAGs for gaps in sequences. If there are some, which should rarely happen, they send a message (NACK) with the list of detected missing messages, specifically asking for those.

This doesn't work for thread DAGs, where there are multiple authors, and their messages may have the same sequences, e.g.

graph TB;
  r[root msg sequence=1 by Alice]
  r1[reply msg sequence=1 by Bob]
  r2[reply msg sequence=1 by Carol]
  r-->r1-->r2

There are 3 messages here but they all have the sequence number 1.

Ok, I see you have definitively abandoned the check of previous link chain validation...

I have not, because your replica of someone's feed would never have holes, so there is previous-link validation among the messages you have locally, just not all the way back to sequence 1, because you wouldn't have that msg anymore.

Deleting the past is important to achieve bounded storage (sustainability).

Question: What are the "last 3 messages" for a forked feed?

I was describing sliced replication (and deletion of old messages) for a specific instance of DAG sync, which is for linear feed formats. Forks are "game over" in linear feed formats, so there's no need to elegantly handle forks in that case.

That's different for e.g. minibutt, which is a DAG feed format, but that's not the subject in question here. Also, thread DAGs are not linear, and will NOT have deletion of old portions of the thread.

gpicron commented 1 year ago

I will not continue to argue much longer; we do not have the same target for SSB2.

This thread was about DAG sync, i.e. feeds that can fork, threads, and other tangles.
If you now focus on non-forkable feeds, drop chain verification back to the root, and aim for sustainability, why not just modify EBT to transfer messages from some unix timestamp per feed (instead of the last known sequence number per feed)? Peers send the list of (feed, wanted-after timestamp) at connection, and then the other transmits all it has after that timestamp. Straightforward, using the classic format, and it can be rapidly implemented.

staltz commented 1 year ago

Because DAG sync is going to support all these:

Not one of these. All.