ssbc / ssb2-discussion-forum

Tangles everywhere #20

Open staltz opened 1 year ago

staltz commented 1 year ago

Okay, bear with me. :bear: This might be long, but I just had some very exciting thoughts. I'll use the word "tangle" here to mean the same thing as "DAG".

Single-author single-device tangle

This is SSB at the moment

graph RL
E-->D-->C-->B-->A
classDef default fill:#6df,stroke:#fff0

BLACK edges are feed "previous".

PUBLISHING: the new message MUST have previous = hash of the tip of the tangle so far (and there is only ONE tip).

VERIFICATION: when a new message is received, it MUST have previous = hash of the tip of my locally-stored tangle so far.
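A minimal sketch of that check in TypeScript (helper names like `verifySingleDevice` and `localTipHash` are mine, for illustration only, not an actual SSB API):

```typescript
type Msg = {
  previous: string | null; // hash of the previous msg; null only for the root
  // author, content, signature, ...
};

// VERIFICATION for a single-author single-device feed: the new msg MUST
// point at the hash of the one and only tip of the locally-stored tangle.
function verifySingleDevice(msg: Msg, localTipHash: string): boolean {
  return msg.previous === localTipHash;
}
```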

Single-author multi-device tangle

This is vaguely what minibutt would be

graph RL
D1-->C-->B-->A
D2-->C
D3-->C
E-->D1 & D2
classDef default fill:#6df,stroke:#fff0

BLACK edges are feed "previous".

PUBLISHING: the new message MUST have previous = array of hashes of all tips of the tangle so far (there can be many).

VERIFICATION: when a new message is received, every hash in its previous array must refer to a locally-known msg in the tangle so far. The hashes don't have to be tips.

:star: This is a generalization of "single-author single-device tangle".
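A sketch of the generalized rule, under the same assumptions as the sketch above (the single-device case is just the special case where the array has length 1):

```typescript
// VERIFICATION for a single-author multi-device tangle: every hash in
// `previous` must refer to a locally-known msg in the tangle so far.
// They do NOT have to be tips.
function verifyMultiDevice(
  previous: string[],
  locallyKnown: Set<string>, // hashes of all msgs we hold in this tangle
): boolean {
  return previous.length > 0 && previous.every((hash) => locallyKnown.has(hash));
}
```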

Multi-author multi-device tangle

This is what "discussion threads" could be

graph RL
E-->D-->C-->B-->A
linkStyle 0,1,2,3 stroke:#f00,stroke-width:3;
C2[C]-->C
B2[B]-->A
linkStyle 4,5 stroke:#f00,stroke-width:3;
C2-->B2

B2:::bob
C2:::bob
A:::alice
B:::alice
C:::alice
D:::alice
E:::alice

classDef alice fill:#6df,stroke:#fff0
classDef bob fill:#6f6,stroke:#fff0

Blue is one author, green is another author.

BLACK edges are feed "previous".

RED edges are "replies".

PUBLISHING: the new message MUST have replies = array of hashes of all tips of the RED tangle so far.

VERIFICATION: when a new message is received, every hash in its replies array must refer to a locally-known msg in the RED tangle so far. The hashes don't have to be tips.

Here's an important realization I had:

:bulb: Suppose that I follow the blue author, thus I get all their messages, but I don't follow the green author. Notice that there is a missing green "A" msg. That's because we're replicating the green author out-of-order (OOO), which means we are only verifying that the message is signed by the green author, nothing else. So we completely ignore the black edges, i.e. the previous field for green msgs. However, we CAN verify the red edges, i.e. the replies field, so we SHOULD verify those backlinks, because we have all the messages in the red tangle necessary for validation. So in that sense, there is nothing special about the black edges (previous) versus the red edges (replies): all of them serve the purpose of causally ordering msgs!

Feed format to make tangles native

The above (and recent conversations about putting tangle data in the metadata portion of a msg, not the content portion) inspired me to look for a new way of describing backlinks. Since a tangle is identified by its root, I considered putting the root ID as the "tangle name", and the value is an array of causally preceding nodes.

type Msg = {
  metadata: {
    author: string, // e.g. "@QlCT..."
    type: string, // e.g. "post"
    timestamp: number,
    tangles: {
      "@QlCT...": Array<MsgId>, // NOTICE THIS!
      "%coF8...": Array<MsgId>, // NOTICE THIS!
    },
    contentHash: string,
    contentSize: number,
  },
  content: any,
  signature: string,
}

Notice

    tangles: {
      "@QlCT...": Array<MsgId>, // NOTICE THIS!

This is the new previous. Notice that the key for this tangle is my author ID. This means that I am providing edges for the tangle known as "@QlCT..." which is my "feed". The array contains backlinks to the previous messages.

Similarly, the tangle "%coF8...": Array<MsgId> is for some discussion thread.

So if I am replicating the tangle "@QlCT..." (in other words, if I am replicating this feed), then I will verify the backlinks in "@QlCT...": Array<MsgId>.

But if I am replicating the tangle "%coF8..." (in other words, I am replicating a single SSB discussion thread), then I will ignore the other tangles and only verify the backlinks in "%coF8...": Array<MsgId>.

This provides a generalized way of describing any tangle and verifying its backlinks. Tangles should be identified by their root, not by a human-friendly name.
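As a sketch of how verification could dispatch on the tangle being replicated (this is my reading of the proposal, not an existing API):

```typescript
type MsgId = string;

type Msg = {
  metadata: {
    author: string;
    tangles: Record<MsgId, Array<MsgId>>; // keyed by each tangle's root ID
    // type, timestamp, contentHash, contentSize, ...
  };
  content: any;
  signature: string;
};

// Verify a msg with respect to ONE tangle, identified by its root.
// All other tangle entries in the msg are simply ignored.
function verifyForTangle(
  msg: Msg,
  tangleRoot: MsgId,
  knownInTangle: Set<MsgId>, // msgs we already hold for this tangle
): boolean {
  const backlinks = msg.metadata.tangles[tangleRoot];
  if (!backlinks) return false; // msg doesn't claim membership in this tangle
  return backlinks.every((id) => knownInTangle.has(id));
}
```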

staltz commented 1 year ago

Tangles + lipmaa + deletes

After the lunch break now. Put on your seat belts, because I'm gonna mix everything together.

When maintaining a single-author tangle (a "feed"), we need sliced replication (#18); see #19 for an example. One simple system is to have some "anchors", which are the black nodes below.

---
title: Regularly space "anchors"
---
graph RL
J-->I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A

A:::rib
E:::rib
I:::rib

classDef default fill:#6df,stroke:#fff0,color:#000
classDef rib fill:#000,stroke:#fff0,color:#fff

For now let's just suppose you can neatly figure out these anchors. If you can, then "replication granularity" is defined by anchors. You either replicate the range [A, ∞] or [E, ∞] or [I, ∞]. A peer doing sliced replication can inform that they are only interested in replicating from anchor E onwards, for instance.

Suppose that you replicate only [E, ∞]. Then it looks like this:

graph RL
J-->I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A
linkStyle 6,7,8,9,10 stroke:#0002,stroke-width:2;

A:::ribw
B:::defaultw
C:::defaultw
D:::defaultw
E:::rib
I:::rib

classDef default fill:#6df,stroke:#fff0,color:#000
classDef defaultw fill:#6df2,stroke:#fff0,color:#0003
classDef rib fill:#000,stroke:#fff0,color:#fff
classDef ribw fill:#0002,stroke:#fff0,color:#fff

In the messages, you would have one tangle per anchor, with the distinction that they share a lot of edges in common, because they are "overlapping tangles". For instance, the message J would be

J = {
  metadata: {
    ...
    tangles: {
      "A": [I],
      "E": [I],
      "I": [I],
    },
  },
  ...
}

So if I only want to replicate [E, ∞], then I pay attention to the tangle field E in the messages, and once I hit the bottom, I stop validating.

We can do better

One problem with this is that the more anchors we have, the more tangle fields we're going to have, and they will all have the same information, because they are overlapping. They only differ by the "stop" point. We have an unbounded number of overlapping tangles! :warning:

One thing we could do is forget old anchors by construction. That means that J would look like this (notice no more A anchor):

J = {
  metadata: {
    ...
    tangles: {
      "E": [I],
      "I": [I],
    },
  },
  ...
}

This could be good enough! The author of the messages can limit how much of the past you can replicate. Maybe (say) 5 anchors is enough, and anchors 6 or older are just forgotten.

However, this is a production-side mechanism for replication granularity. Ideally we would want consumption-side replication granularity. That is, the peer fetching the data can decide how much data they want. If they want everything, they should be able to get everything! (provided there is availability in the network)

We can do better

Here is where we can do something lipmaa-style.

graph RL
J-->I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A
%% linkStyle 6,7,8,9,10 stroke:#0002,stroke-width:2;
I-->E-->A
linkStyle 11 stroke:#f00,stroke-width:3;
linkStyle 12 stroke:#f00,stroke-width:3;

A:::rib
E:::rib
I:::rib

classDef default fill:#6df,stroke:#fff0,color:#000
classDef defaultw fill:#6df2,stroke:#fff0,color:#0003
classDef rib fill:#000,stroke:#fff0,color:#fff
classDef ribw fill:#0002,stroke:#fff0,color:#fff

Now we introduce a non-overlapping tangle, the red one, whose backlinks reach further back into the past. J would now look like:

J = {
  metadata: {
    ...
    tangles: {
      "A": [I]
      "E": [I],
      "I": [I],
    },
  },
  ...
}

And I is

I = {
  metadata: {
    ...
    tangles: {
      "A": [E],
      "E": [H, E],
    },
  },
  ...
}

And E is

E = {
  metadata: {
    ...
    tangles: {
      "A": [D, A],
    },
  },
  ...
}

The pattern here is: when an anchor message is published ("E"), its tangle entry for the previous anchor ("A") should contain the tips of the tangle ("D") AND the previous anchor itself ("A"). So E = { ..., tangles: { "A": [D, A] }, ... }.
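As a tiny sketch of that publishing rule (names are mine):

```typescript
type MsgId = string;

// When publishing a new anchor, its entry for the previous-anchor tangle
// contains the current tips of that tangle AND the previous anchor itself,
// e.g. E gets tangles: { "A": [D, A] }.
function anchorTangleEntry(
  prevAnchor: MsgId,
  tips: MsgId[],
): Record<MsgId, MsgId[]> {
  return { [prevAnchor]: [...tips, prevAnchor] };
}
```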

So now if a peer is replicating [E, ∞] they can keep A, E, and delete B, C, D.

graph RL
J-->I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A
linkStyle 6,7,8,9,10 stroke:#0002,stroke-width:2;
I-->E-->A
linkStyle 11 stroke:#f00,stroke-width:3;
linkStyle 12 stroke:#f00,stroke-width:3;

A:::rib
B:::defaultw
C:::defaultw
D:::defaultw
E:::rib
I:::rib

classDef default fill:#6df,stroke:#fff0,color:#000
classDef defaultw fill:#6df2,stroke:#fff0,color:#0003
classDef rib fill:#000,stroke:#fff0,color:#fff
classDef ribw fill:#0002,stroke:#fff0,color:#fff

This way, to validate everything replicated in [E, ∞] (= E, F, G, H, I, J), once you replicate E you can validate the "E" tangle, but you can also validate the "A" tangle, because you have a backlink to A inside the E msg. And you can ignore the deleted msgs (B, C, D) and so forth, because they are not part of the "certificate pool".

Now, if you noticed carefully, we still have an unbounded amount of tangle data: the more anchors we have, the more tangle fields each message carries.

But maybe after some coffee I could find an even more lipmaa-like tangle that allows you to validate everything from the past without keeping everything from the past, and without an unbounded number of tangle fields.

staltz commented 1 year ago

Okay, so I sipped my coffee, and now it's time to mix in "DAG Sync" #10 here.

A refresher on how DAG sync works:

This helps a lot: in the example in the previous comment, we don't have to replicate [E, ∞] by stating "I want to replicate E". We actually say "I want to replicate A" (A is the stable root for this tangle), and E is just part of the have-range. This is great!

This means we just have to have one tangle field per "feed replication" in the messages:

    tangles: {
      "A": Array<MsgId>,
      "other tangles unrelated to feed replication go here"
    },

:heavy_check_mark: "Unbounded tangle fields" problem solved. Now we just need to figure out how to create the skiplinks in a tangle.


Here's a strawman proposal. We need to aim for "logarithmically short paths" for validation, which means we need to do something that looks like exponential jumps.

  1. Every time you publish a message in the tangle, count how many causally preceding nodes there are before this one
    • In the case of E above, there are A, B, C, D, thus the count is 4
  2. If the count == 4, then create a skiplink to the message that had count == 0, i.e. root msg
  3. If the count == 8, then create a skiplink to the message that had count == 4
  4. If the count == 12, then create a skiplink to the message that had count == 8
  5. If the count == 13, then create a skiplink to the message that had count == 4
  6. If the count == 17, then create a skiplink to the message that had count == 13
  7. And so forth

If you paid careful attention, these are the same numbers that are used in the lipmaa sequence. However, here we can't really rely on the count being exactly 4 or exactly 8 etc, because tangles may grow in random increments. Linear feeds always grow their "count" of preceding msgs in increments of 1, but because of offline concurrent msgs being published in tangles, msgs F and G (see tangle above) have the same count, and then H has a count that is 2 more than G's.

So in reality the rule would use >= instead of ==.
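For reference, here is one common formulation of the lipmaa skiplink function (I believe this variant matches the one in the bamboo spec; it uses 1-based sequence numbers, so the exact numbers are off by one from the 0-based counts above), plus a sketch of the ">=" adaptation, which is my interpretation of the rule rather than a settled design:

```typescript
// Lipmaa skiplink target for a 1-based sequence number n (bamboo-style).
function lipmaa(n: number): number {
  let m = 1;
  let po3 = 3;
  // find the smallest m = (3^k - 1) / 2 with m >= n
  while (m < n) {
    po3 *= 3;
    m = (po3 - 1) / 2;
  }
  po3 /= 3;
  if (m !== n) {
    let u = n;
    while (u !== 0) {
      m = (po3 - 1) / 2;
      po3 /= 3;
      u %= m;
    }
    if (m !== po3) po3 = m;
  }
  return n - po3;
}

// ">=" adaptation for tangles: exact counts may be skipped, so link to the
// msg whose count is the smallest known count >= the exact lipmaa target.
// Note: a msg may additionally need to cover several skipped counts at once
// (like N covering both "skiplink 12" and "skiplink 13" in the tangle
// below); that is omitted here.
function skiplinkCount(count: number, knownCounts: number[]): number | null {
  const target = lipmaa(count);
  const candidates = knownCounts.filter((c) => c >= target && c < count);
  return candidates.length > 0 ? Math.min(...candidates) : null;
}
```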

Here is a tangle that follows this "lipmaa" >= rule, where each msg shows also what is the count.

graph RL
I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A
I-->E-->A
J-->I & H
K-->J
L-->J
M-->J
N-->K & L & M
O-->N
N-->I & E
P-->N
linkStyle 10 stroke:#f00,stroke-width:3;
linkStyle 11 stroke:#f00,stroke-width:3;
linkStyle 21,22 stroke:#f00,stroke-width:3;

A[A 0]
B[B 1]
C[C 1]
D[D 3]
E[E 4]
F[F 5]
G[G 5]
H[H 7]
I[I 8]
J[J 9]
K[K 10]
L[L 10]
M[M 10]
N[N 13]
O[O 14]
P[P 14]

classDef default fill:#6df,stroke:#fff0,color:#000

Notice the special case with N. Usually lipmaa links in linear logs have 1 "previous" link and 0 or 1 "skiplink". But in this case we didn't have any msg with count exactly equal to 12 (and when N is published, it knows that none of the msgs preceding it had a count = 12), so N had to take up the role of both "skiplink 12" and "skiplink 13".

I think this could work! It might create a few redundant skiplinks sometimes, but it still allows us to delete the past, without losing verification.


Testing out this approach, here is what happens if I want to replicate [I, ∞]. Black msgs are cert pool. Blue messages are "actual content" msgs.

graph RL
I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A
I-->E-->A
J-->I & H
K-->J
L-->J
M-->J
N-->K & L & M
O-->N
N-->I & E
P-->N
linkStyle 10 stroke:#f00,stroke-width:3;
linkStyle 11 stroke:#f00,stroke-width:3;
linkStyle 21,22 stroke:#f00,stroke-width:3;
linkStyle 0,1,2,3,4,5,6,7,8,9,13 stroke:#0002,stroke-width:3;

A[A 0]:::rib
B[B 1]:::defaultw
C[C 1]:::defaultw
D[D 3]:::defaultw
E[E 4]:::rib
F[F 5]:::defaultw
G[G 5]:::defaultw
H[H 7]:::defaultw
I[I 8]
J[J 9]
K[K 10]
L[L 10]
M[M 10]
N[N 13]
O[O 14]
P[P 14]

classDef default fill:#6df,stroke:#fff0,color:#000
classDef defaultw fill:#6df2,stroke:#fff0,color:#0002
classDef rib fill:#000,stroke:#fff0,color:#fff

AljoschaMeyer commented 1 year ago

Any particular reason why you use the count of (logically) older messages rather than the depth (length of a longest path from the message to the root)? A fairly simple alternative scheme would be to have each message of depth d include the hashes of all known messages of depth d - 1 and of all known messages of depth f(n), where f is the function for computing skiplinks in your favorite binary linking scheme.
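A minimal sketch of that depth-based scheme (my illustration; the skip function `f` is left as a placeholder since the choice of binary linking scheme is left open here):

```typescript
type MsgId = string;

// Depth of a new msg: 0 for the root, otherwise 1 + the max depth of the
// msgs it links back to (length of a longest path to the root).
function depthOf(previousDepths: number[]): number {
  return previousDepths.length === 0 ? 0 : 1 + Math.max(...previousDepths);
}

// Links for a new msg of depth d: all known msgs of depth d - 1, plus all
// known msgs of depth f(d) for some binary linking scheme's skip function f.
function linksFor(
  d: number,
  byDepth: Map<number, MsgId[]>, // known msgs grouped by depth
  f: (d: number) => number,
): MsgId[] {
  const prev = byDepth.get(d - 1) ?? [];
  const skips = byDepth.get(f(d)) ?? [];
  return [...new Set([...prev, ...skips])];
}
```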

I'm not buying into the concept of anchors, but this scheme works regardless of whether you have them or not.

staltz commented 1 year ago

Count: strawman!

Using graph depth is a pretty sweet idea! I'll explore that tomorrow.

Anchors: they were implicitly dropped as soon as I started describing Lipmaa Tangles.

There is still an open problem with Lipmaa Tangles: how to determine the have-range and want-range for the DAG Sync algorithm. I have a hunch the extremities of the range (which is a tuple) have to be "merge nodes" in the tangle.

gpicron commented 1 year ago

Ok, I think I was bad at explaining myself... I see a lot of what I tried to propose earlier. Note, among others, the generalization of "backlinks" in https://github.com/ssbc/ssb2-discussion-forum/issues/10#issuecomment-1455207795 and the generalization of the anchor and replication principles in https://gpicron.github.io/ssb-clock-spec/

There is a lot and I will try to comment in order:

Attention point 1

In the first message, you forgot one key point: Timestamps.

Yes, timestamps are unreliable for expressing verifiable causality as such, but... there are two verification rules that are important to keep in mind and can be very useful:

  1. the timestamp of a message MUST be strictly larger than that of any message referred to by its backlinks (previous, tangles, etc.) for the message to be valid
  2. the timestamp MUST be smaller than the receiver's own clock (during replication, the receiver of a message with a timestamp larger than its own local clock should at least delay it until its own local clock reaches the timestamp)
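A sketch of these two rules (my illustration, reusing the Msg shape from the first comment; the `lookup` and `localClockNow` parameters are assumptions):

```typescript
type Msg = {
  metadata: {
    timestamp: number;
    tangles: Record<string, string[]>;
  };
};

// Rule 1: the timestamp must be strictly larger than that of every
// msg referred to by the backlinks.
function rule1(msg: Msg, lookup: (id: string) => Msg | undefined): boolean {
  return Object.values(msg.metadata.tangles)
    .flat()
    .every((id) => {
      const prev = lookup(id);
      return prev !== undefined &&
        msg.metadata.timestamp > prev.metadata.timestamp;
    });
}

// Rule 2: the timestamp must be smaller than the receiver's own clock;
// otherwise the msg should be delayed until the local clock catches up.
function rule2(msg: Msg, localClockNow: number): boolean {
  return msg.metadata.timestamp < localClockNow;
}
```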

Actually, that can be seen as a local consensus on the elapsed time since the root of the tangle or, more generally, since any "anchor" joining several branches. That's important information that you can use:

Attention point 2

I'm not convinced that lipmaa "jumps" are needed; maintaining the anchor chains has a minimal memory footprint and makes life easier. https://github.com/ssbc/ssb2-discussion-forum/issues/18#issuecomment-1486575922

Attention point 3

I hadn't yet made the graphs to clarify, as promised in https://github.com/ssbc/ssb2-discussion-forum/issues/16#issuecomment-1480366242. I will do that now and show how it generalizes to "everything is a tangle".

Replication of feeds based on anchors

As a reminder, we were at this point in our discussion: [image]

This is about the same, except that I think the same keys that sign regular messages should not be stored on several devices, and that anchors are the opportunity for key rotation and snapshot CRDTs (or ssb-archive).

After the lunch break now. Put on your seat belts, because I'm gonna mix everything together.

When maintaining a single-author tangle (a "feed"), we need sliced replication (#18); see #19 for an example. One simple system is to have some "anchors", which are the black nodes below.


---
title: Regularly space "anchors"
---
graph RL
J-->I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A

A:::rib
E:::rib
I:::rib

classDef default fill:#6df,stroke:#fff0,color:#000
classDef rib fill:#000,stroke:#fff0,color:#fff


My legend for subsequent graphs.

Note: 
* CID (blue) is not a message but the Context ID (generated by 256-bit hashing of the password-based Master Key and the Social Context name)
* Anchor (black) is a message that is signed initially by the Master Key and afterwards by the previous anchor key; it contains the new key and the context ID (dark link), and there are backlinks to the previous anchor (dotted) and to the previous message(s) (solid).

```mermaid
graph RL

Ab --> y --previous--> x --> Aa
Ab -.anchor.-> Aa ==context==> CID
Ab == context ==> CID

CID:::context
Aa:::anchor
Ab:::anchor
classDef anchor fill:#000,stroke:#fff0,color:#fff
classDef context fill:blue,stroke:#fff0,color:#fff
classDef master fill:red,stroke:#fff0,color:#fff
```

Showing a Device feed that has forked:


graph RL

subgraph MK [Master Key]
    MK2 --------> MK1
end

subgraph SC1 [Social Context 1]
    subgraph CID1 [ ]
        direction BT
        CTXID1 ~~~ MK1        

    end

    subgraph D1 [Device Feed 1]
        G --> L --> D1K3 --> D & F --> C --> D1K2 --> B --> A --> D1K1 --> MK1
        D1K3 -.-> D1K2 -.-> D1K1
        D1K1 ==> CTXID1
        D1K2 ==> CTXID1
        D1K3 ==> CTXID1
    end

    subgraph D2 [Device Feed 2]
        D2K3 --> K --> J --> D2K2 --> I --> H --> D2K1 --> MK1
        D2K3 -.-> D2K2 -.-> D2K1
        D2K1 ==> CTXID1
        D2K2 ==> CTXID1
        D2K3 ==> CTXID1
    end
end

D1K3:::anchor
D1K2:::anchor
D1K1:::anchor
D2K3:::anchor
D2K2:::anchor
D2K1:::anchor

CTXID1:::context

MK1:::master
MK2:::master

classDef anchor fill:#000,stroke:#fff0,color:#fff
classDef context fill:blue,stroke:#fff0,color:#fff
classDef master fill:red,stroke:#fff0,color:#fff
classDef hide fill:#0000,stroke:#0000,stroke-width:0px,color:#0000;stroke-opacity:0

Sliced replication using the algorithm proposed earlier

During the initial exchange, each peer sends the list of CTXIDs it is replicating (possibly using a BF) and a timestamp in the past they want to replicate from. The common timestamp is the larger of the two wishes (max).

Then each peer computes the intersection of CTXIDs for which they have a common interest, and can determine the oldest anchor before the requested timestamp.

So let's say that Alice has the feed illustrated before. She receives from Bob the corresponding CTXIDs in his "want list", and the max timestamp falls between the timestamps of D and F on Device Feed D1 and between J and K on Device Feed D2. The oldest anchors before the timestamp are D1K2 and D2K2.

Alice can compute a Bloom Filter (or a Bloom Clock) containing all anchors since the root of the feeds and all messages she knows after the anchors D1K2 and D2K2: so D1K1, D1K2, C, F, D1K3, L, G, D2K1, D2K2, J, K, D2K3, plus all messages of the Master Key feed (remember that this is the feed that can "kill" a feed when the feed key is lost, so it is important to ensure that it is always replicated fully as soon as possible; but it does not contain a lot of messages, just long-term key rotations and kills).

Bob will do exactly the same.

At this point, they both have enough information to determine which messages they have that the other wishes, and send them to each other.
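A rough sketch of this negotiation, with plain Sets standing in for the Bloom filter / Bloom clock (the BF compresses the same set logic; all names here are mine):

```typescript
type MsgId = string;

// Phase 1: each peer announces the CTXIDs it replicates and a "from"
// timestamp; the common timestamp is the max of the two wishes.
function commonInterest(
  mine: { ctxids: Set<string>; fromTs: number },
  theirs: { ctxids: Set<string>; fromTs: number },
): { ctxids: Set<string>; fromTs: number } {
  const shared = new Set([...mine.ctxids].filter((c) => theirs.ctxids.has(c)));
  return { ctxids: shared, fromTs: Math.max(mine.fromTs, theirs.fromTs) };
}

// Phase 2: after exchanging "have" summaries, each peer sends whatever
// the other side is missing.
function toSend(myHaves: Set<MsgId>, theirHaves: Set<MsgId>): MsgId[] {
  return [...myHaves].filter((id) => !theirHaves.has(id));
}
```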

Generalization to tangles

Let's say this is what Alice has in her database for User A and User B.


graph RL

subgraph MKA [Master Key - USER A]
    MKA2 --------> MKA1
end

subgraph SCA1 [Social Context 1 for A]
    subgraph CID1 [ ]
        direction TB
        CTXID1 ~~~ MKA1        

    end

    subgraph D1 [Device Feed 1]
        G --> L --> D1K3 --> D & F --> C --> D1K2 --> B --> A --> D1K1 --> MKA1
        D1K3 -.-> D1K2 -.-> D1K1
        D1K1 ==> CTXID1
        D1K2 ==> CTXID1
        D1K3 ==> CTXID1
    end

end

subgraph MKB [Master Key - USER B]
    MKB2 --------> MKB1
end

subgraph SCB1 [Social Context 1 for B]
    subgraph CID2 [ ]
        direction TB
        CTXID2 ~~~ MKB1        

    end

    subgraph D2 [Device Feed 2]
        D2K3 --> K --> J --> D2K2 --> I --> H --> D2K1 --> MKB1
        D2K3 -.-> D2K2 -.-> D2K1
        D2K1 ==> CTXID2
        D2K2 ==> CTXID2
        D2K3 ==> CTXID2
    end
end

K --reply--> D

D1K3:::anchor
D1K2:::anchor
D1K1:::anchor
D2K3:::anchor
D2K2:::anchor
D2K1:::anchor

CTXID1:::context
CTXID2:::context

MKA1:::master
MKA2:::master

MKB1:::master
MKB2:::master

classDef anchor fill:#000,stroke:#fff0,color:#fff
classDef context fill:blue,stroke:#fff0,color:#fff
classDef master fill:red,stroke:#fff0,color:#fff
classDef hide fill:#0000,stroke:#0000,stroke-width:0px,color:#0000;stroke-opacity:0

Scenario 1:

Bob tells Alice he wishes to replicate User A's feeds (so CTXID1), and the timestamp is between C and D1K2.

Following the logic explained before, Alice would send a BF containing D1K1, D1K2, C, D, F, D1K3, L, G, MKA1 and MKA2. But Alice knows D received a reply from user B after the timestamp of D, so after the anchor D1K2 (because of the rule that the timestamp of a message must be strictly larger than the timestamps of the backlinked messages). So Alice will actually also include the messages K, J, D2K2, D1K2, MKB1, MKB2 in the BF.

Even if K is not part of the feeds of the common Context ID set negotiated in the first phase, it is logically included in the common set of messages to sync.

Scenario 2:

Bob tells Alice he wishes to replicate User B's feeds (so CTXID2), and the timestamp is between J and K.

The common set of messages to sync is MKB1, MKB2, D2K1, D2K2, J, K, D2K3 + D, C, D1K2, D1K1, MKA1, MKA2.

ahdinosaur commented 1 year ago

@staltz wow great work on the new tangle-centered design (including lipmaa-style links :exploding_head: ), i appreciate your diagrams and explanations, this is very exciting. :purple_heart:

staltz commented 1 year ago

@gpicron Those are very useful diagrams and descriptions, thanks. But could you move all of that to #16 which is where we discussed most of your "Identity Tree" idea? (I don't know what to call it, so I'm calling it this for now)

I'm exploring two different approaches, and I'm still not sure which one to take: Identity Trees (also known as "Option A") or Tangles-everywhere-with-a-single-ID (Option B). This issue is about exploring depth-first whether Option B would work, and how it would work, so reserve your Option A proposals for another issue to keep this conversation focused.

staltz commented 1 year ago

In the first message, you forgot one key point: Timestamps.

Yes, timestamps are unreliable for expressing verifiable causality as such, but... there are two verification rules that are important to keep in mind and can be very useful:

  1. the timestamp of a message MUST be strictly larger than that of any message referred to by its backlinks (previous, tangles, etc.) for the message to be valid

  2. the timestamp MUST be smaller than the receiver's own clock (during replication, the receiver of a message with a timestamp larger than its own local clock should at least delay it until its own local clock reaches the timestamp)

I'm afraid this wouldn't work at all in a network where at least one peer has a completely broken timestamp (assume someone set up their computer's clock incorrectly, as a mistake or intentionally). Then all the other peers would have to wait to reach that broken timestamp. On SSB, I have actually seen timestamps years into the future, so waiting for timestamps to align would be disastrous. It would be a vulnerability.

I think numerical timestamps should not be part of the distributed system's consensus-building mechanism; they should just be taken at face value. Or do what we do in SSB today: take the minimum of the declared-at-publish-time timestamp and the received-and-persisted-to-disk timestamp, and use that for display in the user interface.
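For reference, that display-time rule is just a clamp (a sketch, not the exact code SSB uses):

```typescript
// Never let a declared timestamp place a msg in the future: clamp it with
// the time we received and persisted the msg ourselves.
function displayTimestamp(declaredTs: number, receivedTs: number): number {
  return Math.min(declaredTs, receivedTs);
}
```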

staltz commented 1 year ago

Here's the tangle example with lipmaa links using depth instead of preceding count. It's not as aesthetically pleasing as before, because the example isn't hand-crafted to look good, but that's a good reality check anyway:

graph RL
I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A
J-->I & H
K-->J
L-->J
M-->J
N-->K & L & M
O-->N
P-->N

F & G-.->A
K & L & M-.->F
K & L & M-.->G
linkStyle 20,21,22,23,24,25,26,27 stroke:#f00,stroke-width:2;
%% linkStyle 0,1,2,3,4,5,6,7,8,9,13 stroke:#0002,stroke-width:3;

A[A 0]
B[B 1]
C[C 1]
D[D 2]
E[E 3]
F[F 4]
G[G 4]
H[H 5]
I[I 6]
J[J 7]
K[K 8]
L[L 8]
M[M 8]
N[N 9]
O[O 10]
P[P 10]

classDef default fill:#6df,stroke:#fff0,color:#000
classDef defaultw fill:#6df2,stroke:#fff0,color:#0002
classDef rib fill:#000,stroke:#fff0,color:#fff

And here is how the tangle looks if I want to maintain only the range [I, ∞]. (And I think the choice of I should be such that, for its depth d, I is the only msg that has depth d. If there are many msgs with depth d, choose a different d.)

graph RL
I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A
J-->I & H
K-->J
L-->J
M-->J
N-->K & L & M
O-->N
P-->N

F & G-.->A
K & L & M-.->F
K & L & M-.->G
linkStyle 21,25,26,27 stroke:#f00,stroke-width:2;
linkStyle 20,22,23,24 stroke:#f002,stroke-width:2;
linkStyle 1,3,4,5,6,7,8,9 stroke:#0002,stroke-width:3;

A[A 0]:::rib
B[B 1]:::defaultw
C[C 1]:::defaultw
D[D 2]:::defaultw
E[E 3]:::defaultw
F[F 4]:::defaultw
G[G 4]:::rib
H[H 5]:::rib
I[I 6]
J[J 7]
K[K 8]
L[L 8]
M[M 8]
N[N 9]
O[O 10]
P[P 10]

classDef default fill:#6df,stroke:#fff0,color:#000
classDef defaultw fill:#6df2,stroke:#fff0,color:#0002
classDef rib fill:#000,stroke:#fff0,color:#fff

AljoschaMeyer commented 1 year ago

a minimal set of messages

I don't think storing a single shortest path, i.e., a minimal set of messages, suffices. Consider the following counterexample:

[counterexample image]

The union of the certificate pool of vertex 2 (green) and the orange path from vertex 8 to 0 does not suffice to prove that 2 came before 8.

ahdinosaur commented 1 year ago

what if we do more with the tangle's "merge nodes"?

  • merge nodes are similar to anchors, except merge nodes happen naturally rather than anchors which were scheduled

  • you can only replicate a range starting or ending on merge nodes (or beyond to the frontier)

  • to verify causality of two messages in between a merge node, you must replicate from each message to both sides of merge nodes

  • only merge nodes participate in the skip links (lipmaa cert pool) game

but that last bit only makes sense if a merge node knows it's a merge node... hmm.

AljoschaMeyer commented 1 year ago

The easiest solution is to use all paths to zero that consist of skiplinks only.

staltz commented 1 year ago

@ahdinosaur You're onto something, and I think the replication algorithms (AND the deletion algorithms, see #19) depend on the existence of these stable merge nodes. But it seems really hard to come up with a rule for (deterministically) choosing merge nodes. Seems like there can always be that sudden "appendage" happening, as a fork right before the merge node.

I think this is the most important issue to address before exploring these "Lipmaa Tangles" in more detail.

gpicron commented 1 year ago

what if we do more with the tangle's "merge nodes"?

  • merge nodes are similar to anchors, except merge nodes happen naturally rather than anchors which were scheduled

  • you can only replicate a range starting or ending on merge nodes (or beyond to the frontier)

  • to verify causality of two messages in between a merge node, you must replicate from each message to both sides of merge nodes

  • only merge nodes participate in the skip links (lipmaa cert pool) game

but that last bit only makes sense if a merge node knows it's a merge node... hmm.

I think that the main reason to make a distinction between anchors and normal messages that are "merge nodes" is that anchors are always very small and cheap to replicate and store.

AljoschaMeyer commented 1 year ago

My gut feeling is that anything that depends on single merge nodes is doomed from the start. Chances are you have to define everything in terms of antichains (sets of messages such that no two messages in the set are causally related), which are the natural generalization of individual "merge" or "anchor" nodes.

The scalability of everything depends on the largest antichain in a tangle - for example the number of all skiplink-only paths from some node to zero is upper-bounded by the size of the largest antichain (the width) of the tangle.

If you enforce that all messages in a tangle by a single author must be causally related, then the number of authors in the tangle gives an upper bound on its width.

If you assume that tangles have small, bounded width - say at most k - data replication becomes fairly simple: just use a logical vector clock of k entries. If you have a bounded number of authors, that clock is simply a list of pairs of author ids and sequence numbers. The size of this is always linear in k. And that doesn't change if you throw bloom filters at the problem (as far as I understand things, @gpicron's approach is to compress these clocks with bloom filters, and the Kleppmann approach also boils down to compressing some vector of at most k messages).

If you can reasonably assume k to be small, say if the tangle corresponds to an identity made up of at most a few dozen devices (and that's already a very generous estimate) and each device produces a linear log, just choose one of those approaches. If you cannot bound k, then the width of a tangle is linear in the size of the tangle in the worst case (a bunch of concurrent replies to a single message, for example). Since there is no structure to leverage in an antichain, you cannot do better than using a general-purpose set reconciliation protocol for syncing such tangles.
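A sketch of such a bounded-width clock (my illustration of the idea, not gpicron's or Kleppmann's exact encoding):

```typescript
type AuthorId = string;

// Logical vector clock for a tangle of width at most k: one entry per
// author/device, assuming each author's msgs form a linear log.
type Frontier = Map<AuthorId, number>; // author -> highest seq I hold

// Everything I hold beyond the other peer's frontier; the clock itself
// stays linear in k, regardless of how many msgs the tangle contains.
function missing(mine: Frontier, theirs: Frontier): Array<[AuthorId, number]> {
  const out: Array<[AuthorId, number]> = [];
  for (const [author, seq] of mine) {
    for (let s = (theirs.get(author) ?? 0) + 1; s <= seq; s++) {
      out.push([author, s]);
    }
  }
  return out;
}
```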

Essentially, for tangles of bounded width, the problem is fairly boring. For tangles of unbounded width, the problem is also fairly boring. Much more interesting is the question of whether assuming small width and accepting degraded performance when that assumption does not hold is worth the efficiency gains in the happy case over a set reconciliation approach.

gpicron commented 1 year ago

I'm afraid this wouldn't work at all in a network where at least one peer has a completely broken timestamp (assume someone set up their computer's clock incorrectly, as a mistake or intentionally). Then all the other peers would have to wait to reach that broken timestamp. On SSB, I have actually seen timestamps years into the future, so waiting for timestamps to align would be disastrous. It would be a vulnerability.

Ok, let's first replace the MUST with SHOULD in rule 2.

The first rule just enforces consistency between backlinks and timestamps.

I didn't make it precise, but the receiver's local clock is not the host RTC clock but a local measure of the time elapsed since the "oldest" validated message received. Actually, for me there is no RTC; I think SSB should be usable even on a microcontroller without an RTC.

The reason to have rule 2 is to protect against the case where someone sends a message with a timestamp 2 years in the future, because that is the vulnerability. Bug or intentional, today this is accepted and it is confusing.

It's then probably better to drop the timestamp field completely. It's a waste of bytes. No timestamp is probably better than an inconsistent one when you reason in terms of "everything is a tangle".

But I wonder how you will distinguish recent messages that you wish to replicate from old ones that can be forgotten, without at least a minimally trustable notion of the elapsed time between 2 messages in a tangle.

staltz commented 1 year ago

I think @AljoschaMeyer is right about merge nodes.

I wanted to get an intuition for how these algorithms are going to work in more realistic scenarios than the paper examples we're making, so I built a simple simulator that spits out these mermaid diagrams of tangles. Here's the repo: https://github.com/staltz/tanglesim

And here is an example simulation run: https://github.com/staltz/tanglesim/blob/master/output.md

I think we don't even need merge nodes! Supposing we have a deterministic tie-breaking rule when topologically sorting the tangle, we can pick a point in that topo-sorted list, delete everything before it except the certpool, and then repeat this process in the future.
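A sketch of such a deterministic topological sort (my illustration; tie-breaking by the msgs' hashes so that every peer holding the same msgs derives the same order):

```typescript
type MsgId = string;
type Tangle = Map<MsgId, MsgId[]>; // msg hash -> its backlinks

// Kahn-style topological sort: repeatedly emit the lexicographically
// smallest msg whose backlinks have all been emitted already.
function topoSortDeterministic(tangle: Tangle): MsgId[] {
  const emitted = new Set<MsgId>();
  const out: MsgId[] = [];
  while (out.length < tangle.size) {
    const ready = [...tangle.keys()]
      .filter((id) => !emitted.has(id))
      .filter((id) => (tangle.get(id) ?? []).every((p) => emitted.has(p)))
      .sort(); // deterministic tie-break by msg hash
    if (ready.length === 0) break; // remaining msgs point at missing/deleted msgs
    emitted.add(ready[0]);
    out.push(ready[0]);
  }
  return out;
}
```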

Newly discovered messages that point to some non-certpool deleted message (the faded out msgs) should be ignored (i.e. considered invalid) during validation (#21).

staltz commented 1 year ago

I also ran these simulations at large (and SSB-realistic) scale: 8000 messages, for instance.

It's really promising that the size of the cert pool is still quite small. It's not very deterministically countable, because it depends on the shape of the tangle, but it's small. Some example numbers:

Another run:

gpicron commented 1 year ago

Newly discovered messages that point to some non-certpool deleted message (the faded out msgs) should be ignored (i.e. considered invalid) during validation (#21).

Why? Maybe they were just locked in an "island", missing a connection to the rest of the world, or only having the opportunity to connect once every x...

How do you define old messages?

gpicron commented 1 year ago

Another question: if I understand correctly, you make the assumption that you can have an algorithm that makes everyone converge towards the same certpool, so that a newly arriving user will only replicate messages back to some point, and then the nodes of the cert pool. Correct?

What about "blocking" ? What if I am actually blocking one of the author in this shortest path ? And taking into account that at any point in time I can decide to block or unblock an author.

AljoschaMeyer commented 1 year ago

Re blocking: you'd have to define blocking as "not requesting/storing any content", whereas the metadata (signature, backlinks, a hash of the content) would be fair game.

staltz commented 1 year ago

How do you define old messages?

@gpicron Old = "has not synchronized since the last ~300 recently published messages". The number "300" is approximate; it's based on an estimated max of 200kB reserved for each person within the 64MB database limit, and assuming 1 msg is approximately 660 bytes encoded in JSON (200kB ÷ 660 B ≈ 300 msgs). If the single-person multi-device tangle author publishes very frequently, then "old" can be as little as 1 week. But if they publish just once a week, then "old" can be many months or years.

Maybe they were just locked in an "island", missing a connection to the rest of the world, or only having the opportunity to connect once every x...

@gpicron Suppose there is a single-person multi-device tangle. If that person published ~300 messages on their desktop, but their phone has NOT synchronized those 300 recent messages, then any message that the phone wants to synchronize to the desktop (direction of replication is phone=>desktop) will be ignored by the desktop, because such a message would not backlink to any of the ~300 recent messages that the desktop recognizes in its tangle.

So this system assumes that devices can replicate frequently. In the case a device ends up being too far left behind, then this device just needs to do a "reset" by deleting their local tangle and getting everything from the remote tangle.

Important note: each single-person multi-device tangle has a single msg type! So there is one tangle for posts, another tangle for reactions, another for follows, etc. Only some of these tangles would employ sliced deletes, such as reactions and posts. The follow tangle would be important to preserve in full, and we could employ other delete techniques such as PREDSL, which doesn't drop data from the follow data structure.

What about "blocking" ? What if I am actually blocking one of the author in this shortest path ? And taking into account that at any point in time I can decide to block or unblock an author.

This use case would basically not happen.

And even if we find a use case where blocks would touch the tangle structure, what @AljoschaMeyer said is spot on: if there is a certpool msg from a blocked person, we don't delete that msg, we just delete the contents of the msg. But that's unrelated to blocking, because all certpool msgs have deleted content anyway, regardless of blocking or not.

gpicron commented 1 year ago

thanks for the clarifications.