Bikeshedding my DAG feed format

staltz commented 1 year ago

So I'm pursuing Option B (all devices share the same keypair and publish to a DAG feed) and I'm now designing the feed format.

What I'm presenting here is my best design so far. I like it, but it's also open to bikeshedding because nothing is set in stone yet. It's similar to minibutt, but there are important differences.

JSON encoding

I think JSON is a good compromise between readability and parsing/serializing performance. You could get faster than JSON with a binary encoding, but it hurts readability/debuggability, and actually it is pretty hard to do better than JSON in all metrics. JSON parsing in JS engines is actually very fast. JSON is good enough! And ubiquitous.

Shape

It looks like this (example):

{
  "content": {
    "text": "3rd post forked from 1st"
  },
  "metadata": {
    "proof": "EHsKZkJ5p9kehQR5AEwRQA",
    "size": 35,
    "tangles": {
      "PpkBfa8C4sB8wHrqiNmHqe": {
        "depth": 2,
        "prev": [
          "LfNVJZdEBquVyJg8DQ9mgX"
        ]
      }
    },
    "type": "post",
    "who": "4mjQ5aJu378cEu6TksRG3uXAiKFiwGjYQtWAjfVjDAJW",
    "when": 1681031939360
  },
  "sig": "5CEyFspFuRtJMMkBAjwNkzNwYjV2xDekawazTjufxwcFc3ufhyxgNexxiPCV3cs4Yid3Smrau8ciEQDjBSP8v3Y9"
}

Explaining it top-down:

{
  content, // deletable, while you can still keep metadata around and replicate a contentless msg to others
  metadata,
  sig, // (base58) signature of the metadata JSON in a strictly fixed serialization order (alphabetically ordered fields)
}

The metadata has:

proof: hash of the content in base58
size: byte count of the content JSON serialized
tangles: (more about this soon, but for now, "depth" is like "sequence" and "prev" is like SSB's "previous")
type: message type goes in metadata instead of msg.content.type
who: author public key in base58, this is like SSB's author
when: timestamp of msg creation time

Tangle-native

You probably noticed msg.metadata.tangles[someId].

The idea with this feed format is that you always replicate messages as part of some "tangle". A message in a feed is always part of the feed's own natively supported tangle. I.e. a feed is a DAG (i.e. a tangle), so every message in the feed will be part of this "feed tangle". Notice that the identifier in this case for the "feed's native tangle" is PpkBfa8C4sB8wHrqiNmHqe. This is the msg hash of the 1st message in the feed.

The 1st message in the feed is special, it's like this:

{
  "content": null,
  "metadata": {
    "proof": "",
    "size": 0,
    "tangles": {},
    "type": "post",
    "who": "4mjQ5aJu378cEu6TksRG3uXAiKFiwGjYQtWAjfVjDAJW",
    "when": 0
  },
  "sig": "4if5hNP3CgjdPWyyniKoeukdVoZVxUAi8ZcJ9dotYDrKTZZ1ZgDVgTBeBZ6v2wj5CMGr6mJSvorMH4ggLJvUVTie"
}

Any device that has the keypair can deterministically recreate (thus publish) this first message. This is important because every tangle must have only one root, and we can't afford waiting for devices to synchronize so that the new device downloads the 1st message from the old device. So instead we have a convention where the 1st message in a feed is a dummy message, this way all devices can start from the same root message.

So PpkBfa8C4sB8wHrqiNmHqe is the stable msg hash for this dummy message. Of course, with a different author pubkey and a different type, you will get a different dummy msg and a different hash.

In PPPPP, when you replicate a message from a tangle, you have to also inform what root msg hash are you validating against. The most common root msg hash will be the first feed's message, such as PpkBfa8C4sB8wHrqiNmHqe, but it could be something else as well!

Say you are replicating a discussion thread. Then the root msg hash is not the dummy message, it is the post that started the thread. And the replies can come from different authors. So the tangles field will look like:

"tangles": {
  "PpkBfa8C4sB8wHrqiNmHqe": { // <-- feed 1st msg (dummy)
    "depth": 2,
    "prev": [
      "LfNVJZdEBquVyJg8DQ9mgX"
    ]
  },
  "RG3uXAiKFiwGjYQs6s4Adr": { // <-- thread root
    "depth": 4,
    "prev": [
      "otYDrKTZZ1ZgDVgTBeBZ6v", 
      "5jdPWyyniKoeukdVoZVxUA", 
    ]
},

So if you are replicating the whole feed, you will request for the tangle PpkBfa8C4sB8wHrqiNmHqe and you will validate the incoming messages according to msg.metadata.tangles['PpkBfa8C4sB8wHrqiNmHqe'], BUT if you are replicating just the discussion thread, you will request for the tangle RG3uXAiKFiwGjYQs6s4Adr and you will validate the incoming messages according to msg.metadata.tangles['RG3uXAiKFiwGjYQs6s4Adr']. This is a neat evolution from SSB's out-of-order replication, which just ignores backlinks entirely. Here, we have the option of validating different kinds of backlinks. This is also why I don't see a need for OOO in PPPPP.

These hashes are also called the "tangle ID". Each tangle is identified by its unique root. So you replicate tangles not feeds. It just happens to be that some tangles represent feeds.

So depth refers to the size of the longest path from this msg to the tangle's root, and prev refers to older msgs in the tangle, either immediate predecessors, or tangle tips, or lipmaa backlinks, or all of these together.

Open problems

Since you will replicate a tangle based on its tangle ID, then we have a new kind of bootstrapping problem: you can't tell a remote peer that you want to replicate a feed based on the author's pubkey. Instead, you have to know the msg hash of the tangle feed's root. That's one unfortunate indirection. But one naive idea to solve this is that the first time you replicate a feed (that is, you don't have any msgs from that feed yet), you ask the remote peer "for this given author pubkey and this type, tell me what the tangle ID is for that tangle feed". In case you already have some messages for this feed, then you know the tangle ID.

Thoughts? Feedback? This is the best moment to change my mind, otherwise I'm going to invest in this direction heavily.

cc @arj03 @ahdinosaur

arj03 commented 1 year ago

I like this design. Might have a few more things to say later, but wanted to start with two things:

The feed versus tangle root id might not be such a big problem. Because everything is tangles, then it's much more unique to reference things by the root then it is by feed anyway.

One question, how do you envision handling private messages? You have tangle and type info in the meta.

staltz commented 1 year ago

That's a very good question, I had to think about it for a while.

`type`

The case for private type is easy. You don't have to have it, you can leave it empty, which means you're using the "top-level" feed, which can fit messages of any type (similar to main in SSB today), and then you can set msg.content.type anyway since content is free-form.

Or, depending on the case, you can set the msg.metadata.type and leave it unencrypted, but it might not be a problem, depending on the application. For instance, say you have msg.metadata.type = 'chess' for all the messages for your chess game. Some of the messages in that feed can be encrypted (e.g. chess chat), others can be public.

`tangles`

This one is harder to address. But I think I found a rule of thumb.

If the tangle is a feed, then you can just leave the tangle metadata public, and it doesn't actually reveal that much. It's the same as having sequence and previous (always public) in encrypted SSB messages.

If the case is multi-author private messages, though, you probably don't want to tangle them together. The idea is that a tangle is a way of grouping messages and referring to that as a replication target. So you should only have tangles for publicly grouping messages together and requesting that group. Obviously, as soon as you publicly group messages, you're revealing that they belong to the same group. So if you don't want to reveal that metadata, don't use a tangle. This also means that you won't be able to replicate them as a group.

Unless we figure out another algorithm / scheme that solves that...

ahdinosaur commented 1 year ago

i like this, well-done @staltz ! :purple_heart:

regarding the special first message, reminds me of the special first message i discovered in the ssb0 feed i found:

{
  "previous": null,
  "author": "D28WPyLA35x/B4oFyYtemjMhtpO2a74yjsCZggpzXFs=.blake2s",
  "sequence": 1,
  "timestamp": 1418085913971,
  "hash": "blake2s",
  "content": {
    "type": "init",
    "public": "+9GqatIWfb3hmTKl40Xr2R69Su4PHasCMsLxW+Mmdr+iiBHxzZzafPR74aEzFBubR6rtD5KLhKGkwtg7NmDs7A==.k256"
  },
  "signature": "YbWeXHoLSDrv3u/hL0U8z+uzrVisNUsxp7hePgZkmeEQEn1s6g755Mxlty08Qh5X44Wu7UlNA8n3JnFYke5c0Q==.blake2s.k256"
}

where this special first message is used to declare the public key used by the feed, which will be referenced by hash (of the public key, not of the first message) in all feed messages.

so if we need this indirection already, we could add a layer of indirection to the public key references? so if you see someone's message, you don't automatically know their public key?

but i'm not sure that's worthwhile, seems unnecessarily complicated. in Scuttlebutt we already assume that if you have someone's message at all, you have someone's public key, and therefore the "read capability" of their feed.

another thought, do we not want to separate the different "prev" links: tangle tips and lipmaa backlinks? i like the simplicity of them combined, and even knowing how to compute tangle tips will require a specification for implementers, but i wonder if makes things less obvious when you look at a message what is happening. doesn't say why a reference is there, them being combined doesn't help if you want to perform an algorithm that involves traversing only tangle tips or lipmaa backlinks.

yet another thought, do we want to encode the hashing algorithm anywhere? ssb1 included the hashing algorithm as a field in the feed, plus at the end of hash references, in hopes of helping upgrade-ability, but i'm not sure it actually achieved that.

at the moment, i assume the design decision is to not include the hash anywhere, because it's fixed, and if we want to upgrade the hashing algorithm we should just upgrade the entire feed format. and i think this is a fine design decision, just mentioning as a potential thing to consider.

then for more bike-shedd-y thoughts:

why proof instead of hash?
why is the proof of the initial message an empty "" and not null?

staltz commented 1 year ago

so if we need this indirection already, we could add a layer of indirection to the public key references? so if you see someone's message, you don't automatically know their public key?

@ahdinosaur Hmm! This is a possibility that I haven't seen before. I don't know if it's worth doing it exactly as you said, because feed msgs are also going to be replicated "out-of-order" (that's not the correct term anymore, so I'll call it "out-of-feed" OOF instead) e.g. when you replicate a thread and you want all of the replies regardless of hops distance. In the case of OOF, you need to know the author ID so you can then fetch only the author's profile details (about) and get their name and picture when they're replying.

But it's good food for thought, we could put something special in the init message.

another thought, do we not want to separate the different "prev" links: tangle tips and lipmaa backlinks?

This thought crossed my mind too, but without dwelling on it too much I chose to put them together to save some bytes. I guess that the structure of backlinks shouldn't be very "semantic", it's not something that you get to customize. The backlinks are there to serve just to create causal order, and lipmaa backlinks are just redundant for causal order, but they make deletes easier. What backlinks you should use in what situations should be algorithmically deterministic, and boring.

In fact, it's possible to have a normal backlink (not a lipmaa one) that also makes a big leap back into the past, here's an example. Suppose you have a tiny dangling branch B (created by using another device for the first few hours of your account, and then forgetting about it for months). When you finally sync that device branch with other device branches, the next message J is going to "merge" the dangling branch, and this will provide a big leap back into the past.

graph RL
E-->A
I-->E
J-->I-->H-->G-->F-->E-->D-->C-->A
B-->A
J-->B
%% linkStyle 6,7,8,9,10 stroke:#0002,stroke-width:2;
linkStyle 0,1 stroke:#f00,stroke-width:3;

classDef default fill:#6df,stroke:#fff0,color:#000
classDef defaultw fill:#6df2,stroke:#fff0,color:#0003
classDef rib fill:#000,stroke:#fff0,color:#fff
classDef ribw fill:#0002,stroke:#fff0,color:#fff

When deleting past messages, we don't give any preference to lipmaa links over conventional links, we just look for the shortest path to the root. If we need to delete all messages before J, then we're going to keep J-->B-->A since it's shorter than J-->I-->E-->A (which contains two lipmaa leaps).

So in my view, these backlinks should have the same priority, and lipmaa links are just "sprinkled redundancy" in case we don't naturally get these big leap "normal" backlinks.

yet another thought, do we want to encode the hashing algorithm anywhere?

In my code, I'm calling this feed format by the name "FeedV1", so it's just about versioning. Actually, the most likely thing I could do is add a field msg.metadata.v = 1. Versioning plus a specification gives all the field semantics we need, and we don't need self-describing fields. As long as the fields are inside the message, and the message is clear that it's a PPPPP v1 feed, then the semantics should be obvious. However, if you take one of those fields out of the message (e.g. who author ID) then we need self-describing identifiers like a URI that describes what this is, since it was removed out of its context.

So the specification will determine what hashing algorithm is used, what the who keypair identity is using, and so forth, and if we need upgradability, make a new feed format version.

why proof instead of hash?

Good point! Maybe I should.

why is the proof of the initial message an empty "" and not null?

I didn't give much thought to it when I did it, I guess I could make it null.

ahdinosaur commented 1 year ago

This is a possibility that I haven't seen before.

for context, Bitcoin (or Ethereum, etc) addresses are hashes of the public keys:

https://en.bitcoin.it/wiki/Technical_background_of_version_1_Bitcoin_addresses

PubKeyToAddr

Repp12 commented 1 year ago

If timestamps aren't reliable and if I understood it correctly, are mainly for UI purposes, then to improve privacy, why not put the "when" inside content instead of metadata where it can be deleted if no longer useful.

staltz commented 1 year ago

If timestamps aren't reliable and if I understood it correctly, are mainly for UI purposes, then to improve privacy, why not put the "when" inside content instead of metadata where it can be deleted if no longer useful.

Great suggestion!!

quickdudley commented 1 year ago

then we have a new kind of bootstrapping problem: you can't tell a remote peer that you want to replicate a feed based on the author's pubkey

If we omit the signature when hashing the root message of a feed then this problem is solved, and also we no longer need to explicitly store or transmit the root messages.

I'm not certain but I think it might be safe to always omit the signature when computing message IDs.

staltz commented 1 year ago

@quickdudley Great idea! I implemented that too.

staltz commented 1 year ago

Here's how the latest version of the feed format msg looks like:

{
  "content": {
    "text": "Hello world!",
    "when": 1681842582086 // optional
  },
  "metadata": {
    // hashes the `content` (encoded as alphabetically-sorted JSON with no spaces nor newlines)
    "hash": "9R7XmBhHF5ooPg34j9TQcz", 
    "size": 23, // size of the `content` (same encoding as above)
    "tangles": {
      // this is the feed's "root msg" ID, but there may be more tangles
      "3F26EgnwbMHm1EEeeVM1Eb": {
        "depth": 1,
        "prev": [
          "3F26EgnwbMHm1EEeeVM1Eb"
        ]
      }
    },
    "type": "post",
    "v": 1, // feed format version, so far hard-coded at 1
    "who": "4mjQ5aJu378cEu6TksRG3uXAiKFiwGjYQtWAjfVjDAJW"
  },
  // Signs the `metadata` (encoded as alphabetically-sorted JSON with no spaces nor newlines)
  "sig": "5abJdD6RRCsWXKJLaEKRhUb1HKh4aKPFteFRgUBfyJD4cFzo5MVaMdWbwM2CfpNRFSjR9NkczRL2LcSyQVThYnRr"
}

arj03 commented 1 year ago

Similar to when you might do the same for type. This could be useful for encrypted content

Edit: as you said above, when I reread the thread.

The important thing is what to do about tangles. This is the primary replication mechanism, but leaks a lot of information. The question is if one can do sympathy replication (within the party you can move that part ot content as well), because those nodes would then know quite a lot of metadata information. I guess it depends on what your target is.

staltz commented 1 year ago

@arj03 I admit I haven't given much thought to private messages, even though I do want to support them.

However, here's more details that may help inform what would private message threads look like:

There seems to be a fundamental limitation with tangle sync, which is: it groups a Set of msgs from different authors as one "replication target", and as such it inevitably links those msgs together, logically. An eavesdropper (or a sympathetic replicator) can learn that these msgs (and thus, their authors) are engaged in something private. I think this is true regardless of the details we have for metadata.tangles in this feed format. Maybe Aljoscha's Set Reconciliation would have the same issues.

One option is that an encrypted discussion thread could choose to not tangle msgs together with metadata.tangles, because there is no need to replicate that discussion thread "out-of-feed" OOF (this is the new "OOO"). You know all the participants of the discussion, so you replicate each participant's full feed, and some of those msgs will happen to be encrypted. Sort of like how classic SSB is today. With the benefit of sliced replication, of course.

However, an encrypted discussion thread with metadata.tangles would be very useful, because then it enables replicating one of the encrypted participants in OOF mode. Like suppose Alice is friends with Bob, but Alice wants to introduce Bob to Carol, but Bob and Carol do not want to follow each other yet. So Bob wants to replicate the encrypted discussion thread such that Carol's msgs are replicated OOF, so that Bob doesn't have to commit to being interested in all sorts of other content from Carol. That would be useful. But it leaks metadata.

Unless we find some smart solution, I think that's the wall we're hitting, and it's about a fundamental tradeoff: (A) either you replicate a participant's feed and leak no tangle metadata, (B) or then you replicate only the private discussion msgs but you leak tangle metadata.

(A) hurts storage, because you're asking for more data than you want.
(B) hurts privacy, because you're leaking who you're talking to.

I think (A) might be a good compromise to make. You could have a dedicated nested feed only for private messages. This way, (1) you have plausible deniability because the nested feed could be part of many discussions, or could be part of one huge discussion, (2) an eavesdropper can't learn who you're talking to, (3) because the nested feed is dedicated to private messages, it wouldn't hurt storage that much (not as much as SSB classic today which mixes contact with vote with post with about with git-update etc).

And (B) is kind of pointless because the purpose of private messages is privacy.

hamnox commented 1 year ago

I've been taking a look at sets/records/threads and wondering why their ancestors are kept as content instead of tangle prevs, and the thought that came to mind is "well... how is the validator going to know which is the identity tangle?" Tangles could have a type. That might simplify some replication logic. The information to put a message in context is there, but you only replicate the contexts you care about the message in.

And an encrypted tangle type could have the root point to a(n encrypted) pad for xor'ing the metadata. Or something.

ssbc / ssb2-discussion-forum