Single key/pair for multiple feed

ssbc / ssb2-discussion-forum

not quite tiny, also not quite large

17 stars 1 forks source link

Single key/pair for multiple feed #16

Open gpicron opened 1 year ago

gpicron commented 1 year ago

For me this is the main problem.

I can understand that you want to simplify, but that increase the risk of private key hacking (the more a key is used in different context and libraries, the more it is feasible to gain some access to it) and it increase the consequences if it happens.
Similarly to use that same key yet in current SSB for establishing Handshake in SHS.

Coming from an industry who care on security, there a clear and strict rule in that world : the private key of a authentication/signig key pair never leave the host where it was generated. So for me, ideally, except a master key pair whose private key is never stored on any disk (generated from a long seed user password for instance) and changed at each use, no private key of key pair should be shared on several device as a rule.

I can understand that the experience of Metafeed was not pleasant and reveal highly complex to implements on top of SSB1. But think about it: You have tried to implement a tree structure with forward links to leafs on top of a replicated log with backward links to previous message and not allowed to fork... It does mean that the concept of having tree of key/pairs and a hierarchy of identities and logs does not make sense. For me, this is more that the foundation was not allowing to implement it simply and forced to implement an additional concept of replication. @elavoie came to the same conclusion for ssb-tokens: impossible to implement if feeds cannot fork (or more strictly said, if forked feed create knowledge partition across the network).

Again, if SSB2 is only for public social chat, no problem. The consequences are no so important and honestly I'm not even sure that previous links are meaningful, and the signature is more an anonymous identification feature of the author than a certification. And establishing crypted tcp session is useless, it doesn't prevent really a interested party to discover your identity and what you are exchanging and messages them selve a yet protected against man in the middle attacks, so why spending cpu cycle to encrypt communication.

At replication level, make it simple, 2 steps only based bloom or cuckoo filters, one to exchange the list of pub keys, and second to exchange list of known messages from that keys within some time range.

But then, honestly, call it SB2 because we regress on the first S of SSB.

arj03 commented 1 year ago

I understand your concern, but no-one is forcing you to share the same key to have multi-device, you can if you want. As in classic SSB you can come up with other solutions for multi-key. Be that fusion identity or derived from a master key. The multiple previous does solve the forked feed case, even if you only use your key on a single device.

I have seen so many good developers burn out on SSB and a lot of that comes from the sheer amount of complexity (I know I keep repeating this, but it is really important for me). We need to define what a successful outcome of this process looks like. I do feel that the design your have shared does feel in a way more like a 2.0 of SSB. This design is maybe SSB 0.5 if you want. But that was sort of the goal actually. Now I have been wrong before. When we did the design of partial replication, aljoscha and keks was very keen on tangle sync. This is probably partly what lead to his master thesis. We went with metafeeds, mainly because compatibility with classic SSB was very important. Was that wrong? In hinsight, probably yes. This is why I do really enjoy these discussions and don't think we should rush to one solution.

staltz commented 1 year ago

I think your concern is quite valid @gpicron, and after all we might be creating a new problem: keypairs have never been shared across many devices before (when it has, it has led to forks).

I want to consider, though, the cost of multi-key management. It negatively affects:

The ability to refer to one person by one SSB ID
Implementation complexity in managing the many keys
Maybe more

gpicron commented 1 year ago

I do feel that the design your have shared does feel in a way more like a 2.0 of SSB.

Ok then I misunderstood the intention. I thought we were brainstorming on building SSB2 and putting SSB1 in maintenance mode.

If the intention is only to solve the issues at the top of the list of Andre then why thinking at a new feed format ? A new feed format means a lot of work to specify, share, implement and then the most difficult migrate people. Since I follow SSB I saw 4 or 5 new format with little added value appearing. None are actually used. Probably because it implies migrating all clients and they were not offering enough advantage.

replicating most recent only, is just a modification of EBT, call it EBT+
bytes on wire, just tell than now messages can be transmitted on wire in something like BSON and every clients can get a library in their language and support it in a few weeks. You can a single optional capability exchange just after the session established and clients knows if the remote support it.
accepting forks, is just a matter of replication again not of feed format.

Thinking of version 2 does not mean implementing it in one step. The process when you have to interoperate with other teams is to draw the target then trace the stepwise path to reach it. You can afford to change feed format and protocol message every month if you are building the sole client / sdk. For instance, Signal can do. But if you want to build a open Protocol, you are stuck for years with the decision you will take in the coming weeks. And if the rest of the community does not follow, it will not be ssb anymore but hard fork called Manyverse network

I don't say it is wrong. It is a choice. Just then I misunderstood.

gpicron commented 1 year ago

I understand your concern, but no-one is forcing you to share the same key to have multi-device, you can if you want. As in classic SSB you can come up with other solutions for multi-key. Be that fusion identity or derived from a master key. The multiple previous does solve the forked feed case, even if you only use your key on a single device.

I see it the other way around. As soon as, at the replication layer, we implement an algorithm that deal with forks and diffuse forked feed, this is de facto allowing reuse of same key pair on multiple devices. But this is pushing the responsibility to deal with forked feed to the upper layers...
I may be easier or more complex at the app layer depending on the context to have to deal with that added degree of freedom. Think about Post Threads and "likes" today, there is not 2 apps that treat and display them the same way to the users. It is good to have diversity, but the counterpart is that users are confused.

Having the technical capability to reuse keys across multiple device does not mean that we should promote it. Especially, here doing it is pushing to reduce the security for the users.

I have seen so many good developers burn out on SSB and a lot of that comes from the sheer amount of complexity (I know I keep repeating this, but it is really important for me).

I share that wish to simplify things. But... The complexity is not coming from the protocol. Honestly the protocol guide good and complete. It's not complex to the understand the concepts behind SSB and most developper don't care of the feed format or the replication mechanisms. From my personal experience, when starting to hack on SSB was the sdk. The stack is a bunch of components talking to each other without clear boundary and layers of abstraction, it's impossible easily tell which one you need without digging in the code of them. You add to that some usual SSB only dev primitive like pull-stream, bipf, sigils, mux rpc used both as backend interface and p2p network RPC, etc. The amount of things the developper need to understand and learn before sending a valid single message is enormous and it is not due to the protocol.

If we want to simplify things for devs, I think this is on the SDK reference architecture that we need to work, not a new feed format.

I want to consider, though, the cost of multi-key management. It negatively affects:

The ability to refer to one person by one SSB ID

Implementation complexity in managing the many keys

Maybe more

223086821-7831672a-6b30-4180-881c-afee17b28b19

The identity on the social media apps to refer to can be unambiguously defined to be the ID (hash) of the anchor message creating the identity.
That's a very common way to define key hierarchies, finding libraries that yet implement the most of the needs as a keystore API (not to tell that in most OS, you have some builtin capabilities). In JS world, you have pki.js or better the whole world of DID (for instance https://github.com/orgs/decentralized-identity/repositories?q=&type=all&language=javascript&sort=).

staltz commented 1 year ago

I'm not defending the shared keypair idea that much, I'm still open to a metafeed style such as what you're proposing. But I'm looking for the downsides of each approach (both approaches have downsides) and trying to figure out what compromise is best to do.

The identity on the social media apps to refer to can be unambiguously defined to be the ID (hash) of the anchor message creating the identity.

In your diagram, you have device 1 and device 2 identities, both created by social media identity. The missing detail in this diagram is that everything there is created on some device.

What device is the one creating the sub-identities for the other devices? We noticed this strict hierarchy in metafeeds and realized that they create a constraint, that you usually have to have some "main" device like a desktop, and this has a cost on user experience (suppose the desktop is destroyed and you only have mobile). This is what led us to create fusion-identity-spec which is more like a "flat hierarchy" of devices, but it has the problem where you can't refer to all devices using a single ID.

Specifically, in your envisioned design, how does the user:

Create a new device ID from any other device account?
Elect their "phone account" to be the custodian of "master identity" after the "desktop account" explodes?
While still preserving the ability to have a single SSB ID across many devices?

And a more general question: what property does you design sacrifice/compromise?

gpicron commented 1 year ago

You are right. There is a problem with the Social identity.

Note 1: I wrote it somewhere else, the Master Key is "never store" key. It is generated from a user long password, never stored and can be used only a very limited number of time before being rotated. So, from this one, you can create a Social Identity Key on any device.

The drawback as you mention it that Social Identity Key should be on only 1 device. That's wrong.

Let's try another thing: What is we don't derive the Identity from some key material but instead link them using the Master Key. DNS are not derived from the certificates keys...

Let say the Social Identity is a 256-bit random number. An using the Master Key, you create Device Key with reference to that Social Identity number. We can flatten the hierarchy without going to the complexity of pairing protocol as in fusion-identity.

That implies that messages related to the social identity would go back to the Device feeds and need to be managed as CRDT (OR-Set probably for the Follow/Block and LWW for the profile status and picture)

Create a new device ID from any other device account?

User need to use its password to create the first anchor of the device feed. That also "increment" a usage counter. When the counter reach a threshold (let say 3), the only next message that can be signed with that key is a new anchor to rotate it.

Elect their "phone account" to be the custodian of "master identity" after the "desktop account" explodes?

No need to elect a custodian

While still preserving the ability to have a single SSB ID across many devices?

The ID is not linked to any device Key. It is a abstract random 256 bit number.

The sacrifice is to impose to the user to remember a long password and to renew it periodically.

staltz commented 1 year ago

I wrote it somewhere else, the Master Key is "never store" key. It is generated from a user long password, never stored and can be used only a very limited number of time before being rotated.

Oh, I forgot about this, thanks for the reminder. Yes, it does change the game significantly, if the master key is in the user's brain.

Now, can you help me understand more details? You said

Let say the Social Identity is a 256-bit random number. An using the Master Key, you create Device Key with reference to that Social Identity number.

How do these "references" look like? Do you mean like an HKDF? Remember that when we talk about "keys" here, these are actually asymmetric keypairs, so there's the public part and the private part. Other peers need to use your public key as your cross-device-stable ID. So is the Social Identity number (SI) also a keypair? Doesn't seem like it is, because you said 256-bit random number.

This means that I could probably just pick a custom SI instead of randomizing it, and people might pick a custom SI that looks cute when encoded in base64. People already try to do this today in SSB, they can brute-force generate good looking IDs, but it takes so much computational resources, that you can only do that for the first few characters of the base64-encoded public key. Now it becomes much easier to do this with SI where you don't have to brute-force it, you can just directly do it. I believe this would lead to a lot of people choosing their own good looking SI, which is okay, but it also raises the risk of SI collisions.

gpicron commented 1 year ago

Social Identity number is not a keypair. You can consider it as just a contextual sub-identifier of the Master Identity for the Social Media apps.

Actually, we can imagine to propose something like 10 user-chosen printable ascii char + remaining random bytes in the UI for instance to avoid most collisions.

The real unique id for the context "social media" is the compound (pub master key that initially signed it + that ID).

You could even avoid it completely and sign directly the Device keys with the Master Key but I find it interesting to have some contextual information and also you may imagine in the future to have an ID for personal and an ID for professional for instance and therefore 2 Device Key per device for signing posts. This is more or less just an alias. Like a domain name is an alias for IP's in certificates.

staltz commented 1 year ago

I like more the idea that the master keypair defines your stable ID, because it avoids people messing up (too much) with brute forcing good looking IDs. This doesn't answer how to do key rotation, but at least we still have solved the one-keypair-per-device goal.

gpicron commented 1 year ago

Key rotation: anchor signed by the previous key.

Rules:

a device key should be used only for signing up to 500 KB of message
a device key cannot be used more than 6 months

Incentives to respect:

When B detect that A reached one of those limit, it delays the transmission to other peers any new message by a time proportional to the overflow and at least 1 day
To prevent one to generate too often rotation, we add to the generation of a new key a crypto puzzle whose difficulty is proportional to how early before the limits the key is generated

staltz commented 1 year ago

Let's see if I'm understanding this correctly (still no rotation included in this diagram):

graph TB
subgraph in-memory on-demand
direction TB
password[user-chosen password]
master[master keypair]
social[social keypair]
end

subgraph on-disk device 1
device1[device1 keypair]
end

subgraph on-disk device 2
device2[device2 keypair]
end

password--"keypair seed = hash(password)"-->master
master--"HKDF"-->social
social--"HKDF"-->device1
social--"HKDF"-->device2

What's the purpose of "social"? Why can't the master keypair directly derive the device1 and device2 keypairs?

gpicron commented 1 year ago

Non social keypair.

Social Identity number is not a keypair. You can consider it as just a contextual sub-identifier of the Master Identity for the Social Media apps.

Actually, we can imagine to propose something like 10 user-chosen printable ascii char + remaining random bytes in the UI for instance to avoid most collisions.

The real unique id for the context "social media" is the compound (pub master key that initially signed it + that ID).

You could even avoid it completely and sign directly the Device keys with the Master Key but I find it interesting to have some contextual information and also you may imagine in the future to have an ID for personal and an ID for professional for instance and therefore 2 Device Key per device for signing posts. This is more or less just an alias. Like a domain name is an alias for IP's in certificates.

staltz commented 1 year ago

So I've been thinking about this, and the tradeoff is difficult.

Option A: Each device has its own keypair

If we go for each device with its own keypair, then it means that peers who want to replicate my posts have to replicate:

My device 1 posts feed
My device 2 posts feed
etc

And each of these feeds has a different cryptographic identifier, so we end up needing to discover those IDs before doing actual content replication. How do we announce those IDs? If you go for a metafeeds-style solution, where some parent feed announces the existence of each device posts feed, then you have these problems:

Overhead in replicating these announcement messages
- What's the best way of hierarchically organizing these announcements? We studied that thoroughly and arrived at a non-trivial tree structure
It's hierarchical, and some device has to publish the announcements, leading to electing one of these devices as the "leader" to publish the announcements, which restricts the flexibility in app design and user experience
Given a device feed ID, it is not immediately obvious what is the "master" ID or "social" ID that owns it, so complex tree traversal has to be performed

These 3 points above are concrete issues we stumbled upon when working on private groups with metafeeds, and both arj and I would like to drop metafeeds, for something simpler.

Option B: All devices share the same keypair

This has the benefit of only one cryptographic identifier, so no need to announce child IDs, or the organize them in a tree and so forth. But the problem is that it breaks the linear feed structure and introduces a DAG for the feed, which has these problems:

We lose the benefit of referring to messages with sequence numbers, and ranges of sequence numbers (useful for both DAG sync, and for PREDSL)
The implementation of a DAG feed is probably going to be complex, have many corner cases, and break several assumptions that so far have been battle-tested in SSB
Merges in the DAG are probably highly dependent on the specific message type at hand, and may require custom logic for each corner case

Conclusion

I am actually not sure which of these two paths to take, but I'd like to explore option B. And I mean with proof of concepts and experiments, in such a way that we don't need to commit to it at this point, we can just learn more about its problems and how grave they are.

PS: about key rotations

I also thought about key rotations in the shower, and I don't think they have to be done so formally and natively inside the system. If someone believes their keypair is becoming weak, they can just start over with a new account. And in the future, we still have possibilities of automating this renewal as protocol improvements.

At least for what I'm working on in PPPPP, the use case is social and casual, not security critical such as banking, healthcare, corporate, etc. There is not a strong story for enforcing 6-month key rotation in such a casual space, especially when you can just manually rotate the keys by creating a new account. Also, we have cryptography experts in the SSB community who have had the same ed25519 keypair for years, they don't seem concerned.

In other words, key rotation is an issue I think we can safely postpone working on. Emphasis on "postpone", not abandon. :)

gpicron commented 1 year ago

Option À:

Try to make the distinction between identity concept and keypairs. They are only loosely coupled. And with the proposal illustrated earlier you don't need a formal feed to announce new feeds and you rely on the gossip nature of the replication to ensure the dissemination of the knowledge of new feeds.

Partial replication:

Let say Alice want to replicate Bob.

When Alice connect to Charlie it will not ask him for a list of device feeds but for all feeds linked to the social identity number of Bob.

Make abstraction of the messages between the anchors. From the social identity number. It forms a DAG with the identity as root and one branch per device.

So the minimum one need to keep in its db to follow Bob is

the Master key "feed" which is not containing announcement of new key pairs but only "master key" rotations and "ban previously" signed keypairs announcements.
the full DAG of anchors from the Identity node.

If we say for instance that a anchor must be emitted at max every 3 months (which means that messages that are signed with the key pair of an anchor with a timestamp delta of 3 months + 1 day are invalid and not diffuse, that means the amount of data to keep for that remain small and compatible with your objectives.

In addition they keep the full messages DAG's of one or more anchors (probably the latest ones)

Back to Alice and Charlie Let's limit for now to 1 identity in common interest to simplify.

what Alice says to Charlie is "I want all messages from Bob identity number since some timestamp"

Charlie do the same and if he is also has interest for Bob they go to the second phase.

Phase 2 Alice and Charlie will send to each other what they know about Bob yet with regards to what they want . That can represented as a Bloom filter or à Bloom clock or any probabilistic set of message I'd structure that contains:

all anchors known for Bob
all messages after the minimum of the 2 timestamps
all messages filling the gap of "previous" to the nearest Anchor

Upon receiving that from Alice, Charlie can deduce:

if Alice knows more than him about B so can expect to receive new messages from Bob from Alice
if Alice is missing some anchors and some messages after the requested timestamp by Alice

And so send to Alice the missing.

In short, we keep a simpler forward replication with forks of all anchors rooted a the social identity + plus all messages from the latest anchor per device before the requested timestamp.

I think it reduce the complexity of the replication, maintain the chain validation (anchor are like lipma links), make diffusion of new device feed awareness transparent, maintain a user controlled storage capacity capability.

staltz commented 1 year ago

Thanks for the explanation, this answers some things but leaves other things confusing for me, so I'll keep on asking.

And with the proposal illustrated earlier you don't need a formal feed to announce new feeds and you rely on the gossip nature of the replication to ensure the dissemination of the knowledge of new feeds.

I understand this part, and yes it seems like it could work.

Try to make the distinction between identity concept and keypairs.

You have explained this before, but I still don't understand how this "social identity number" would be not brute-forceable to arrive at aesthetic IDs (which are more prone to collision). I also think it's unnecessary, and we can just use the master keypair (derived from a password). If you want different contexts, then you can just create a new master keypair from a different password. Seems reasonable that you would have different passwords for your "personal" and "professional" accounts, doesn't it? And it simplifies the system if we remove this component.

When Alice connect to Charlie it will not ask him for a list of device feeds but for all feeds linked to the social identity number of Bob.

This idea works fairly well in the case of fetching Bob's "desktop feed" and "mobile feed" and maybe "tablet feed", but there are more use cases for nested feeds. For instance games and other productivity applications that belong to the same person (or master account). And we have to take into account that some game developers will naively (or on purpose) create too many subfeeds. So just as a litmus test, assume that Bob has 100 subfeeds. And that Charlie knows/has 80 of those.

This idea doesn't give Alice the chance to replicate only the parts that she wants, she has to download everything that Charlie knows about Bob (which are 80 feeds), and then after she has downloaded it, filter for the ones that are relevant to her (say, only 3 feeds: desktop + mobile + tablet), and then persist only the relevant ones. It can be quite wasteful in bandwidth.

Make abstraction of the messages between the anchors. From the social identity number. ... In short, we keep a simpler forward replication with forks of all anchors rooted a the social identity + plus all messages from the latest anchor per device before the requested timestamp.

These sentences made no sense to me, can you clarify?

"I want all messages from Bob identity number since some timestamp"

I'm worried about relying on timestamps for replication. As you probably know, timestamps are not reliable in a distributed system because each peer can generate their own timestamps and there is no guarantee of order, unless all peers get their timestamps from a centralized timestamp server.

I used timestamps for have-ranges and want-ranges in DAG sync for threads+replies, and felt that we needed something better, maybe vector clocks or bloom clocks. Anyway, for the kind of replication you described, I think they would leave too many corner cases and the system could easily be gamed by crafting custom timestamps.

gpicron commented 1 year ago

I also think it's unnecessary, and we can just use the master keypair (derived from a password). If you want different contexts, then you can just create a new master keypair from a different password. Seems reasonable that you would have different passwords for your "personal" and "professional" accounts, doesn't it? And it simplifies the system if we remove this component.

This idea works fairly well in the case of fetching Bob's "desktop feed" and "mobile feed" and maybe "tablet feed", but there are more use cases for nested feeds. For instance games and other productivity applications that belong to the same person (or master account). And we have to take into account that some game developers will naively (or on purpose) create too many subfeeds. So just as a litmus test, assume that Bob has 100 subfeeds. And that Charlie knows/has 80 of those.

This idea doesn't give Alice the chance to replicate only the parts that she wants, she has to download everything that Charlie knows about Bob (which are 80 feeds), and then after she has downloaded it, filter for the ones that are relevant to her (say, only 3 feeds: desktop + mobile + tablet), and then persist only the relevant ones. It can be quite wasteful in bandwidth.

Actually you are pointing exactly the reason to be of the Identity ID abstraction to create different contexts without having to remember several password.

1 Social ID for microblogging using Manyverse, Planetary, etc. (and so 1 device feed per device/app doing microblogging) 1 Social ID for chess game

So for instance, let say I use Manyverse and Planetary Desktop on my laptop, Manyverse on tablet and Planetary on my phone. I use also a Chess App on my tablet and on my phone.

I have

1 master key (password)
2 social identity ID, 1 for microblogging and 1 for the Chess App.
4 device feeds for microblogging (2 on my desktop, 1 on my tablet and 1 on my phone) (those are only replicated by the microblogging apps as they are rooted to my 'microblogging' social identity ID, so Manyverse and Planetary)
2 device feeds for chess (tablet and phone) rooted to my 'chess' social identity ID and replicated only by the Chess App

etc. So Manyverse will never replicate data of a chess game and the chess game app will never replicate posts. So you don't need anymore subfeeds. Moreover, as we want Feed to be forkable, for a chess game for instance, you can create a branch for each game without having to create subfeeds and manage additional key pairs.

You have explained this before, but I still don't understand how this "social identity number" would be not brute-forceable to arrive at aesthetic IDs (which are more prone to collision).

Simple solution: let say for instance that the rule to generate that ID is to apply SHA_256( CONCAT( INITIAL PUBLIC MASTER KEY + UTF8 ENCODED CONTEXT ), with a registry in the SSBC github of the know CONTEXT. For instance, 'microblogging', 'chess', 'git', 'npm', etc.

For the remaining question, I will make a diagram Friday to explain.

staltz commented 1 year ago

That's very decent, thanks for the explanation, I'm closer to buying the idea of contexts.

SHA_256( CONCAT( INITIAL PUBLIC MASTER KEY + UTF8 ENCODED CONTEXT )

Thanks.

Some follow up questions:

It's important that these contexts are not entirely separate, though. For instance, one concrete use case is to start with your microblogging app, and then install a new chess app that "logs in" with your microblogging context ID. This is important to load up your list of friends from the microblogging app, into the chess app. So you can propose new games with people that you already follow elsewhere.

How would the chess app associate or link the IDs of friends in the microblogging context, with IDs of peers in the chess context?

graph TB

master-->micro[microbloggingContext] & chess[chessContext]
micro-->m1[desktop feed] & m3[phone feed] & m4[tablet feed]
chess-->c1[desktop feed]

If Alice follows Bob in the microbloggingContext, she has published a follow msg in that context pointing to Bob's microblogging context ID. Or would Alice point to Bob's master public key? Because in the chess context, Alice can discover Bob's chess context ID only with Bob's master public key. Given Bob's microblogging context ID, Alice cannot discover what is Bob's chess context ID, which is necessary for her to automatically fetch Bob's chess content.

Overall this "traversal" of parent IDs can be cumbersome.

staltz commented 1 year ago

I've been thinking about Option A versus Option B, and I think it would be a lot simpler if we went for Option A (Geoffrey's identity tree), because DAG-structured feeds are a new beast that open up a lot of unknown unknowns, and might make debugging (remote peers) very challenging.

When we started with minibutt, two goals were:

Support multi-device use
Support account recovery without risk of forking

I think Geoffrey's identity tree solves the two issues. Well, the multi-device goal is solved by design, but the account recovery is also an interesting bonus: we don't ever need to "recover" a feed and continue writing to it, we just need to create new device feeds (even if it's the "same" hardware device, after app uninstall and re-install) that should be automatically horizontally tied to the other device feeds.

gpicron commented 1 year ago

Personally I think it is not a good idea to try to implement this crossing a the replication layer. Actually the replication layer don't need the follow / block messages. It receives that information from the upper layer of the sdk. So follow/block messages is a social graph protocol on top of a feed replication protocol.

So there are several potential options:

Having a "social graph" context and associated feeds dedicated to follow/block messages
Having uri based invites
Integrate with phone contact services.

staltz commented 1 year ago

@gpicron I should clarify that, since we're going to have nested feeds (reminder: this is similar to Bamboo's log_ids), the "phone feed" is actually not one feed, but many. One feed for each message type. So there would exist:

"social-phone-contact" feed
"social-phone-post" feed
"social-phone-vote" feed
"social-tablet-contact" feed
"social-desktop-post" feed
etc

This makes it a bit more plausible to replicate "social-*-contact" feeds in the chess app.

I've been thinking about Option A and Option B, and the more I think about it, the more I like Option A (The IDs tree).

Now I've identified one problem already that we need to validate/investigate, but there is another problem too.

Let's call the first problem, mentioned in the comment above, as the :one: Borrow problem, where one app in a separate context (e.g. chess) needs to borrow content from another context (e.g. social). I'm using the word Borrow just because it's short and easy to remember, but the analogy is not to be taken seriously, because there is no "returning" of content.

The second problem is new: If we indeed have several "social-*-contact" feeds, then how do we apply PREDSL with it? Let's call this problem :two: Multifeed PREDSL. Reminder: PREDSL is about publishing messages on a feed such that the "reduce" of these messages creates a data structure, and then the associated deletion of old messages that are irrelevant when reducing this event sourcing system, as well as the active rewriting of old messages to new messages, such that you can tighten the "sliding window" for sliced replication.

Multifeed PREDSL is challenging, because each "social-*-contact" PREDSL feed represents a separate data structure, but what do we do to represent a single data structure? Do we do a "union" of all the individual data structures?

As a concrete example, suppose you have a feed social-phone-follows which only publishes "follow" and "unfollow" messages on your phone. It has PREDSL so that the resulting data structure is a (mathematical) Set, representing all the accounts you follow. Then suppose you have a feed social-desktop-follows which has a similar system, and the resulting data structure is another Set.

The problem is: how do we combine these data structures such that in the user interface they appear as a single data structure? Suppose I unfollow someone on my desktop, then how does that propagate to my phone?

Depending on the data structure, this problem is easier or harder. With follow messages, you end up with one mathematical Set per device, but you shouldn't do the Union of them because that means that if you unfollowed Bob on your phone, but still follow Bob on your desktop, then you're still effectively following Bob. This assumes you could publish a "follow Bob" on both devices. Do we have a way of ensuring that Bob is followed on only one device? Then we could use Union. But I don't think we can, because of offline-friendly use cases, where I could follow Bob on my phone, forget to sync that with my desktop, and then follow him on my desktop.

In this example, we can't use Intersection either because then it assumes that I MUST follow Bob on all devices.

gpicron commented 1 year ago

Let's call the first problem, mentioned in the comment above, as the 1️⃣ Borrow problem, where one app in a separate context (e.g. chess) needs to borrow content from another context (e.g. social). I'm using the word Borrow just because it's short and easy to remember, but the analogy is not to be taken seriously, because there is no "returning" of content.

For this problem. What about :

"social-contact" context (so one ID, and derived feeds per device/app) "social-post" context (so one ID, and derived feeds per device/app) "social-vote" context (so one ID, and derived feeds per device/app)

etc

Then for the PREDSL :

Do you remember ssb-archive proposal ? the output of "SSB-archive" can be a compressed file of the messages for context "social-phone-post" (maybe taking into a deletes and updates) but can be a different structure. Think as a snapshot of aggregate in event sourcing design pattern.

So combining all. The "anchor" messages can be at the same time as "ssb-archive" messages, and the snapshot can be contextually defined.

So for the example in the context "social-contact" (follows/blocks) , the snapshot is 2 lists (blocks and follows), it is the result of last anchor snapshot + all messages until the new anchor within the device/app feed.

For an app, it will be the union of the aggregates per feed corresponding to a context ID of given type.
In the "social-contact" case this is union of 'blocks' ORSet and 'follows' ORSet (CRDT). That way, you can send a "follow X" on the feed on your desktop. An observer of you context ID of type "social-contact", will actually replicate the corresponding feeds on your desktop and on your mobile and can easyly compute the union of the ORSet of each in one ORSet. and see you are following X . In the ORSet you have 2 counters per entry X, one for the number of "follow X" seen and one for the number of "stop follow X" seen. If the first counter is larger than the second -> "follow X" else not.

Then if you send a "remove follow X" on your mobile, the same observer will see it eventually and understand that you stopped following X.

Note: using that solution, if you send a "follow X" on both your phone and your desktop, you will have to send 2 "remove follow X" (can be either on one of the device or on one in each device, it not making a difference). When doing the snapshot at anchor time, you drop entries X where the number of "follow X" equals the number of "stop follow X" to compress the data.

There are often solutions with CRDT structs.

gpicron commented 1 year ago

Ok, I think I was bad at explaining myself... I see a lot of what I tried to propose earlier. Note among other on the generalization of "backlinks" https://github.com/ssbc/ssb2-discussion-forum/issues/10#issuecomment-1455207795 and a generalization of the anchor principles and replication principles in https://gpicron.github.io/ssb-clock-spec/

There is a lot and I will try to comment in order:

Attention point 1

In the first message, you forgot one key point: Timestamps.

Yes timestamps are unreliable to express a verifiable causality as such but... There is one verification rules that is important to keep in mind and can be very useful.

the timestamp of a message MUST be strictly larger than any message referred by the backlinks (previous, tangles, etc.) to be valid
the timestamp MUST be smaller than the receiver own clock. (during the replication, the receiver of a message with a timestamp larger than its own local clock should at least delay it until its own local clock reach the timestamp)

Actually, that can be seen as a local consensus of the elapsed time since the root of tangle or since any "anchor" joining several branches more generally. That's an important information that you can use:

for defining a rule about the frequency of creation of anchors or "skiplinks" based on time elapsed instead of number of messages.
exchanging "want-range" and "have-range" using root id + timestamp
self-determining active tangles that worth OOR replication

Attention point 2

I'm not convinced that lipma "jumps" are needed, maintaining the anchors chains has a minimal memory footprint and make life easier. https://github.com/ssbc/ssb2-discussion-forum/issues/18#issuecomment-1486575922

Attention point 3

I didn't made the graphs yet to clarify as promise in https://github.com/ssbc/ssb2-discussion-forum/issues/16#issuecomment-1480366242. I will do now and show how it generalizes for "everything is tangle"

Replication of feeds based on anchors

As a reminder, we were at that point in our discussion:

This is about the same as, except that I think that same keys signing regular should not be stored on several devices and anchors are the opportunity for key rotation and snapshots CRDT (or ssb-archive).

After the lunch break now. Put on your seat belts, because I'm gonna mix everything together.

When maintaining a single-author tangle (a "feed"), we need sliced replication. See #19 as an example. We need #18, sliced replication. One simple system is to have a some "anchors" which are the black nodes below.
---
title: Regularly space "anchors"
---
graph RL
J-->I-->H-->F & G
F & G-->E-->D
D-->C & B
B & C-->A

A:::rib E:::rib I:::rib

classDef default fill:#6df,stroke:#fff0,color:#000 classDef rib fill:#000,stroke:#fff0,color:#fff


My legend for subsequent graphs.

Note: 
* CID (blue) is not a message but the Context ID (generated by 256 bit hashing the password-based Master Key and the Social Context name)
* Anchor (black) is a message that is signed by the Master Key initially, then by the previous anchor key and contains the new key and the context ID (dark link), there is backlinks to the previous anchor (dotted) and previous message(s) (solid).

```mermaid
graph RL

Ab --> y --previous--> x --> Aa
Ab -.anchor.-> Aa ==context==> CID
Ab == context ==> CID

CID:::context
Aa:::anchor
Ab:::anchor
classDef anchor fill:#000,stroke:#fff0,color:#fff
classDef context fill:blue,stroke:#fff0,color:#fff
classDef master fill:red,stroke:#fff0,color:#fff

Showing a Device feed that has forked:


graph RL

subgraph MK [Master Key]
    MK2 --------> MK1
end

subgraph SC1 [Social Context 1]
    subgraph CID1 [ ]
        direction BT
        CTXID1 ~~~ MK1        

    end

    subgraph D1 [Device Feed 1]
        G --> L --> D1K3 --> D & F --> C --> D1K2 --> B --> A --> D1K1 --> MK1
        D1K3 -.-> D1K2 -.-> D1K1
        D1K1 ==> CTXID1
        D1K2 ==> CTXID1
        D1K3 ==> CTXID1
    end

    subgraph D2 [Device Feed 2]
        D2K3 --> K --> J --> D2K2 --> I --> H --> D2K1 --> MK1
        D2K3 -.-> D2K2 -.-> D2K1
        D2K1 ==> CTXID1
        D2K2 ==> CTXID1
        D2K3 ==> CTXID1
    end
end

D1K3:::anchor
D1K2:::anchor
D1K1:::anchor
D2K3:::anchor
D2K2:::anchor
D2K1:::anchor

CTXID1:::context

MK1:::master
MK2:::master

classDef anchor fill:#000,stroke:#fff0,color:#fff
classDef context fill:blue,stroke:#fff0,color:#fff
classDef master fill:red,stroke:#fff0,color:#fff
classDef hide fill:#0000,stroke:#0000,stroke-width:0px,color:#0000;stroke-opacity:0

Sliced replication using the algorithm proprosed earlier

During the initial exchange each peer send the list of CTXID it is replicating (possibly using a BF) and a timestamp in the past they want to replicate from. The common timestamp is the largest one of the two wish (max)

Then, each peer computes their the intersection of CTXID for which they have a common interest. Then can can determine the oldest anchor before the timestamp requested.

So let's say that Alice has the feed illustrated before. It receives from Bob the CTXID corresponding in his "want list" and the max timestamp is between the timestamp of D nad F on feed D1 and between J and K in Device Feed D2. The oldest anchors before the timestamp are D1K2 and D2K2.

Alice can compute a Bloom Filter (or a Bloom Clock) containing all anchors since the root of the feeds and all messages she know after the anchors D1K2 and D2K2. So D1K1, D1K2, C, F, D1K3, L, G, D2K1, D2K2, J, K, D2K3 + all messages of the Master Key feed (remember that this is the feed that can "kill" a feed when the feed key is lost, so it is important to ensure that it is alway replicated fully as soon as possible but it does not contains a lot of messages, just long term key rotation and kills)

Bob will do exactly the same.

At this point, they have both enough information to determine which messages they have that the other wish and send them to each other.

Generalization to tangles

Let say this is what Alice has in its database for User A and User B.


graph RL

subgraph MKA [Master Key - USER A]
    MKA2 --------> MKA1
end

subgraph SCA1 [Social Context 1 for A]
    subgraph CID1 [ ]
        direction TB
        CTXID1 ~~~ MKA1        

    end

    subgraph D1 [Device Feed 1]
        G --> L --> D1K3 --> D & F --> C --> D1K2 --> B --> A --> D1K1 --> MKA1
        D1K3 -.-> D1K2 -.-> D1K1
        D1K1 ==> CTXID1
        D1K2 ==> CTXID1
        D1K3 ==> CTXID1
    end

end

subgraph MKB [Master Key - USER B]
    MKB2 --------> MKB1
end

subgraph SCB1 [Social Context 1 for B]
    subgraph CID2 [ ]
        direction TB
        CTXID2 ~~~ MKB1        

    end

    subgraph D2 [Device Feed 2]
        D2K3 --> K --> J --> D2K2 --> I --> H --> D2K1 --> MKB1
        D2K3 -.-> D2K2 -.-> D2K1
        D2K1 ==> CTXID2
        D2K2 ==> CTXID2
        D2K3 ==> CTXID2
    end
end

K --reply--> D

D1K3:::anchor
D1K2:::anchor
D1K1:::anchor
D2K3:::anchor
D2K2:::anchor
D2K1:::anchor

CTXID1:::context
CTXID2:::context

MKA1:::master
MKA2:::master

MKB1:::master
MKB2:::master

classDef anchor fill:#000,stroke:#fff0,color:#fff
classDef context fill:blue,stroke:#fff0,color:#fff
classDef master fill:red,stroke:#fff0,color:#fff
classDef hide fill:#0000,stroke:#0000,stroke-width:0px,color:#0000;stroke-opacity:0

Scenario 1:

Bob tell Alice he wish to replicate User A feeds (so CTXID1) and the timestamp is between C and D1K2.

Following the logic explained before, Alice would send a BF containing D1K1, D1K2,C,D,F, D1K3, L, G, MKA1 and MKA2. But Alice knows D received a reply from user B after the timestamp of D, so after the anchor D1K2 (because of the rule that timestamp of messages must be strictly larger than the ts of the backlinked messages). So Alice will actually include the message K, J, D2K2, D1K2, MKB1, MKB2 in the BF.

Even if the K is not part feeds of the common Context ID set negociated in the first phase, it is logically included in the common set of messages to sync.

Scenario 2:

Bob tell Alice he wish to replicate User B feeds (so CTXID2) and the timestamp is between J and K.

The common set of message to sync is MKB1, MKB2, D2K1, D2K2, J, K, D2K3 + D, C, D1K2, D1K1, MKA1, MKA2.