subconsciousnetwork / noosphere

Noosphere is a protocol for thought; let's discover it together!
Apache License 2.0

Question about multi-device / conflict resolution. #472

Open Nuhvi opened 1 year ago

Nuhvi commented 1 year ago

I have been reading explainer.md, and it appears to me that the following statements are true:

  1. Spheres are meant to be append-only like git where each update needs to be appended to a previous block/commit.
  2. There are no conflicts to be resolved.

However, these two statements can only hold if the writer always has access to the latest head of the log, which (I assume) is only possible in one of these cases:

  1. There is only one device/process writing to any Sphere.
  2. There is a server or a peer that is always on (like Github) helping devices perform CAS.
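The compare-and-swap (CAS) requirement in case 2 can be sketched as a head pointer that only advances when the writer proves it saw the current head. This is a hypothetical illustration, not Noosphere's API:

```python
class SphereLog:
    """Append-only log whose head can only advance via compare-and-swap.

    Hypothetical illustration; not Noosphere's actual API.
    """

    def __init__(self):
        self.commits = {}  # commit id -> parent commit id
        self.head = None   # id of the latest commit

    def compare_and_swap(self, expected_head, new_commit_id):
        """Append only if the caller has seen the current head."""
        if self.head != expected_head:
            return False  # stale writer: must fetch and rebase first
        self.commits[new_commit_id] = expected_head
        self.head = new_commit_id
        return True

log = SphereLog()
assert log.compare_and_swap(None, "c1")      # first commit succeeds
assert log.compare_and_swap("c1", "c2")      # writer saw c1: succeeds
assert not log.compare_and_swap("c1", "c3")  # stale writer is rejected
```

Without an always-on party to serialize these CAS attempts, two offline writers can each believe they hold the latest head.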

The problems with single device editing are:

  1. Deleting the app and then reinstalling it still puts you at risk of writing to a zero-commit branch before peers connect and make the older branch available, in which case you need to rebase and manage conflicts.
    • I guess you can always create a new Sphere and switch pointers using the DHT records.
  2. You can't use Noosphere in any other app while maintaining a unified identity, unless readers are tasked with reading multiple Spheres at a time, in which case they will have to apply some conflict-resolution algorithm, even if it is last-write-wins.
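The last-write-wins resolution mentioned in point 2 could look roughly like this (every name here is hypothetical):

```python
def merge_lww(spheres):
    """Merge slug -> (timestamp, value) maps read from several Spheres,
    keeping the entry with the highest timestamp per slug; ties are broken
    by sphere id so every reader converges on the same winner."""
    merged = {}  # slug -> (timestamp, sphere_id, value)
    for sphere_id, entries in spheres.items():
        for slug, (ts, value) in entries.items():
            candidate = (ts, sphere_id, value)
            if slug not in merged or candidate[:2] > merged[slug][:2]:
                merged[slug] = candidate
    return {slug: value for slug, (_, _, value) in merged.items()}

spheres = {
    "sphere-phone":  {"journal": (2, "edited on phone")},
    "sphere-laptop": {"journal": (1, "edited on laptop"),
                      "recipes": (5, "from laptop")},
}
assert merge_lww(spheres) == {"journal": "edited on phone",
                              "recipes": "from laptop"}
```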

So basically, my question is: which of these conclusions (if any) are right?

cdata commented 1 year ago

@Nuhvi thanks for the inquiry!

There is a server or a peer that is always on (like Github) helping devices perform CAS.

This. Synchronization is mediated by a gateway (e.g., the one that runs when you use our CLI orb serve) which plays a similar role to a git server. The gateway enforces that any new changes by authorized clients are based on history the gateway already knows about (currently that is the extent of any enforcement). There is a fetch/rebase/push flow that occurs as a part of synchronization with the gateway so that clients have an opportunity to reconcile against the latest canonical history before pushing new changes. The gateway, in turn, is tasked with coordinating between IPFS and the Noosphere name system to ensure that the latest history is more broadly available in the network.
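The fetch/rebase/push flow described above can be sketched with histories modeled as plain lists of commit ids (a toy model; the function and its behavior are assumptions, not Noosphere's actual API):

```python
def synchronize(base, pending, remote):
    """Fetch/rebase/push sketch with histories as plain commit-id lists.

    base:    the remote history the client last saw
    pending: local commits not yet pushed
    remote:  the gateway's canonical history at fetch time

    The gateway only accepts pushes based on history it already knows,
    so the client rebases its pending commits onto the remote head first.
    """
    # fetch: confirm our base is a prefix of the canonical history
    if remote[:len(base)] != base:
        raise ValueError("histories diverged below the shared base")
    # rebase: replay pending commits on top of the remote head; mark
    # them with a prime when the remote moved ahead, as git would rewrite
    moved = remote != base
    rebased = [c + "'" if moved else c for c in pending]
    # push: accepted because the result extends the gateway's head
    return remote + rebased

assert synchronize(["a"], ["x"], ["a", "b"]) == ["a", "b", "x'"]
assert synchronize(["a"], ["x"], ["a"]) == ["a", "x"]
```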

Note that synchronization is potentially lossy: it is possible for client A to overwrite a change by client B if both of them wrote to the same slug at overlapping Lamport time. But our current model accepts this trade-off for now.
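For context, a minimal Lamport clock (illustrative only, not Noosphere's implementation) shows why two clients that never exchange messages can stamp writes with overlapping times:

```python
class LamportClock:
    """Logical clock: local events tick, received messages fast-forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        self.time += 1
        return self.time

    def observe(self, other_time):
        # on receiving a message, jump past the sender's timestamp
        self.time = max(self.time, other_time)

# Two clients that never exchange messages can stamp writes to the same
# slug with the same Lamport time, so neither write causally "wins".
a, b = LamportClock(), LamportClock()
assert a.tick() == b.tick() == 1  # concurrent, indistinguishable by time
```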

We have speculated about ways the protocol may evolve to support direct P2P synchronization flows, but do not have any conviction about the best path here. And, the current arrangement (with the gateway mediating changes to history) suits the Subconscious use case, so we won't be rushing to implement such synchronization before its time. Suggestions and/or further inquiry on this topic are always welcome, however.

Here is a diagram that I made as part of our WIP documentation efforts. It is meant to explain how our nascent managed infrastructure is arranged, but includes some topical detail related to synchronization:

[Image: Subconscious_Cloud_Diagram]

Nuhvi commented 1 year ago

@cdata All reasonable, although I have a few comments:

  1. I don't think version control at the base layer is really needed, but it is a common bias; for example, AT Protocol shaped their data stores like git, only to end up rebasing them frequently to reclaim storage.
  2. Sync, which is probably the reason we go to git in the first place, can be done with a rather simple key-value CRDT, if you are willing to accept a size complexity of O(devices per user * slugs per sphere).
  3. I really don't understand the name system, and would encourage you to instead have a good old BEP 44 record saying "find my root at this gateway". That way, users need to check the DHT very infrequently, updating the root of the sphere can be as fast and frequent as the gateway can update its internal database, and requests can be as fast as the ping time to that gateway.
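The key-value CRDT suggested in point 2 might be sketched as a map of per-device last-write-wins registers, whose state size is O(devices per user * slugs per sphere). Every name here is hypothetical:

```python
class KvCrdt:
    """State-based key/value CRDT: one (counter, value) entry per
    (device, slug) pair. Merge is commutative, associative, and
    idempotent, so replicas converge in any sync order."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.counter = 0
        self.entries = {}  # slug -> {device_id: (counter, value)}

    def set(self, slug, value):
        self.counter += 1
        self.entries.setdefault(slug, {})[self.device_id] = (self.counter, value)

    def merge(self, other):
        for slug, per_device in other.entries.items():
            mine = self.entries.setdefault(slug, {})
            for dev, (cnt, val) in per_device.items():
                if dev not in mine or cnt > mine[dev][0]:
                    mine[dev] = (cnt, val)

    def get(self, slug):
        per_device = self.entries.get(slug, {})
        if not per_device:
            return None
        # LWW read: highest counter wins; device id breaks ties deterministically
        _, _, value = max((cnt, dev, val)
                          for dev, (cnt, val) in per_device.items())
        return value

phone, laptop = KvCrdt("phone"), KvCrdt("laptop")
phone.set("journal", "v1")
laptop.set("journal", "v2")
phone.merge(laptop)
laptop.merge(phone)
assert phone.get("journal") == laptop.get("journal")  # replicas converge
```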

I love the visual style of this diagram.

cdata commented 1 year ago

I don't think version control at the base layer is really needed, but it is a common bias; for example, AT Protocol shaped their data stores like git, only to end up rebasing them frequently to reclaim storage.

That's an interesting perspective. As with the rest of synchronization, I see what we have now as a starting point. And, yah, I'm anxious about storage cost over time as well, so I could see us wandering down the same path by necessity.

Sync, which is probably the reason we go to git in the first place, can be done with a rather simple key-value CRDT, if you are willing to accept a size complexity of O(devices per user * slugs per sphere).

:memo: :+1:

I really don't understand the name system, and would encourage you to instead have a good old BEP 44 record saying "find my root at this gateway". That way, users need to check the DHT very infrequently, updating the root of the sphere can be as fast and frequent as the gateway can update its internal database, and requests can be as fast as the ping time to that gateway.

We maintain the assumption that gateways are interchangeable and quite possibly ephemeral. Eventually, gateways in our managed infrastructure will not be long-running processes at all. Therefore, we do not ask that any user interact with a gateway just to discover a peer's name record.

That said, the name system as displayed in the diagram really is just intended to be a distribution system. The name records are self-verifying, and may be distributed by any means. As an example of this, in Subconscious (our app) a user may adopt a name record directly from another user's address book (by way of tapping follow on one of their peer's follows). This leaves open the possibility for distribution of name records by other means, including (but not limited to) a scheme similar to what you are describing, so perhaps we'll end up there eventually. We also want to explore distribution of name records via DNS as well as name system-like smart contracts such as ENS.
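The "self-verifying, distributable by any means" property can be sketched as follows, with HMAC standing in for the real public-key signature (e.g. an Ed25519-signed UCAN) so the example needs only the standard library; all names are hypothetical:

```python
import hashlib
import hmac
import json

def sign_record(key, sphere_id, root_cid, seq):
    """Build a name record that carries its own proof of authenticity.

    A real system would use a public-key signature; HMAC stands in here
    so the sketch runs without third-party crypto libraries.
    """
    payload = json.dumps(
        {"sphere": sphere_id, "root": root_cid, "seq": seq},
        sort_keys=True,
    ).encode()
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode(), "sig": sig}

def verify_record(key, record):
    """Any recipient can check the record, regardless of how it arrived."""
    expected = hmac.new(key, record["payload"].encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])

key = b"demo-key"
rec = sign_record(key, "did:key:alice", "bafy-example-root", seq=7)
assert verify_record(key, rec)  # valid no matter the transport
rec["payload"] = rec["payload"].replace("7", "8")
assert not verify_record(key, rec)  # tampering is detected
```

Because verification depends only on the record itself, distribution can happen via a DHT, an address book, DNS, or a smart contract interchangeably.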

Nuhvi commented 1 year ago

DNS is the undisputed king of discovery. Pkarr was an attempt to just add a censorship-resistant fallback to it, while being quite strongly opinionated that discovery should be done one way and one way only: a DNS lookup to a good old web server.

So I actually think the gateways are good, and p2p gossip should be abandoned for good, as long as users can switch gateways/servers/personal data stores/cloud providers (whatever you call them) on a whim, just like I can switch my DNS records to move from Netlify to self-hosting without needing any permission from Netlify whatsoever.

Keep up the good work. I will try to share what I've got if I manage to build a demo of the key-value CRDT I mentioned above.

cdata commented 1 year ago

DNS is the undisputed king of discovery. Pkarr was an attempt to just add a censorship-resistant fallback to it

Our motivations here are very similar, and I'm still interested in playing with Pkarr as part of a distribution scheme (maybe paired with a resolver that knows to look for UCANs in IPFS so that records in Pkarr can be kept within an acceptable size).

So I actually think the gateways are good, and p2p gossip should be abandoned for good.

Yah, a P2P scheme has a lot to prove IMO and probably shouldn't be the only option for users. Although DNS is ubiquitous, it has some well-known downsides as well. That is why we have designed for distribution to be its own layer: it gives us the flexibility to experiment and implement as many strategies as we want, in whatever order of precedence we prefer.

Thanks for the feedback, and hopefully you won't mind if I bug you about Pkarr in other channels :sweat_smile:

Nuhvi commented 1 year ago

@cdata That is, if I don't bug you first to review my rewrite in Rust :)