multi-writer for single user on multi-devices and clouds

urbien commented 4 years ago

the implementation for this issue was released as multi-hyperbee

Problem

Hypercore is a single-writer system. In a multi-device scenario (e.g. a mobile and a PC) we need to find a way to present one [virtual] Hypercore across all devices. We should include into our devices also personal Cloud peers to take advantage of their reliability.

Existing approaches in Hypercore community

Materialize merged Hypercores into one (Kappa-DB does that). Drawbacks: 1.1. duplication of data (confirm the extend of it?) 1.2. loses sparse access 1.3. loses the authenticity of individual records (confirm?) 1.4. loses performance
Master feed + deltas. It is developed by @rangermauve for Multi-Hyperdrive. The algorithm is specific to Hyperdrive, but can be extended to a degree to Hyperbee and Hypercore. Extending it to Hypertrie is easy as Hyperdrive is built in Hypeptrie and has the same access semantics (minus the file contents).

Evaluation of Multi-hyperdrive

peer1: me r2 r3   r2 = hyperdrive(peer2-key, sparse)
peer2: r1 me r3
peer3: r1 r2 me

How it works

Each peer creates it’s own hyperdrive and replicate sparsely peers’ hyperdrives
Each peer sets top-level watches on replicas
Upon file change on other peer, a notification is pushed to watching peers
Received notification has only metadata of the changed file
Watcher decides what to do with this info. The data must be fetched from remote peer IF we want to uphold offline-first. Alternatively, if we are willing to sacrifice offline-first somewhat, we get fetch the data from remote peer on-demand, when file is requested locally.
The next request for the fetched file will look through its own hyperdrive and all the replicas and return the latest version of the file. Same with readdir, but with a bit more work.

Pro

Replication model: Leaderless multi-master, all peers are equal
Offline-first: each peer has full set of data locally, if it does pre-fetch on watch()
Storage: no duplication of data for new files created on different peers
Network. Fan-out on metadata change, but largely insignificant
Backup is straightforward, as multi-hyperdrive automated walk-through data changes stored in different feeds

Con

Network: painful fan-out on a data change. If pre-fetch on watch event is used it number of peers in the swarm * data-size. If pre-fetch is not done, the file still can be requested on more than one peer, and it will be uploaded more then once. Besides, the mobile may not be online to provide the data (waking up mobile app in background mode is possible but it has limitations).
Storage: A file edited on a peer duplicates the storage for this file on every peer. But it is the same cost as in a single-writer, since Hyperdrive today does yet not support block-level file modification. When Hyperdrive fixes this issue, multi-hyperdrive design will need to be adjusted.
Performance: N file stats per get(), probably not significant
Consistency: can end up with different state of each master (needs CRDT and clocks)
Collaboration: coarse conflict resolution, but can be improved with CRDT and file metadata
TODO: Is tracking changes reliable? can watch events get lost, e.g. when peer dies?

Approach with Personal Cloud nodes

Upload strategy and topology

To optimize replication for Multi-hyperdrive we can distinguish between the capabilities of the peers, taking the following into account:

Cost. Peer could be on a metered network, this is typical for a cellphone network.
Speed. Peer could be on a slow network, unlimited but slow like DSL
Latency. Peer could be on a network with high latency, like satellite
Availability. Peer is not always on, like the mobile and web app

Discovery of topology

Multi-hyperdrive topology is any-to-any. We need something different here.

Originator of the change can discover the capabilities of the Peers in the swarm (a separate project that would utilize small DHT storage, already used by Bitfinex), and adjust replication strategy in the following ways:

Change is always uploaded in chunks to multiple peers, but for mobile this is not good. It it better to upload to Cloud peers, as opposed to the iPad for example, as iPad may, and will, go to sleep any moment, and won't support stable content propagation.
Even more importantly, each chunk should be uploaded just once (not what happens today), to preserve bandwidth and costs while on the cell phone networks. In other words, Cloud peers should prefer to replicate from each other, not from mobiles. See the discussion on that.
Exception is AirDrop-like network scenarios, when devices replicate directly on local network, and that want to avoid cloud altogether

Merge strategy

Hyperbee and Hypercore need different merge strategy from Hyperdrive. Multi-hyperdrive does not materialize merges. But for Hyperbee especially, this could be unavoidable. Data can be fetched on watch and immediately copied to local feed, thus allowing searches. Hypertrie may continue to use the same strategy as Hyperdrive.

Now, how do Cloud peers achieve consensus?

Consensus research

Simple and fast Consensus on Cloud Peers. Because Cloud peers are always available and are on a fast network, consensus algorithm can be simpler and greatly reduce the probability of inconsistency. Time in Cloud can be managed well, further simplifying consensus algorithm. We can start with Cloud Peers in the same data center, but on different machines, and even different racks, and maybe different zones in the same data center for power isolation. Then we can develop an equivalent to AWS availability zones with a more complex consensus algorithm.
Which consensus algorithm? Consensus research for Databases has been supercharged with the advent of Blockchains. EOS blockchain demonstrated that if we assume all peers are on a fast reliable network with low latency, a much simpler consensus algorithm becomes possible and it converges 2-3 orders of magnitude faster (EOS can do 3000 transactions per second). 1.1. Non-Byzantine algorithms used in databases are Paxos and RAFT. 1.1 PBFT is quite mature, supports Byzantine faults, but requires (n-1)/3 nodes (so minimum 7 nodes?) and has difficult leader selection. 1.1 Tendermint improves by rotating the leader every round, and skips non-responding leader automatically (how many peers minimum?)

Leaderless non-Byzantine consensus

We set out to support multi-device and team-collaboration scenarios. Most changes are expected from personal devices. Later, Cloud App will also generate new data as well. We will limit by the data type what changes Cloud peers can initiate so that they do not clash with a single-writer model.

If we design do non-Byzantine faults we can be make use of a new approach for leaderless multi-master, used by AWS DymamoDB and Azure Cosmos. It is based CRDT innovation that occurred in the last 5 years.

CRDT merge

Merge changes into master with the help of CRDT, used in Redit, as well as Cloud-scale databases AWS DynamoDB and Azure Cosmos. Use yjs or https://github.com/automerge/automerge).

Secure Clock

Use vector / bloom / HLC clocks to resolve conflicts, to achieve 100% the same state on all nodes, eventually :-).

pfrazee commented 4 years ago

Effectively I understand this to be a "hosted hypercore with offline-capable updates from client devices." It's notable as a variation of a simple "hosted hypercore," which is a hypercore that exposes its API to the network to accept writes from clients via RPC.

In your proposed model, each client device maintains a set of deltas (which I might call the "staging cache") which it can sync to the hosting device at any time. Until sync, the client device will read a union of the currently-known master hypercore and its local cache, enabling live writes in an offline context.

A "hosted hypercore" model loses the ability to accept writes in an offline context. This proposal can continue to accept writes when offline, but will not propagate the writes to the network until the client device can sync with the master device. It is a simpler model than masterless multiwriters, but it maintains offline-first abilities.

1.3. loses the authenticity of individual records (confirm?)

Not sure about the other properties but I'm pretty sure each writer uses their own hypercore and so authenticity isn't lost.

The solution is simpler if it is for personal use only. For teams we might create a different solution.

I usually refer to this as "multi-device" vs "multi-author"

Leader rotation & consensus algorithm

What requirements do you have which require leader election and a consensus algorithm? Are you assuming the master would not be able to appoint a new leader, eg due to catastrophic data loss?

urbien commented 4 years ago

@pfrazee thank you for the comments! I will review in full, and react to all, but want to share the correction to the first statement. Our intent is not just a hosted Hypercore. We want Cloud Apps to work there too. So updates will be initiated both in the cloud and on devices. In this way they are symmetric. But they may not need to be symmetric in the way updates are pushed. For example, a 10MB file does not need to be uploaded from a device to all Personal Cloud replicas and to other devices. This would be quite inefficient, and costly on a data plan. So updates could be sent to one of Personal Cloud replicas to be disseminated further.

RangerMauve commented 4 years ago

Great writeup! There's some cool stuff in here. I think you might also want to look into what CoBox made with their Kappa-Drive module which is kinda in between co-hyperdrive and kappa.

Regarding co-hyperdrive/multi-hyperdrive:

I don't think the watch() function actually downloads data unless you do some sort of FS operation like readdir on watching
The consistency is actually decent if peers are online and replicating. Notification of changes will go pretty fast through all the peers. If all peers see the same latest index for all the feeds, then they will resolve to the same view of the "latest file".
That is to say, It's "eventually consistent". vector clocks and the such won't change the consistency guarantees but will help for the cases where two peers are writing at the same time but have clock skew.
I don't think backups are quite as complicated as you're imagining. In order to backup, you send the URL of the co-hyperdrive you wish to download, and it'll automagically be downloading from all the sources at once. This is happening in the Natakanu project with decent success, and there's lots of room for improvement.

I'm excited to see what your new system will look like once it's out, and thanks again for putting these docs together!

pgmemk commented 4 years ago

Added a section on how multi-hyperdrive works.

urbien commented 4 years ago

Thanks @RangerMauve! Will definitely try to grok Kappa's multi-writer algo, might need someone's help in their community. @pgmemk and I have already played with multi-feed, and will explore KappaDB.

I don't think the watch() function actually downloads data unless you do some sort of FS operation like readdir on watching

Right, I was hypothesizing that if you want to uphold the offline-first principle, you would download upon getting a notification in Watch. Guess you do not do that after all :-)

The consistency is actually decent if peers are online and replicating. Notification of changes will go pretty fast through all the peers. If all peers see the same latest index for all the feeds, then they will resolve to the same view of the "latest file".

Agree. But if they are not online, then file stat() will be the time of file download, not time of update. This is tolerable for individuals. iOS Notes are that way. But I am spoiled by Google Docs.

But my biggest concern is different. Even if bad merge decision is made, it is tolerable. What becomes hard to tolerate is replicas ending in different state. They can continue to diverge from there. As we discussed, it would be useful to understand how to achieve multi-writer for Hyperbee, Hypercore, not just Hyperdrive.

I don't think backups are quite as complicated as you're imagining.

Noted, you took care of it! This is great. I will move it from Con to Pro.

urbien commented 4 years ago

@RangerMauve - How does adding new peer work in multi-hyperdrive? Or to be more specific, I can understand it can be done via backup / restore. But can it be done by asking for data from another peer?

RangerMauve commented 4 years ago

@urbien multi-hyperdrive doesn't do anything for adding peers automatically and relies on the addDrive API to be called by the application.

co-hyperdrive adds in the automatic writer tracking by storing writers in the hypertie. When you authorize() another writer key, it get saved to a hidden key inside an existing writers hypertrie. Other writers will detect the change in the hypertrie and then make sure they've loaded the full list of writers / detected removed writers and purged them.

So with that you can be sure that if you authorize() a new writer (provider you're already a writer), anyone else replicating will see the change. I've tested this in the Natakanu app and it's honestly a little magical how it just works.

RangerMauve commented 4 years ago

Regarding the updated cons for co-hyperdrive:

Network: painful fan-out on a data change. If pre-fetch on watch event is used it number of peers in the swarm * data-size. If pre-fetch is not done, the file still can be requested on more than one peer, and it will be uploaded more then once. Besides, the mobile may not be online to provide the data (waking up mobile app in background mode is possible but it has limitations).

This isn't any different than downloading all the data of a single writer. No sure "will be uploaded more than once" would mean. When a peer tries to load a file, it'll see who it's connected to that has the data and fetch individual blocks once between all the peers. It won't load the same block twice. Mobile peers being offline is the same as the single writer scenario and requires backups of some sort for higher resilience in the same way.

Consistency: can end up with different in state of each master (needs CRDT and clocks)

This is only true for when peers are offline and aren't replicating with anybody. This is the same regardless of whether you're using CRDTs or clocks. The only alternative is not allowing reads when you're not replicating.

Collaboration: coarse conflict resolution, but can be improved with CRDT and file metadata

This is partially a limitation of hyperdrive not having coarse random-access reads.

TODO: Is tracking changes reliable? can watch events get lost, e.g. when peer dies?

If a peer is offline and it writes, then there's no way for other peers to get it's new data, I don't think there's any way around it except preventing peers from being able to write if they don't have any other peers.

urbien commented 4 years ago

on the following issue:

TODO: Is tracking changes reliable? can watch events get lost, e.g. when peer dies?

If a peer is offline and it writes, then there's no way for other peers to get it's new data, I don't think there's any way around it except preventing peers from being able to write if they don't have any other peers.

I meant that a watch() was triggered, but right at this moment the process died before we initiated something that would persist it. Will this change be lost or there is a way to receive this watch() again? I know in multi-hyperdrive this is not needed, but in multi-hyperbee that we are designing, we will be applying the change in a peer's clone to a primary hyperbee.

pgmemk commented 4 years ago

I just realized that the merge should be done at the stage when resource is about to be submitted

Some flow example:

User uses 2 devices: phone and laptop

User makes the changes on his laptop.
- Receives an urgent call.
- Closes computer without submitting the changes
User takes his phone with him.
- Some idea strikes.
- He makes changes on the phone to the same resource he started making changes on his laptop.
- Submits the changes
Back home.
- Opens computer.
- Makes some more changes (to already stale resource).
- Submits the changes

We ended up with two versions of the same resource that need merging before submitting. In Tradle we can detect that the version of the resource that is been changed is not the same as the most current one That's where we can initiate the merge

urbien commented 4 years ago

first alpha release of multi-hyperbee is out and passes a bunch of tests. The work continues to:

adding more tests
save / restore related feeds
and to make Dynalite (DynamoDB emulation on LevelUP) work in embedded mode (removing HTTP marshaling)

See the roadmap on multi-hyperbee repo

tradle / why-hypercore