holepunchto / hyperdrive

Hyperdrive is a secure, real-time distributed file system
Apache License 2.0

State of multiwriter? #230

Closed. aral closed this issue 1 year ago.

aral commented 5 years ago

I was just wondering what the state of multiwriter support is in Hyperdrive currently. I see that the last work on multiwriter was around four months ago.

Is it still within the near-term roadmap to integrate multiwriter into mainline hyperdrive and DAT CLI?

RangerMauve commented 5 years ago

I think @jimpick has been working on it in his spare time. He's got a fork with the latest and greatest changes somewhere.

pfrazee commented 5 years ago

It is still being worked on! @mafintosh has been steadily finalizing it while also working on the new discovery code. Jim has also been building out those test demos, as Ranger says.

Frando commented 5 years ago

It would be great if the changes by @jimpick could be pulled in. I didn't go through all of them; @jimpick, would you open a PR here? Or is there still experimental stuff in your branch?

It had some fixes I needed, so I based some patches on it. I could also try to rebase those onto @mafintosh's branch, but I'd rather not unless it's necessary.

jimpick commented 5 years ago

It's still pretty experimental... the last month has been very busy with travel and holidays, so apologies if it all seems stalled.

There are a lot of hacks in there to get it to work as much as it does, so it's not really production quality.

I've noticed that with multiwriter, when things don't work, it often results in deleted files, so it's probably not a good idea to just release the changes "into the wild" without a lot of review and discussion.

I gave a talk in Tokyo where I showed the Dat multiwriter command line interacting with a version of dat-shopping-list (with updated dependencies). I had to make a few quick fixes for that to work... the latest commit is here:

https://github.com/jimpick/hyperdrive/commit/a2648fbdef84041cc78b7d1efe5c92c407895242

Here are the slides from my talk:

https://tokyo-dat.jimpick.com/

If I can find some time in the next month, I really should write that up into a blog post.

I had previously talked with @joehand about possibly releasing the pre-release version using his dat-next npm package.

aral commented 5 years ago

@jimpick What are your thoughts on multiwriter vs multi-feed?

(I see them as complementary: multiwriter for the use case of a person authorising additional devices for their master feed, and multi-feed for “conversations” between different people.)

Would be great to perhaps get a small working group together to scope this out (e.g., lack of deauthorisation is an issue… one that could potentially be worked around in app implementations via a combination of a password-derived keypair and logout messages for specific devices.)

I’d love to be involved with this as it’s central to what I’m building.

CC @noffle

jimpick commented 5 years ago

Multifeed is really useful when you want to replicate multiple feeds over a single replication stream. I actually built something very similar inside hypermerge (I was inspired by what hyperdb did internally). Hyperdb (and multiwriter) could probably even use it internally to manage the hypercores it replicates...
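The core idea behind multifeed (many single-writer append-only logs combined into one view) can be illustrated with plain JavaScript. This is a conceptual sketch, not the real multifeed API; `createFeed` and `mergedView` are made-up names.

```javascript
// Each "feed" is an append-only log owned by exactly one writer,
// mirroring hypercore's single-writer model.
function createFeed(author) {
  const entries = [];
  return {
    author,
    append(value) { entries.push({ author, seq: entries.length, value }); },
    entries: () => entries.slice(),
  };
}

// Merge all feeds into one view, ordered by (seq, author) so every
// peer holding the same feeds derives the same ordering.
function mergedView(feeds) {
  return feeds
    .flatMap((f) => f.entries())
    .sort((a, b) => a.seq - b.seq || a.author.localeCompare(b.author));
}

const alice = createFeed('alice');
const bob = createFeed('bob');
alice.append('hello');
bob.append('hi');
alice.append('how are you?');

const view = mergedView([alice, bob]);
// view interleaves both writers' entries deterministically
```

What multifeed adds on top of this picture is the networking: discovering which feeds the other side has and replicating all of them over a single stream.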

okdistribute commented 5 years ago

Hey all, @mafintosh and @andrewosh just put out a new hyperdrive release candidate, 11 days ago: https://github.com/mafintosh/hyperdrive/blob/3877eec49e6c0244b0df62d57cbc9e57b4d9224e/README.md

This is the first step towards multiwriter.

Time to try it out and give feedback and improve it! Really excited to see this roll out this year.

aral commented 5 years ago

This is so exciting; can’t wait to try it out (just working to finalise Indie Web Server, one of the core components of Hypha so the timing is perfect). Thank you for the heads up, @karissa :)

100ideas commented 5 years ago

@aral last year @jimpick experimentally patched hyperdrive to support hyperdb + multiwriter storage. He published it as a package on npm, @jimpick/hyperdrive-next. I think the patched source code is here: https://github.com/jimpick/hyperdrive/tree/multiwriter-staging

I was curious about the potential to use dat + hyperdrive to create bidirectional filesystem syncing between a server and a web app, so I made a little prototype: https://github.com/100ideas/hypercore-protocol-proxy-test

It's outdated by a year, but it does show how to set up duplex replication + syncing between a dat archive on a server and a dat archive in the browser over a private websocket connection.

Also, while looking up some of these links, I came across @karissa's peerfs experiment, which was last updated May 2019 and is described as "multiwriter peer-to-peer filesystem, built on kappa-core and hyperdrive". It looks like it uses more recent (kappa) approaches to setting up multi-hyperdrive: https://github.com/karissa/peerfs

Lastly, these two tutorials show how to set up private key exchange for multiwriter situations and explain the kappa architecture a bit better. (I haven't been paying attention to dat for the last year, so I found them informative; apologies if they're already well known.)

aral commented 5 years ago

Thanks @100ideas :) I’m still wondering if there’s a definitive timeline for multiwriter support in DAT. I don’t want to try and implement my own version only to find that it’s incompatible with DAT in the future. I know @mafintosh stated he was working on this in March. I’m waiting on this and working on other bits of my system in the meanwhile (mostly seamless deployment of the untrusted always-on web node).

RangerMauve commented 5 years ago

From what I understand, @mafintosh and hyperdivision got a contract to work on multiwriter by December. And it looks like their current path towards that is called "union mounts". Before that though, they'll be working on a system that enables you to share a secret key between devices with a third party service for coordinating writes.

alterx commented 4 years ago

Good afternoon folks,

I've been looking into fully decentralized options for building applications (I've also looked into textile.io, GunDB, and SSB), and so far Dat seems pretty close to what I need. I've been following its development for a long time, and multi-writer is something I think will give Dat the upper hand over other technologies such as SSB.

With that being said, I've found several packages that claim to achieve multi-writer capabilities on top of Dat, specifically kappa-drive (previously known as peerfs) and hyperdb. While these seem like viable architectures, it's very difficult to know whether they're going to be part of the "main" Dat "stack". It feels like there's little information about the state of multi-writer, what the "official" solution is (if any), etc. I understand that this is a project with limited resources (both time and money), but I can't shake the feeling that communication around this feature (which I feel is crucial to driving mainstream adoption) hasn't been great, leading to some confusion around it.

hackergrrl commented 4 years ago

@alterx My impression has been that "multiwriter" has a broad solution space, depending on your needs:

There have been lots of different ways to build on top of hypercore, the base append-only log abstraction. They are all legitimate solutions to the multiwriter problem, each with their own communities. "Official" is kind of a weird notion, since whatever direction, e.g., hyperdrive goes in, the other modules that don't depend on hyperdrive will continue to exist and grow too, whether they bear the Dat logo or not.

Short answer: pick a solution that meets your needs! They won't vanish or stop being useful even after hyperdrive picks a multiwriter approach. :)

alterx commented 4 years ago

@noffle I still think I'll wait for whatever hyperdrive ends up doing (it feels like it will probably be flexible enough to support most of the use cases you mentioned). Your answer really clears things up for me though :), thanks!

mriise commented 4 years ago

It really would be preferable to wait for the hyper protocol to ship an implementation that allows multiple writers, each signing their own entries, so that you can track that "publicKey made X change". In reality, I'm not sure whether that's actually part of the plan… Either way, I think it's worthwhile to wait and see what they do.

okdistribute commented 4 years ago

@mriise check out https://github.com/kappa-db

mriise commented 4 years ago

@okdistribute thanks! I had looked at it earlier and it seemed a bit too much for what I had in mind. A simpler solution I came across (in one of @pfrazee 's templates) is to just use drive mounting to aggregate user data (and it allows users to own their data).

jalcine commented 3 years ago

Bumping this because I have an interest in using Hyperdrive as a storage layer for my project's Micropub server - all IndieWeb technologies.

I have seen https://github.com/kappa-db/multifeed, but I'd like to see something a bit more supported - especially since https://github.com/kappa-db/multifeed/issues/19 doesn't seem to be a feature yet, and that would be critical for this.

(Originally published at: https://v2.jacky.wtf/post/df01b773-02c7-48e1-8c60-1466d0577a89)

RangerMauve commented 3 years ago

@jalcine One thing you could play around with is multi-hyperdrive, which lets you stack a bunch of hyperdrives on top of each other. It's a lot less opinionated than the multifeed stuff and works well with sparse replication of metadata (better for huge datasets).

With it you can leave it to your application to figure out where the drives come from and to load them up together.

If you want something a bit higher level, there's co-hyperdrive which builds on top of multi-hyperdrive and stores references to writers within the hyperdrive itself / auto-loads them.

This lets your application decide when to add another writer (can only be done by a peer that's already a writer) and is an eventually consistent system.

These modules work best when peers are uploading different files that are unlikely to conflict, or are changing the same file slowly over time. It's not a good building block for multi-user collaborative editing.

This is being used in the Natakanu project and is working pretty well for multi-user file sharing.
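The union-style layering that multi-hyperdrive provides can be pictured as: reads check each underlying drive in priority order, and listings take the union of all paths. This is an illustrative sketch with made-up names (`stackDrives`, plain `Map`-backed drives), not the actual multi-hyperdrive API.

```javascript
// Conceptual sketch: a "stacked" drive presents several writers'
// drives as one filesystem.
function stackDrives(...drives) {
  return {
    readFile(path) {
      // First drive that has the path wins, so ordering sets priority.
      for (const drive of drives) {
        if (drive.files.has(path)) return drive.files.get(path);
      }
      throw new Error(`not found: ${path}`);
    },
    list() {
      // Union of all paths across every layered drive.
      const seen = new Set();
      for (const drive of drives) {
        for (const p of drive.files.keys()) seen.add(p);
      }
      return [...seen].sort();
    },
  };
}

// Toy in-memory stand-ins for two peers' hyperdrives.
const mine = { files: new Map([['notes.txt', 'my notes']]) };
const theirs = { files: new Map([['notes.txt', 'their notes'], ['pic.png', '...']]) };

const combined = stackDrives(mine, theirs);
// combined.readFile('notes.txt') resolves from the higher-priority drive
```

co-hyperdrive then layers on persistence of the writer list inside the drive itself, which is what makes the system eventually consistent as writers are added.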

jalcine commented 3 years ago

Oh, this is sick. Thank you @RangerMauve for introducing this to me!

(Originally published at: https://v2.jacky.wtf/post/ab491578-0717-43e9-abf3-4fffa02309ba)

raphael10-collab commented 2 years ago

@RangerMauve Which of the available hyper modules would you suggest for multi-user collaborative editing?

RangerMauve commented 2 years ago

@raphael10-collab Depends on what your data model is like. You might want to look into other things like y.js. Or you could use something like hypermerge, which uses CRDTs to represent data.

One worry I have is that a lot of the stuff building on top of hypercore for collaborative editing doesn't use sparse data structures so as your history gets longer it'll get way slower, especially if new users join in. 🤷
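For intuition on why CRDT-based approaches such as hypermerge converge, here is a toy last-writer-wins register. Real systems (Automerge, Yjs) use far richer CRDTs; the function names here are illustrative only.

```javascript
// A minimal last-writer-wins (LWW) register: the simplest CRDT.
function lwwRegister(value, timestamp, actor) {
  return { value, timestamp, actor };
}

// Merge is commutative and deterministic: higher timestamp wins,
// and the actor id breaks ties, so peers converge regardless of
// the order in which they receive each other's updates.
function merge(a, b) {
  if (a.timestamp !== b.timestamp) return a.timestamp > b.timestamp ? a : b;
  return a.actor > b.actor ? a : b;
}

const fromAlice = lwwRegister('draft v2', 2, 'alice');
const fromBob = lwwRegister('draft v1', 1, 'bob');

// Both merge orders yield the same state: the convergence property.
const merged = merge(fromAlice, fromBob);
```

The sparse-data-structure concern above is orthogonal to this merge rule: it is about whether a new peer must download and replay the entire history before it can compute the current state.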

raphael10-collab commented 2 years ago

@RangerMauve Is your worry about stuff built on top of hypercore not using sparse data structures (so that as the history gets longer it gets way slower, especially when new users join) the same performance-degradation-over-time issue described for the pushpin app, which uses hypermerge?

https://github.com/automerge/pushpin/blob/master/WARNINGS.md#performance-may-degrade-over-time

Does this problem also occur with y.js? I'm trying to grasp the "internals" of y.js, but as far as I can tell I don't see any mention of sparse data structures: https://github.com/yjs/yjs/blob/main/INTERNALS.md If so, which CRDT implementation, in your view, uses sparse data structures and could therefore mitigate this important problem? https://crdt.tech/implementations

Or might the combination of hypercore + CRDT be the source of this performance degradation over time?

https://www.kn8.lt/blog/building-privacy-focused-collaborative-software/ :

"To create a multi user system you’d have to combine hypercore feeds from multiple people to reconstruct a unified view. Imagine a hypercore powered p2p Twitter. Everyone would post updates to their own hypercore feed, and you could then download feeds of all the people you follow from the p2p network and combine them into a unified feed view. Conceptually, this is in essence how Scuttlebutt works.

This same technique could work well in our Google Docs alternative. First, for a given document every collaborator would append CRDT updates to their personal per document feed. Then, every collaborator would download the personal feeds of other collaborators and keep them in sync (both in real-time or asynchronously, hypercore is capable of both). The CRDTs would then get combined to reconstruct a consistent view of the document. Additionally, in a p2p setting, hypercore’s merkle tree signatures would be useful for verifying that updates were not faked. We could make sure that someone with read only access to a document is not able to impersonate other users and fabricate updates. This while still allowing any peer to seed the full data set to any other member of the workspace.

This all sounds almost too good to be true. And unfortunately that might well be the case. As mentioned previously, while CRDTs are conceptually an append only system, to make the storage efficient it is important to apply compacting techniques as done in Yjs. And hypercore feeds are immutable append only logs. This makes it difficult to reconcile the ideas behind hypercore and efficient implementations of CRDTs. If you write every individual CRDT update to a hypercore feed, such feeds will grow in size too quickly."

@RangerMauve how can hypercore and Yjs, or any other CRDT implementation, be made to work together without performance worsening over time?