ssbc / ssb-db

A database of unforgeable append-only feeds, optimized for efficient replication in peer-to-peer protocols
https://scuttlebot.io/
MIT License

simplest way to handle attachments? #9

Closed dominictarr closed 9 years ago

dominictarr commented 10 years ago

I'm trying to figure out the simplest way to handle attachments.

thinking out loud here

I originally thought that the answer was to put them through as a part of the replication protocol.

Then I thought: hey, maybe we could cut around that and make something simpler by (at least in the early versions) just using http, and requesting or posting hashes.

But then I realized that it depends on where you expect those hashes to be.

Okay, it generally helps to talk about a concrete usecase.

So: oakdb. oakdb is a secure database on top of which you might implement a package manager. There are two operations:

1) give a hash a path (you sign a record saying you name hash X)
2) link a path to another user's path (you sign a record saying your name for another user's path is Y)

Think of the second operation as a symlink that is also a certificate.

to get to the hashes that I have named you could use a path like this:

MY_HASH/foo/1.0.0, which would point to the correct hash for foo@1.0.0

These hashes are probably tarballs, of course. You could use this with a central registry. The central registry could just be a "meeting place", or it could be a root namespace, maybe so that I can leave off the hash of the registry's key: instead of REGISTRY_HASH/foo/1.0.0 I could just say /foo/1.0.0. I would just have to ask the registry to bless my foo module in some way.
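The two operations and the path lookup above can be sketched like this — a minimal in-memory model where record shapes and function names are illustrative, not a real oakdb API:

```javascript
const records = new Map() // 'OWNER_HASH/path' -> record

// operation 1: give a hash a path (a signed "I name hash X" record)
function publishName (owner, path, hash) {
  records.set(owner + '/' + path, { type: 'name', hash })
}

// operation 2: link a path to another user's path (a symlink-as-certificate)
function publishLink (owner, path, targetOwner, targetPath) {
  records.set(owner + '/' + path, { type: 'link', target: targetOwner + '/' + targetPath })
}

// resolve e.g. 'MY_HASH/foo/1.0.0' by following links until a name record
// yields the content hash; the hop limit guards against link cycles
function resolve (key, maxHops = 16) {
  for (let i = 0; i < maxHops; i++) {
    const rec = records.get(key)
    if (!rec) return null
    if (rec.type === 'name') return rec.hash
    key = rec.target
  }
  return null
}

// a registry names foo@1.0.0; I certify my own link to the registry's path
publishName('REGISTRY_HASH', 'foo/1.0.0', 'TARBALL_HASH')
publishLink('MY_HASH', 'foo/1.0.0', 'REGISTRY_HASH', 'foo/1.0.0')
console.log(resolve('MY_HASH/foo/1.0.0')) // 'TARBALL_HASH'
```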

Now, when I publish a module, I'd need to push the tarball as well. In couchdb you publish by pushing to the registry, but in oakdb we have a local replica, so we just put the message and tarball in that and then replicate with the registry.

Or maybe we just have a tarball store beside the registry, and on publish make sure that the tarball store has the tarballs we have published? If we push the tarballs in order we'd just have to remember the last tarball we sent it.
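If tarballs really are pushed in publish order, the sync state between a node and the tarball store collapses to one number. A sketch of that cursor idea (names are illustrative):

```javascript
// published: [{ seq, hash }] in publish order; lastSentSeq is the cursor the
// tarball store remembers — everything after it still needs to be pushed
function tarballsToPush (published, lastSentSeq) {
  return published.filter(t => t.seq > lastSentSeq)
}

const published = [
  { seq: 1, hash: 'aaa' },
  { seq: 2, hash: 'bbb' },
  { seq: 3, hash: 'ccc' }
]
console.log(tarballsToPush(published, 1)) // the tarballs with seq 2 and 3
```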

Suppose we did have a central registry that gave us certificates for module names? we should probably only put the certified modules in the tarball store. it's possible to also have a tarball that isn't published to the registry - say a fork. So which tarballs belong where?

is the answer to traverse the links and put the tarballs which are reachable along a path into your replica?

Okay, so let's say we can calculate whether we want a given tarball. Can we also calculate whether another node should want a given tarball, or do they have to tell us?

If they tell us, do they need to know that we may have their tarball? Say, we have either published it, or linked to it?

Suppose each app indexes who published/reshared what and then used that info to figure out where to get something?

dominictarr commented 10 years ago

maybe the answer is for a node to announce all the tarballs they have in their feed?

currently I have 2344 tarballs in my npmd cache. If they were all 32-byte hashes that would be about 75k. If a message is 1024 bytes long, it holds 32 32-byte hashes, which makes about 74 messages. This is probably a few months' worth... I guess this is okay.
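The estimate, worked out (assuming 32-byte hashes and 1024-byte messages):

```javascript
const tarballs = 2344
const hashSize = 32                                  // bytes per hash
const msgSize = 1024                                 // bytes per message
const totalBytes = tarballs * hashSize               // 75008 bytes ≈ 75k
const hashesPerMsg = Math.floor(msgSize / hashSize)  // 32 hashes fit per message
const messages = Math.ceil(tarballs / hashesPerMsg)  // 74 messages
console.log(totalBytes, hashesPerMsg, messages)
```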

Another way to do it is to exchange a want-list (a list of the hashes you want) with a replicating node. The size of the list could be configured, so that a node does not overload another node. (Ultimately things would need to drop connections etc. if a node misbehaves, like in bittorrent.)
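A minimal sketch of that capped want-list exchange (illustrative names, not the ssb-db API): one side sends at most `maxWant` hashes so it cannot overload its peer, and the peer answers with whichever of those blobs it actually has.

```javascript
// cap the outgoing want-list so a node cannot overload its peer
function makeWantList (wanted, maxWant) {
  return Array.from(wanted).slice(0, maxWant)
}

// on receiving a peer's want-list, answer only with the blobs we hold
function answerWants (store, wantList) {
  return wantList.filter(hash => store.has(hash))
}

const wanted = new Set(['h1', 'h2', 'h3', 'h4'])
const peerStore = new Map([['h2', 'blob2'], ['h4', 'blob4']])
const wants = makeWantList(wanted, 3)      // ['h1', 'h2', 'h3']
console.log(answerWants(peerStore, wants)) // ['h2']
```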

dominictarr commented 10 years ago

this all seems to come back to a want list. and maybe allow apps to hint where to get each package from.

dominictarr commented 10 years ago

That is basically what bittorrent has - a DHT that maps infohash -> ip addresses. In this model we'd map object hashes to pubkey hashes. then the node can exchange those objects when replicating.

hmm.

maybe I'm over thinking it.

what if the answer is: when replicating, feed all messages through apps for indexing. If an app sees a hash that it wants, it requests it immediately.

If an app wants something while replication is underway, a message is sent to the replica, which will then send the object if it has it.

Sometimes the connection will fail, and the replica will not receive the object. It needs a way to retry. Maybe it just adds it to the want-list, which it requests from every node?
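A sketch of that retry loop (illustrative names): objects that failed to arrive land on a want-list that is retried against every node we replicate with, and an entry is dropped once the object is received.

```javascript
const wantList = new Set()

// called when a request fails or an object never arrives
function onMissing (hash) {
  wantList.add(hash)
}

// called once per replication session, with the peer's blob store;
// anything the peer has is copied locally and removed from the want-list
function retryWants (peerStore, localStore) {
  for (const hash of Array.from(wantList)) {
    if (peerStore.has(hash)) {
      localStore.set(hash, peerStore.get(hash))
      wantList.delete(hash)
    }
  }
}

onMissing('h1')
onMissing('h2')
const peer = new Map([['h1', 'blob1']])
const local = new Map()
retryWants(peer, local)
console.log(local.has('h1'), wantList.has('h2')) // true true
```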

hmm - thinking about this in the context of a centralized package manager... this would cause problems because the missing objects would build up and then there would be too many things, and it would want to request them all from all the nodes.

Ah, but this isn't really a problem if you DONT have a central registry!

Hmm... so if you did have a decentralized package manager you would probably want messages that you installed or have "starred" something... which would obviously mean you use that thing and so other people could get that from you.

maybe if you just index all the pubkey -> hash pairs, "jim mentioned this file" then that would work pretty well to find seeds for files. Maybe most apps would have a reasonable way to do this?
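That "jim mentioned this file" index could look something like this — a sketch with an assumed message shape (`author`, `mentions`), where the authors who mentioned a hash double as likely seeds for it:

```javascript
const mentions = new Map() // hash -> Set of author pubkeys

// while replicating, record every (author, hash) pair we see
function indexMessage (msg) {
  for (const hash of msg.mentions || []) {
    if (!mentions.has(hash)) mentions.set(hash, new Set())
    mentions.get(hash).add(msg.author)
  }
}

// anyone who mentioned a file probably uses it, so ask them first
function likelySeeds (hash) {
  return Array.from(mentions.get(hash) || [])
}

indexMessage({ author: 'jim', mentions: ['file1'] })
indexMessage({ author: 'ann', mentions: ['file1', 'file2'] })
console.log(likelySeeds('file1')) // ['jim', 'ann']
```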

dominictarr commented 10 years ago

okay so the background for this has been much simplified by https://github.com/dominictarr/secure-scuttlebutt/pull/25

when the server sees a {$ext: hash} it decides if it wants to replicate that, and requests it when replicating.

We could just use an http side channel - you can request a blob from a server: GET HOST/{hash}. But a pub server would need to send a request message, and then the client would need to do POST HOST/{hash}... that is ugly, and to be honest I'd much rather have file replication multiplexed within the replication protocol...
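A sketch of scanning a message for those references — only the `{$ext: hash}` shape comes from the PR above; the traversal itself is illustrative. The server could feed the collected hashes into its replication requests:

```javascript
// walk a message's content and collect every {$ext: hash} reference
function extractExtRefs (value, out = []) {
  if (Array.isArray(value)) {
    for (const v of value) extractExtRefs(v, out)
  } else if (value && typeof value === 'object') {
    if (typeof value.$ext === 'string') out.push(value.$ext)
    else for (const v of Object.values(value)) extractExtRefs(v, out)
  }
  return out
}

const msg = {
  text: 'see attachment',
  file: { $ext: 'HASH_A' },
  more: [{ $ext: 'HASH_B' }]
}
console.log(extractExtRefs(msg)) // ['HASH_A', 'HASH_B']
```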

dominictarr commented 9 years ago

implemented.